genESD module

The genesd module offers the following functions to detect outliers in a univariate array using the generalized extreme Studentized deviate test:

Function	Description
`genesd()`	Identifies outliers in a data array.
`find_ri()`	Computes the Ri statistics for the generalized ESD test.
`find_critvals()`	Computes the critical values for the generalized ESD test.

araucaria.stats.genesd.genesd(data, r, alpha)

Identifies outliers in a data array.

This function uses the generalized extreme Studentized deviate (ESD) test to detect one or more outliers in univariate data [1].

Parameters

data (ndarray) – Array in which to identify outliers.
r (int) – Maximum number of outliers.
alpha (float) – Significance level for the statistical test.

Return type

Tuple[str, list]

Returns

report – Report of the generalized ESD test.
index – Indices of outliers in the data.

Notes

The identification of outliers considers the following hypothesis test:

\(H_0\): there are no outliers in the data.
\(H_1\): there are up to \(r\) outliers in the data.

The algorithm performs the following operations:

The \(R_i\) test statistics are computed for \(r\) potential outliers, removing the largest potential outlier from the data at each successive calculation of the test statistic.
The \(\lambda_i\) critical values are computed for \(r\) potential outliers, considering a significance level of \(\alpha\) for the t-distribution.
Both sets of values are compared, and the largest \(i\) for which \(R_i > \lambda_i\) is accepted as the number of outliers.
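
The comparison in the last step can be sketched as follows. This is a minimal illustration rather than the library's implementation: `count_outliers` is a hypothetical helper, and the input values are copied from the report of the Rosner example in this section.

```python
def count_outliers(ri, lambdas):
    """Return the largest 1-based index i with ri[i-1] > lambdas[i-1], or 0 if none."""
    n_out = 0
    for i, (r_val, crit) in enumerate(zip(ri, lambdas), start=1):
        if r_val > crit:
            n_out = i  # keep overwriting so that the largest rejecting index wins
    return n_out

# R_i and lambda_i values taken from the example report (Rosner data, r = 5)
ri      = [3.1189, 2.9430, 3.1794, 2.8102, 2.8156]
lambdas = [3.1588, 3.1514, 3.1439, 3.1362, 3.1282]
print(count_outliers(ri, lambdas))  # 3
```

The first two statistics do not exceed their critical values, yet three outliers are declared, because the rule takes the largest index at which \(H_0\) is rejected rather than stopping at the first non-rejection.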

References

[1] Rosner, B. (1983) “Percentage Points for a Generalized ESD Many-Outlier Procedure”, Technometrics, 25(2), pp. 165-172.

Example

>>> # calculating outliers for Rosner data (1983):
>>> from numpy import loadtxt
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import genesd
>>> path  = get_testpath('rosner.dat')
>>> data  = loadtxt(path)
>>> r     = 5
>>> alpha = 0.05
>>> report, index = genesd(data, r, alpha)
>>> print(report)
Generalized ESD test for outliers
  H0: there are no outliers in the data
  H1: there are up to 5 outliers in the data
  Significance level:  alpha = 0.05
  Critical region:  Reject H0 if R_i > lambda_i
=====================================
n outliers  x_i    R_i     lambda_i
=====================================
1           6.01   3.1189  3.1588
2           5.42   2.943   3.1514
3           5.34   3.1794  3.1439    *
4           4.64   2.8102  3.1362
5           -0.25  2.8156  3.1282
=====================================
>>> print(data[index])
[6.01 5.42 5.34]

araucaria.stats.genesd.find_ri(data, r)

Computes the \(R_i\) test statistics for the generalized extreme Studentized deviate (ESD) test.

Parameters

data (ndarray) – Array for which to compute the test statistics.
r (int) – Maximum number of outliers.

Return type

Tuple[float, float]

Returns

Test statistics \(R_i\) for the generalized ESD test.
Values of the data points furthest from the mean at each step.

Notes

The \(R_i\) test statistics are calculated as follows:

\[R_i = \frac{\textrm{max} | x_i - \bar{x}_{n-i+1}| }{s_{n-i+1}} \quad i \in \{1,2, \dots, r \}\]

Where

\(\bar{x}_{n-i+1}\): sample mean of reduced array.
\(s_{n-i+1}\) : sample standard deviation of reduced array.
\(n-i+1\) : number of points in the reduced array.
\(r\) : maximum number of outliers.

After each calculation, the observation that maximizes \(| x_i - \bar{x} |\) is removed, and the next \(R_i\) is computed from the remaining \(n - i + 1\) observations. This procedure is repeated until \(r\) observations have been removed from the array.
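
The removal procedure can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the code of `find_ri`: the helper name `esd_statistics` and the five-point data set are hypothetical, and the sample standard deviation is assumed to use the \(n - 1\) denominator.

```python
from statistics import mean, stdev

def esd_statistics(data, r):
    """Compute r successive ESD statistics and the observation removed at each step.

    Assumes r <= len(data) - 2 so that the standard deviation stays defined.
    """
    values = list(data)              # work on a copy of the input
    ri, xi = [], []
    for _ in range(r):
        center = mean(values)        # mean of the reduced array
        spread = stdev(values)       # sample standard deviation (n - 1 denominator)
        extreme = max(values, key=lambda x: abs(x - center))
        ri.append(abs(extreme - center) / spread)
        xi.append(extreme)
        values.remove(extreme)       # drop the most extreme point before the next step
    return ri, xi

# hypothetical data with one obvious outlier
ri, xi = esd_statistics([1.0, 2.0, 3.0, 4.0, 100.0], r=2)
for r_val, x_val in zip(ri, xi):
    print('%1.3f  %s' % (r_val, x_val))
```

In this toy data set the first removed observation is 100.0, after which the statistic drops sharply because the remaining points are evenly spread.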

Example

>>> # calculating test statistics from Rosner's data (1983):
>>> from numpy import loadtxt
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import find_ri
>>> path  = get_testpath('rosner.dat')
>>> data  = loadtxt(path)
>>> r     = 5
>>> ri, xi = find_ri(data, r)
>>> for val in ri:
...     print('%1.3f' % val)
3.119
2.943
3.179
2.810
2.816

araucaria.stats.genesd.find_critvals(n, r, alpha)

Computes critical values \(\lambda_i\) for the generalized extreme Studentized deviate (ESD) test.

Parameters

n (int) – Number of data points.
r (int) – Maximum number of outliers.
alpha (float) – Significance level for the statistical test.

Return type

list

Returns

Critical values.

Notes

The \(\lambda_i\) values are calculated as follows:

\[\lambda_i = \frac{ (n-i)\ t_{p, n-i-1} }{ \sqrt{(n-i-1+t_{p, n-i-1}^2)(n-i+1)} } \quad i \in \{1,2, \dots, r \}\]

\[p = 1 - \frac{\alpha}{2(n-i+1)}\]

Where

\(n\) : number of points in the array.
\(\alpha\) : significance level.
\(t_{p,v}\) : percent point function of the t-distribution evaluated at probability \(p\) with \(v\) degrees of freedom.
\(r\) : maximum number of outliers.
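
The calculation can be sketched with the standard library alone. This sketch is an assumption-laden illustration, not the code of `find_critvals`: the helpers `t_pdf`, `t_cdf`, `t_ppf` and `critical_values` are hypothetical names, and the percent point function is approximated by Simpson integration plus bisection (in practice `scipy.stats.t.ppf` would likely be the more convenient choice). The formula follows Rosner (1983), with \(n-i-1+t^2\) under the square root.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: 0.5 plus the Simpson integral of the density over [0, x]."""
    h = x / steps                    # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3

def t_ppf(p, df):
    """Approximate percent point function for p > 0.5 by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:         # expand the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def critical_values(n, r, alpha):
    """Critical values lambda_i for i = 1..r (assumed form, after Rosner 1983)."""
    lambdas = []
    for i in range(1, r + 1):
        df = n - i - 1
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_ppf(p, df)
        lambdas.append((n - i) * t / math.sqrt((df + t * t) * (n - i + 1)))
    return lambdas

lambd = critical_values(54, 5, 0.05)
for val in lambd:
    print('%1.3f' % val)
```

With \(n = 54\), \(r = 5\) and \(\alpha = 0.05\), the printed values should agree closely with those listed for `find_critvals` in the example of this section.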

Example

>>> from araucaria.stats import find_critvals
>>> n     = 54    # number of points
>>> r     = 5     # max number of outliers
>>> alpha = 0.05  # significance level
>>> lambd = find_critvals(n, r, alpha)
>>> for val in lambd:
...     print('%1.3f' % val)
3.159
3.151
3.144
3.136
3.128