genESD module

The genesd module offers the following functions to detect outliers in a univariate array using the generalized extreme Studentized deviate test:

Function           Description
genesd()           Identifies outliers in a data array.
find_ri()          Computes the \(R_i\) statistics for the generalized ESD test.
find_critvals()    Computes the critical values for the generalized ESD test.

araucaria.stats.genesd.genesd(data, r, alpha)

Identifies outliers in a data array.

This function uses the generalized extreme Studentized deviate (ESD) test to detect one or more outliers in univariate data [1].

Parameters
  • data (ndarray) – Array in which to identify outliers.

  • r (int) – Maximum number of outliers.

  • alpha (float) – Significance level for the statistical test.

Return type

Tuple[str, list]

Returns

  • report – Report of the generalized ESD test.

  • index – Indices of outliers in the data.

Notes

The identification of outliers considers the following hypothesis test:

  • \(H_0\): there are no outliers in the data.

  • \(H_1\): there are up to \(r\) outliers in the data.

The algorithm performs the following operations:

  1. The \(R_i\) test statistics are computed for \(r\) potential outliers, removing the most extreme remaining observation from the data after each successive calculation of the test statistic.

  2. The \(\lambda_i\) critical values are computed for \(r\) potential outliers from the t-distribution, at a significance level of \(\alpha\).

  3. The two sets of values are compared, and the largest \(i\) for which \(R_i > \lambda_i\) is accepted as the number of outliers.
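
Step 3 can be illustrated with a minimal sketch in plain Python. The helper name count_outliers is hypothetical and is not part of araucaria; the numbers are copied from the report printed in the example for Rosner's data in this section.

```python
# Illustrative helper (hypothetical name, not part of araucaria).
def count_outliers(ri, lambdas):
    """Return the largest i (1-based) with ri[i-1] > lambdas[i-1], or 0 if none."""
    n_out = 0
    for i, (r_val, crit) in enumerate(zip(ri, lambdas), start=1):
        if r_val > crit:
            n_out = i  # keep the largest index that exceeds its critical value
    return n_out

# R_i and lambda_i values copied from the report for Rosner's data
ri      = [3.1189, 2.943, 3.1794, 2.8102, 2.8156]
lambdas = [3.1588, 3.1514, 3.1439, 3.1362, 3.1282]
print(count_outliers(ri, lambdas))  # 3
```

Only the third comparison exceeds its critical value, yet three outliers are accepted: the test looks for the largest such \(i\) rather than stopping at the first comparison that fails.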

References

[1] Rosner, B. (1983) “Percentage Points for a Generalized ESD Many-Outlier Procedure”, Technometrics, 25(2), pp. 165-172.

Example

>>> # calculating outliers for Rosner data (1983):
>>> from numpy import loadtxt
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import genesd
>>> path  = get_testpath('rosner.dat')
>>> data  = loadtxt(path)
>>> r     = 5
>>> alpha = 0.05
>>> report, index = genesd(data, r, alpha)
>>> print(report)
Generalized ESD test for outliers
  H0: there are no outliers in the data
  H1: there are up to 5 outliers in the data
  Significance level:  alpha = 0.05
  Critical region:  Reject H0 if R_i > lambda_i
=====================================
n outliers  x_i    R_i     lambda_i
=====================================
1           6.01   3.1189  3.1588
2           5.42   2.943   3.1514
3           5.34   3.1794  3.1439    *
4           4.64   2.8102  3.1362
5           -0.25  2.8156  3.1282
=====================================
>>> print(data[index])
[6.01 5.42 5.34]

araucaria.stats.genesd.find_ri(data, r)

Computes the \(R_i\) test statistics for the generalized extreme Studentized deviate (ESD) test.

Parameters
  • data (ndarray) – Array from which to compute the test statistics.

  • r (int) – Maximum number of outliers.

Return type

Tuple[float, float]

Returns

  • Test statistics \(R_i\) for the generalized ESD test.

  • Values of the data points furthest from the mean at each step.

Notes

The \(R_i\) test statistics are calculated as follows:

\[R_i = \frac{\textrm{max} | x_i - \bar{x}_{n-i+1}| }{s_{n-i+1}} \quad i \in \{1,2, \dots, r \}\]

Where

  • \(\bar{x}_{n-i+1}\): sample mean of reduced array.

  • \(s_{n-i+1}\) : sample standard deviation of reduced array.

  • \(n-i+1\) : number of points in the reduced array.

  • \(r\) : maximum number of outliers.

After each calculation, the observation that maximizes \(| x_i - \bar{x} |\) is removed, so that each \(R_i\) is computed from \(n - i + 1\) observations. This procedure is repeated until \(r\) observations have been removed from the array.
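
The procedure above can be sketched with NumPy as follows. This is a minimal illustration and not the araucaria implementation: the name ri_sketch is hypothetical, and the use of ddof=1 for the sample standard deviation is an assumption.

```python
import numpy as np

def ri_sketch(data, r):
    """Hypothetical helper: R_i statistics and the value removed at each step."""
    x = np.asarray(data, dtype=float)
    ri, xi = [], []
    for _ in range(r):
        dev = np.abs(x - x.mean())         # absolute deviations from the current mean
        j = int(np.argmax(dev))            # most extreme point (first one on ties)
        ri.append(dev[j] / x.std(ddof=1))  # ddof=1: sample standard deviation (assumed)
        xi.append(x[j])
        x = np.delete(x, j)                # reduce the array before the next step
    return ri, xi

ri, xi = ri_sketch([1.0, 2.0, 3.0, 4.0, 100.0], r=2)
print(xi[0])  # 100.0
```

The mean and standard deviation are recomputed on the reduced array at every step, which is what distinguishes this from ranking the deviations of the full array once.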

Example

>>> # calculating test statistics from Rosner's data (1983):
>>> from numpy import loadtxt
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import find_ri
>>> path  = get_testpath('rosner.dat')
>>> data  = loadtxt(path)
>>> r     = 5
>>> ri,xi = find_ri(data,r)
>>> for val in ri:
...     print('%1.3f' % val)
3.119
2.943
3.179
2.810
2.816

araucaria.stats.genesd.find_critvals(n, r, alpha)

Computes critical values \(\lambda_i\) for the generalized extreme Studentized deviate (ESD) test.

Parameters
  • n (int) – Number of data points.

  • r (int) – Maximum number of outliers.

  • alpha (float) – Significance level for the statistical test.

Return type

list

Returns

Critical values.

Notes

The \(\lambda_i\) values are calculated as follows:

\[\lambda_i = \frac{ (n-i)\ t_{p, n-i-1} }{ \sqrt{(n-i-1+t_{p, n-i-1}^2)(n-i+1)} } \quad i \in \{1,2, \dots, r \}\]

\[p = 1 - \frac{\alpha}{2(n-i+1)}\]

Where

  • \(n\) : number of points in the array.

  • \(\alpha\) : significance level.

  • \(t_{p,v}\) : percent point function of the t-distribution evaluated at probability \(p\) with \(v\) degrees of freedom.

  • \(r\) : maximum number of outliers.
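
As a rough sketch, the critical values can be evaluated with scipy.stats.t.ppf. The function name critvals_sketch is hypothetical and this is not the araucaria implementation; it uses the form of the formula with \(n-i-1+t_{p,n-i-1}^2\) under the square root, as given by Rosner (1983).

```python
from math import sqrt
from scipy.stats import t

def critvals_sketch(n, r, alpha):
    """Hypothetical helper: critical values lambda_i for i = 1..r."""
    vals = []
    for i in range(1, r + 1):
        p  = 1 - alpha / (2 * (n - i + 1))  # probability for the percent point function
        tv = t.ppf(p, n - i - 1)            # t quantile with n-i-1 degrees of freedom
        vals.append((n - i) * tv / sqrt((n - i - 1 + tv**2) * (n - i + 1)))
    return vals

print(['%1.3f' % v for v in critvals_sketch(54, 5, 0.05)])
```

Each \(\lambda_i\) uses its own \(p\) and degrees of freedom, because the sample shrinks by one observation at every step.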

Example

>>> from araucaria.stats import find_critvals
>>> n     = 54    # number of points
>>> r     = 5     # max number of outliers
>>> alpha = 0.05  # significance level
>>> lambd = find_critvals(n, r, alpha)
>>> for val in lambd:
...     print('%1.3f' % val)
3.159
3.151
3.144
3.136
3.128