genESD module

The genesd module offers the following functions to detect outliers in a univariate array using the generalized extreme Studentized deviate test:

  • genesd() – Identifies outliers in a data array.

  • find_ri() – Computes the Ri statistics for the generalized ESD test.

  • find_critvals() – Computes the critical values for the generalized ESD test.

araucaria.stats.genesd.genesd(data, r, alpha)

Identifies outliers in a data array.

This function uses the generalized extreme Studentized deviate (ESD) test to detect one or more outliers in univariate data [1].

Parameters
  • data (ndarray) – Array in which to identify outliers.

  • r (int) – Maximum number of outliers.

  • alpha (float) – Significance level for the statistical test.

Return type

Tuple[str, list]

Returns

  • report – Report of the generalized ESD test.

  • index – Indices of outliers in the data.

Notes

The identification of outliers considers the following hypothesis test:

  • H0: there are no outliers in the data.

  • H1: there are up to r outliers in the data.

The algorithm performs the following operations:

  1. The Ri test statistics are computed for r potential outliers, removing the most extreme remaining observation from the data at each successive calculation of the test statistic.

  2. The λi critical values are computed for r potential outliers, considering a significance level of α for the t-distribution.

  3. The two sets of values are compared, and the largest i for which Ri > λi is accepted as the number of outliers.
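
As an illustration of step 3, the following minimal sketch applies the comparison to the Ri and λi values printed in the report of the Example for this function. It only sketches the decision rule and is not the araucaria implementation; the variable names are assumptions.

```python
# R_i and lambda_i values copied from the genesd() report for Rosner's data
ri    = [3.1189, 2.943, 3.1794, 2.8102, 2.8156]
lambd = [3.1588, 3.1514, 3.1439, 3.1362, 3.1282]

# largest 1-based position i with R_i > lambda_i; 0 means no outliers are declared
n_outliers = max(
    (i for i, (r_i, l_i) in enumerate(zip(ri, lambd), start=1) if r_i > l_i),
    default=0,
)
print(n_outliers)  # prints 3
```

Note that the rule takes the largest such i, so the first two observations are declared outliers even though R1 and R2 alone do not exceed their critical values.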

References

[1] Rosner, B. (1983) “Percentage Points for a Generalized ESD Many-Outlier Procedure”, Technometrics, 25(2), pp. 165-172.

Example

>>> # calculating outliers for Rosner data (1983):
>>> from numpy import loadtxt
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import genesd
>>> path  = get_testpath('rosner.dat')
>>> data  = loadtxt(path)
>>> r     = 5
>>> alpha = 0.05
>>> report, index = genesd(data, r, alpha)
>>> print(report)
Generalized ESD test for outliers
  H0: there are no outliers in the data
  H1: there are up to 5 outliers in the data
  Significance level:  alpha = 0.05
  Critical region:  Reject H0 if R_i > lambda_i
=====================================
n outliers  x_i    R_i     lambda_i
=====================================
1           6.01   3.1189  3.1588
2           5.42   2.943   3.1514
3           5.34   3.1794  3.1439    *
4           4.64   2.8102  3.1362
5           -0.25  2.8156  3.1282
=====================================
>>> print(data[index])
[6.01 5.42 5.34]
araucaria.stats.genesd.find_ri(data, r)

Computes the Ri test statistics for the generalized extreme Studentized deviate (ESD) test.

Parameters
  • data (ndarray) – Array from which to compute the test statistics.

  • r (int) – Maximum number of outliers.

Return type

Tuple[float, float]

Returns

  • Test statistic for the generalized ESD test.

  • Value of data points furthest from the mean.

Notes

The Ri test statistics are calculated as follows:

\[ R_i = \frac{\max \left| x_i - \bar{x}_{n-i+1} \right|}{s_{n-i+1}}, \qquad i \in \{1, 2, \dots, r\} \]

Where

  • $\bar{x}_{n-i+1}$ : sample mean of the reduced array.

  • $s_{n-i+1}$ : sample standard deviation of the reduced array.

  • $n-i+1$ : number of points in the reduced array.

  • r : maximum number of outliers.

After each calculation, the observation that maximizes $|x_i - \bar{x}|$ is removed from the array, so that Ri is always computed from n - i + 1 observations. This procedure is repeated until r observations have been removed from the array.
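
The procedure described above can be sketched with the standard library alone. The function name sketch_ri and the sample values are hypothetical and serve only as an illustration; this is not the code used by find_ri().

```python
from statistics import mean, stdev

def sketch_ri(data, r):
    """Hypothetical re-implementation of the R_i recurrence (not araucaria's code)."""
    values = list(data)
    ri, xi = [], []
    for _ in range(r):
        centre = mean(values)
        spread = stdev(values)  # sample standard deviation (n - 1 denominator)
        # index of the observation furthest from the current mean
        idx = max(range(len(values)), key=lambda k: abs(values[k] - centre))
        ri.append(abs(values[idx] - centre) / spread)
        # remove that observation before computing the next statistic
        xi.append(values.pop(idx))
    return ri, xi

# made-up sample with one obvious extreme value
ri, xi = sketch_ri([2.1, 2.4, 1.9, 2.0, 2.2, 9.5], r=2)
print(xi)  # prints [9.5, 2.4]
```

Each pass recomputes the mean and standard deviation on the reduced array, which is why the statistics need not decrease monotonically with i.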

Example

>>> # calculating test statistics from Rosner's data (1983):
>>> from numpy import loadtxt
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import find_ri
>>> path  = get_testpath('rosner.dat')
>>> data  = loadtxt(path)
>>> r     = 5
>>> ri,xi = find_ri(data,r)
>>> for val in ri:
...     print('%1.3f' % val)
3.119
2.943
3.179
2.810
2.816
araucaria.stats.genesd.find_critvals(n, r, alpha)

Computes critical values λi for the generalized extreme Studentized deviate (ESD) test.

Parameters
  • n (int) – Number of data points.

  • r (int) – Maximum number of outliers.

  • alpha (float) – Significance level for the statistical test.

Return type

list

Returns

Critical values.

Notes

The λi values are calculated as follows:

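\[ \lambda_i = \frac{(n-i)\, t_{p,\,n-i-1}}{\sqrt{\left(n-i-1+t_{p,\,n-i-1}^{2}\right)(n-i+1)}}, \qquad i \in \{1, 2, \dots, r\} \]
\[ p = 1 - \frac{\alpha}{2(n-i+1)} \]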

Where

  • n : number of points in the array.

  • α : significance level.

  • $t_{p,v}$ : percent point function of the t-distribution evaluated at probability p with v degrees of freedom.

  • r : maximum number of outliers.
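
The formula above can be evaluated directly, as in the following minimal sketch. It assumes that SciPy is available and uses scipy.stats.t.ppf as the percent point function; the function name sketch_critvals is hypothetical, and the sketch is not necessarily how find_critvals() is implemented.

```python
from math import sqrt
from scipy.stats import t  # assumption: SciPy is installed

def sketch_critvals(n, r, alpha):
    """Hypothetical evaluation of the lambda_i formula (not araucaria's code)."""
    lambd = []
    for i in range(1, r + 1):
        p = 1 - alpha / (2 * (n - i + 1))  # probability for the t quantile
        dof = n - i - 1                    # degrees of freedom
        t_p = t.ppf(p, dof)                # percent point function t_{p, n-i-1}
        lambd.append((n - i) * t_p / sqrt((dof + t_p**2) * (n - i + 1)))
    return lambd

# same inputs as the find_critvals() example: n = 54, r = 5, alpha = 0.05
vals = sketch_critvals(54, 5, 0.05)
```

For these inputs the first value should agree with the 3.159 reported for find_critvals() to three decimals, since both evaluate the same formula.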

Example

>>> from araucaria.stats import find_critvals
>>> n     = 54    # number of points
>>> r     = 5     # max number of outliers
>>> alpha = 0.05  # significance level
>>> lambd = find_critvals(n, r, alpha)
>>> for val in lambd:
...     print('%1.3f' % val)
3.159
3.151
3.144
3.136
3.128