genESD module
The genesd module offers the following functions to detect outliers in a univariate array using the generalized extreme Studentized deviate (ESD) test:
Function | Description |
---|---|
genesd | Identifies outliers in a data array. |
find_ri | Computes the \(R_i\) statistics for the generalized ESD test. |
find_critvals | Computes the critical values for the generalized ESD test. |
- araucaria.stats.genesd.genesd(data, r, alpha)
Identifies outliers in a data array.
This function uses the generalized extreme Studentized deviate (ESD) test to detect one or more outliers in univariate data [1].
- Parameters
- Return type
- Returns
report – Report of the generalized ESD test.
index – Indices of outliers in the data.
Notes
The identification of outliers considers the following hypothesis test:
\(H_0\): there are no outliers in the data.
\(H_1\): there are up to \(r\) outliers in the data.
The algorithm performs the following operations:
The \(R_i\) test statistics are computed for \(r\) potential outliers, removing the most extreme remaining observation from the data after each successive calculation of the test statistic.
The \(\lambda_i\) critical values are computed for \(r\) potential outliers, considering a significance level of \(\alpha\) for the t-distribution.
Each \(R_i\) is compared with its \(\lambda_i\), and the largest \(i\) for which \(R_i > \lambda_i\) is accepted as the number of outliers.
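The comparison step can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the helper name `count_outliers` is hypothetical, and the input values are copied from the report in the Rosner example on this page.

```python
def count_outliers(ri, lambdas):
    """Return the largest i (1-based) for which R_i > lambda_i, or 0 if none.

    Hypothetical helper for illustration; not part of araucaria.
    """
    n_out = 0
    for i, (r_stat, crit) in enumerate(zip(ri, lambdas), start=1):
        if r_stat > crit:
            n_out = i  # keep the largest index that passes the test
    return n_out


# R_i and lambda_i values taken from the Rosner report shown in the Example
ri = [3.1189, 2.943, 3.1794, 2.8102, 2.8156]
lambdas = [3.1588, 3.1514, 3.1439, 3.1362, 3.1282]

n_out = count_outliers(ri, lambdas)
print(n_out)  # 3
```

Note that \(R_1\) and \(R_2\) do not exceed their critical values, yet three outliers are reported: the test accepts the largest \(i\) that passes, which is what lets it cope with masking by neighbouring outliers.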
References
[1] Rosner, B. (1983) “Percentage Points for a Generalized ESD Many-Outlier Procedure”, Technometrics, 25(2), pp. 165-172.
Example
>>> # calculating outliers for Rosner data (1983):
>>> from numpy import loadtxt, allclose
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import genesd
>>> path = get_testpath('rosner.dat')
>>> data = loadtxt(path)
>>> r = 5
>>> alpha = 0.05
>>> report, index = genesd(data, r, alpha)
>>> print(report)
Generalized ESD test for outliers
H0: there are no outliers in the data
H1: there are up to 5 outliers in the data
Significance level: alpha = 0.05
Critical region: Reject H0 if R_i > lambda_i
=====================================
n outliers  x_i    R_i     lambda_i
=====================================
1           6.01   3.1189  3.1588
2           5.42   2.943   3.1514
3           5.34   3.1794  3.1439 *
4           4.64   2.8102  3.1362
5           -0.25  2.8156  3.1282
=====================================
>>> print(data[index])
[6.01 5.42 5.34]
- araucaria.stats.genesd.find_ri(data, r)
Computes the \(R_i\) test statistics for the generalized extreme Studentized deviate (ESD) test.
- Parameters
- Return type
- Returns
Test statistics for the generalized ESD test.
Values of the data points furthest from the mean.
Notes
The \(R_i\) test statistics are calculated as follows:
\[R_i = \frac{\textrm{max} | x_i - \bar{x}_{n-i+1}| }{s_{n-i+1}} \quad i \in \{1,2, \dots, r \}\]
Where
\(\bar{x}_{n-i+1}\): sample mean of reduced array.
\(s_{n-i+1}\) : sample standard deviation of reduced array.
\(n-i+1\) : number of points in the reduced array.
\(r\) : maximum number of outliers.
Each \(R_i\) is computed from the \(n-i+1\) observations that remain in the array. After each calculation, the observation that maximizes \(| x_i - \bar{x} |\) is removed. This procedure is repeated until \(r\) observations have been removed from the array.
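The iterative removal can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code behind `find_ri`: the name `ri_sketch` is hypothetical, the sample standard deviation is taken with the \(n-1\) denominator (`statistics.stdev`), and ties are broken by taking the first maximizing observation.

```python
from statistics import mean, stdev


def ri_sketch(data, r):
    """Compute R_i and the removed observations x_i for i = 1..r.

    Hypothetical helper for illustration; requires r <= len(data) - 2
    so that a standard deviation can always be computed.
    """
    values = list(data)  # work on a copy; the reduced array shrinks each pass
    ri, xi = [], []
    for _ in range(r):
        center = mean(values)
        spread = stdev(values)  # assumption: sample std with n-1 denominator
        # index of the observation furthest from the current mean
        idx = max(range(len(values)), key=lambda k: abs(values[k] - center))
        ri.append(abs(values[idx] - center) / spread)
        xi.append(values.pop(idx))  # remove it before the next pass
    return ri, xi


ri, xi = ri_sketch([1.0, 2.0, 3.0, 4.0, 100.0], r=2)
print(['%1.3f' % val for val in ri])  # ['1.788', '1.162']
print(xi[0])                          # 100.0
```

In the first pass the mean is 22 and the sample standard deviation is about 43.62, so \(R_1 = 78 / 43.62 \approx 1.788\); after removing 100.0, the reduced array has mean 2.5 and \(R_2 = 1.5 / 1.291 \approx 1.162\).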
Example
>>> # calculating test statistics from Rosner's data (1983):
>>> from numpy import loadtxt
>>> from araucaria.testdata import get_testpath
>>> from araucaria.stats import find_ri
>>> path = get_testpath('rosner.dat')
>>> data = loadtxt(path)
>>> r = 5
>>> ri, xi = find_ri(data, r)
>>> for val in ri:
...     print('%1.3f' % val)
3.119
2.943
3.179
2.810
2.816
- araucaria.stats.genesd.find_critvals(n, r, alpha)
Computes critical values \(\lambda_i\) for the generalized extreme Studentized deviate (ESD) test.
- Parameters
- Return type
- Returns
Critical values.
Notes
The \(\lambda_i\) values are calculated as follows:
\[\lambda_i = \frac{ (n-i)\ t_{p, n-i-1} }{ \sqrt{(n-i-1+t_{p, n-i-1}^2)(n-i+1)} } \quad i \in \{1,2, \dots, r \}\]
\[p = 1 - \frac{\alpha}{2(n-i+1)}\]
Where
\(n\) : number of points in the array.
\(\alpha\) : significance level.
\(t_{p,v}\) : percent point function (inverse CDF) of the t-distribution at probability \(p\) with \(v\) degrees of freedom.
\(r\) : maximum number of outliers.
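The calculation can be sketched without third-party packages. This is a rough illustration, not the code behind `find_critvals`: all function names are hypothetical, the denominator uses \(n-i-1+t_{p,n-i-1}^2\) as in Rosner's formulation, and the percent point function is approximated here by Simpson integration of the t density plus bisection, standing in for a library routine such as `scipy.stats.t.ppf`.

```python
import math


def t_pdf(x, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    log_c = (math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
             - 0.5 * math.log(v * math.pi))
    return math.exp(log_c - (v + 1) / 2 * math.log1p(x * x / v))


def t_cdf(x, v, steps=1000):
    """CDF for x >= 0: 0.5 plus a composite Simpson integral over [0, x]."""
    h = x / steps
    total = t_pdf(0.0, v) + t_pdf(x, v)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, v)
    return 0.5 + total * h / 3


def t_ppf(p, v):
    """Approximate percent point function for p > 0.5 by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, v) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def critvals_sketch(n, r, alpha):
    """Critical values lambda_i for i = 1..r (hypothetical helper)."""
    lambdas = []
    for i in range(1, r + 1):
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_ppf(p, n - i - 1)
        lambdas.append((n - i) * t
                       / math.sqrt((n - i - 1 + t * t) * (n - i + 1)))
    return lambdas


lambdas = critvals_sketch(54, 5, 0.05)
for val in lambdas:
    print('%1.3f' % val)
```

With \(n = 54\), \(r = 5\) and \(\alpha = 0.05\), the printed values should land close to those in the `find_critvals` example on this page; small differences in the last digit can come from the numerical approximation.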
Example
>>> from araucaria.stats import find_critvals
>>> n = 54        # number of points
>>> r = 5         # max number of outliers
>>> alpha = 0.05  # significance level
>>> lambd = find_critvals(n, r, alpha)
>>> for val in lambd:
...     print('%1.3f' % val)
3.159
3.151
3.144
3.136
3.128