PCA module
Principal component analysis (PCA) is an exploratory data analysis technique that reduces the dimensionality of a dataset while preserving most of its variance. Dimensionality reduction is achieved by changing the basis of the dataset so that the new vectors form an orthonormal basis.
Consider an m by n matrix \(X\) (m observations and n variables). Mathematically, PCA computes the eigenvalues and eigenvectors of the covariance matrix. If the observations in \(X\) are centered, then the covariance is proportional to \(X^TX\):
\[\mathrm{cov}(X) \propto X^TX\]
Following the eigen-decomposition, we can rewrite \(X^TX\) as follows:
\[X^TX = W D W^T\]
Where
\(W\) : matrix of eigenvectors.
\(D\) : diagonal matrix of eigenvalues.
Multiplying the matrix \(X\) by the matrix of eigenvectors \(W\) effectively projects the data onto an orthonormal basis. The columns of \(T=XW\) are known as the principal components.
Although PCA can be computed directly from the eigen-decomposition of the covariance matrix, it is often convenient to compute it through the singular value decomposition (SVD) of \(X\):
\[X = U \Sigma V^T\]
Where:
\(U\): m by n semi-unitary matrix of left-singular vectors.
\(\Sigma\): n by n diagonal matrix of singular values.
\(V\): n by n semi-unitary matrix of right-singular vectors.
It is straightforward to observe the following relation between the eigenvalue decomposition of the covariance matrix and the SVD of \(X\):
\[X^TX = V \Sigma^T U^T U \Sigma V^T = V \hat{\Sigma^2} V^T\]
Where
\(V = W\) : matrix of eigenvectors.
\(\hat{\Sigma^2} = \Sigma^T \Sigma = D\): diagonal matrix of eigenvalues.
Therefore, the principal components can be computed as \(T=XV=U \Sigma\).
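The equivalence between the two routes can be checked numerically. The following is a minimal NumPy sketch on synthetic data, not the implementation used by the module; the variable names and the random matrix are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))       # m=20 observations, n=5 variables (synthetic)
X = X - X.mean(axis=0)             # center each variable (column)

# Route 1: eigen-decomposition of X^T X (symmetric, so eigh applies).
eigvals, W = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]  # eigh returns eigenvalues in ascending order
eigvals, W = eigvals[order], W[:, order]

# Route 2: thin SVD of X (singular values are returned in descending order).
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X^T X equal the squared singular values.
assert np.allclose(eigvals, s**2)

# Principal components: T = X V = U Sigma.
T = X @ Vh.T
assert np.allclose(T, U * s)

# Eigenvectors and right-singular vectors agree up to the sign of each column.
assert np.allclose(np.abs(W), np.abs(Vh.T))
```

The sign ambiguity in the last check is expected: an eigenvector multiplied by -1 is still an eigenvector, so the two routes may return columns with opposite signs.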
Dimensionality reduction
A truncated matrix \(T_L\) can be computed by retaining the L largest singular values and their corresponding singular vectors:
\[T_L = U_L \Sigma_L = X V_L\]
After reduction, a new vector \(x\) of observations can be projected onto the principal components:
\[t = x V_L\]
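As an illustration of truncation and projection, here is a short NumPy sketch on synthetic data. The names `V_L`, `T_L`, `x_new` and `t_new` are chosen for this example only, and the new observation is assumed to be centered with the same column means as the training data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))       # synthetic data: 30 observations, 6 variables
mean = X.mean(axis=0)
Xc = X - mean                      # centered observations

U, s, Vh = np.linalg.svd(Xc, full_matrices=False)

L = 2                              # number of components to retain
V_L = Vh[:L].T                     # n by L matrix of retained right-singular vectors
T_L = Xc @ V_L                     # m by L truncated principal components

# Both expressions for the truncated components agree.
assert np.allclose(T_L, U[:, :L] * s[:L])

# Project a new observation, centered with the same means (assumption).
x_new = rng.normal(size=6) - mean
t_new = x_new @ V_L                # scores of the new observation, length L
```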
Target transformation
These principal components can be transformed back into the original space:
\[\hat{x} = x V_L V_L^T\]
Here \(V_L\) contains the L retained right-singular vectors. The latter is commonly referred to as target transformation.
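The round trip through the reduced basis can be sketched as follows. This is an illustrative NumPy example on synthetic data; `target_transform_sketch` is a hypothetical helper, and the sum of squared residuals it returns is only one plausible goodness measure, not necessarily the \(\chi^2\) reported by the module.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data that is essentially rank 2, plus a small amount of noise.
scores = rng.normal(size=(40, 2))
loadings = rng.normal(size=(2, 8))
X = scores @ loadings + 0.01 * rng.normal(size=(40, 8))
Xc = X - X.mean(axis=0)

_, _, Vh = np.linalg.svd(Xc, full_matrices=False)
V_L = Vh[:2].T                     # keep the first two right-singular vectors


def target_transform_sketch(x, V_L):
    """Project x onto the retained components and map it back (hypothetical helper)."""
    t = x @ V_L                    # scores in the principal component basis
    x_hat = t @ V_L.T              # reconstruction in the original space
    residual = np.sum((x - x_hat) ** 2)
    return t, x_hat, residual


# A target well described by the retained components is reproduced closely,
# while an unrelated random target is not.
_, _, res_in = target_transform_sketch(Xc[0], V_L)
_, _, res_out = target_transform_sketch(rng.normal(size=8), V_L)
assert res_in < res_out
```

A small residual suggests that the target lies close to the subspace spanned by the retained components, which is the usual reason for running a target transformation.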
The pca module offers the following classes and functions to perform principal component analysis based on the SVD approach:
Class | Description |
---|---|
PCAModel | Container of results from PCA. |

Function | Description |
---|---|
pca() | Performs principal component analysis on a collection. |
target_transform() | Performs target transformation on a collection. |
- class araucaria.stats.pca.PCAModel(matrix, name=None, ncomps=None, cumvar=None)
PCA Model class.
This class stores the results of principal component analysis on an array.
- Parameters
matrix (ndarray) – m by n array containing m observations and n variables.
name (str) – Name for the collection. The default is None.
ncomps (int) – Number of components to preserve. Default is None, which will preserve all components.
cumvar (float) – Cumulative variance to preserve. Default is None, which will preserve all components.
- Raises
AssertionError – If ncomps is not a positive integer.
AssertionError – If ncomps is larger than the number of variables in matrix.
AssertionError – If cumvar is negative or greater than 1.
- Vh
n by n array with the transpose of the semi-unitary matrix of right-singular vectors.
- Type
Notes
The following methods are currently implemented:

Method | Description |
---|---|
 | Projects observations into principal components. |
 | Converts principal component values into observations. |
Important
Observations in matrix must be centered in order to perform PCA.
Either ncomps or cumvar can be set to reduce the dimensionality of the dataset. If ncomps is provided, it takes precedence over cumvar.
Example
>>> from numpy.random import randn
>>> from araucaria.stats import PCAModel
>>> from araucaria.utils import check_objattrs
>>> matrix = randn(10,10)
>>> model = PCAModel(matrix)
>>> type(model)
<class 'araucaria.stats.pca.PCAModel'>

>>> # verifying attributes
>>> attrs = ['matrix', 'components', 'variance']
>>> check_objattrs(model, PCAModel, attrs)
[True, True, True]
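To make the cumvar option more concrete, the sketch below shows one way a cumulative variance threshold could be turned into a number of components. It assumes that the variance explained by each component is proportional to its squared singular value; `ncomps_from_cumvar` is a hypothetical helper written for this example and is not part of the module, whose exact selection rule is not shown here.

```python
import numpy as np


def ncomps_from_cumvar(s, cumvar):
    """Smallest number of components whose cumulative explained variance
    reaches cumvar (hypothetical helper, not part of araucaria)."""
    if not 0 <= cumvar <= 1:
        raise ValueError("cumvar must lie between 0 and 1")
    ratio = s**2 / np.sum(s**2)    # fraction of variance per component
    cum = np.cumsum(ratio)         # cumulative explained variance
    # Index of the first entry with cum >= cumvar, converted to a count.
    n = int(np.searchsorted(cum, cumvar)) + 1
    return min(n, len(s))          # guard against rounding in the last entry


s = np.array([3.0, 2.0, 1.0])      # example singular values
print(ncomps_from_cumvar(s, 0.9))  # → 2
```

With these singular values the cumulative explained variance is roughly 0.64, 0.93 and 1.0, so two components are enough to reach a threshold of 0.9.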
- araucaria.stats.pca.pca(collection, taglist=['all'], pca_region='xanes', pca_range=[-inf, inf], ncomps=None, cumvar=None, kweight=2)
Performs principal component analysis (PCA) on a collection.
- Parameters
collection (Collection) – Collection with the groups for PCA.
taglist (List[str]) – List with keys to filter groups based on their tags attributes in the Collection. The default is ['all'].
pca_region (str) – XAFS region to perform PCA. Accepted values are 'dxanes', 'xanes', or 'exafs'. The default is 'xanes'.
pca_range (list) – Domain range in absolute values. Energy units are expected for 'dxanes' or 'xanes', while wavenumber (k) units are expected for 'exafs'. The default is [-inf, inf].
ncomps (Optional[int]) – Number of components to preserve from the PCA. Default is None, which will preserve all components.
cumvar (Optional[float]) – Cumulative variance to preserve from the PCA. Default is None, which will preserve all components.
kweight (int) – Exponent for weighting chi(k) by k^kweight. Only valid for pca_region='exafs'. The default is 2.
- Return type
PCAModel
- Returns
PCA model with the following attributes:
components: array with principal components.
variance: array with explained variance of each component.
groupnames: list with names of the groups included in the PCA.
energy: array with energy values. Returned only if pca_region='xanes' or pca_region='dxanes'.
k: array with wavenumber values. Returned only if pca_region='exafs'.
matrix: array with centered values for groups in pca_range.
pca_pars: dictionary with PCA parameters.
See also
PCAModel
Class to store results from principal component analysis.
fig_pca()
Plots the results of principal component analysis.
Important
Group datasets in collection will be centered before performing PCA.
Either ncomps or cumvar can be set to reduce the dimensionality of the dataset. If ncomps is provided, it takes precedence over cumvar.
Example
>>> from araucaria.testdata import get_testpath
>>> from araucaria.io import read_collection_hdf5
>>> from araucaria.xas import pre_edge
>>> from araucaria.stats import PCAModel, pca
>>> from araucaria.utils import check_objattrs
>>> fpath = get_testpath('Fe_database.h5')
>>> collection = read_collection_hdf5(fpath)
>>> collection.apply(pre_edge)
>>> out = pca(collection, pca_region='xanes')
>>> attrs = ['energy', 'matrix', 'components', 'variance', 'groupnames', 'pca_pars']
>>> check_objattrs(out, PCAModel, attrs)
[True, True, True, True, True, True]
- araucaria.stats.pca.target_transform(model, collection, taglist=['all'])
Performs target transformation on a collection.
- Parameters
model (PCAModel) – PCA model to perform the projection and inverse transformation.
collection (Collection) – Collection with the groups for target transformation.
taglist (List[str]) – List with keys to filter groups based on their tags attributes in the Collection. The default is ['all'].
- Return type
Dataset
- Returns
Dataset with the following attributes:
groupnames: list with names of transformed groups.
energy: array with energy values. Returned only if pca_region='xanes' or pca_region='dxanes'.
k: array with wavenumber values. Returned only if pca_region='exafs'.
matrix: original array with mapped values.
tmatrix: array with target transformed groups.
scores: array with scores in the principal component basis.
chi2: \(\chi^2\) values of the target transformed groups.
- Raises
See also
pca()
Performs principal component analysis on a collection.
fig_target_transform()
Plots the results of target transformation.
Example
>>> from araucaria.testdata import get_testpath
>>> from araucaria import Dataset
>>> from araucaria.io import read_collection_hdf5
>>> from araucaria.xas import pre_edge
>>> from araucaria.stats import pca, target_transform
>>> from araucaria.utils import check_objattrs
>>> fpath = get_testpath('Fe_database.h5')
>>> collection = read_collection_hdf5(fpath)
>>> collection.apply(pre_edge)
>>> model = pca(collection, pca_region='xanes', cumvar=0.9)
>>> data = target_transform(model, collection)
>>> attrs = ['groupnames', 'tmatrix', 'chi2', 'scores', 'energy']
>>> check_objattrs(data, Dataset, attrs)
[True, True, True, True, True]