PCA module

Principal component analysis (PCA) is an exploratory data analysis technique that reduces the dimensionality of a dataset while preserving most of its variance. Dimensionality reduction is achieved by changing the basis of the dataset in such a way that the new vectors constitute an orthonormal basis, and then retaining only the vectors that account for most of the variance.

Let us consider an m by n matrix \(X\) (m observations and n variables). Mathematically, PCA computes the eigenvalues and eigenvectors of the covariance matrix of \(X\). If the observations in \(X\) are centered, then the covariance matrix is proportional to \(X^TX\):

\[\operatorname{eigen}(\operatorname{cov}(X)) \propto \operatorname{eigen}(X^TX)\]

Following the eigen-decomposition, we can rewrite \(X^TX\) as follows:

\[X^TX = WDW^{-1}\]

Where

\(W\) : matrix of eigenvectors.
\(D\) : diagonal matrix of eigenvalues.

Multiplying the matrix \(X\) by the matrix of eigenvectors \(W\) projects the data onto an orthonormal basis. The columns of the resulting matrix \(T=XW\) are known as the principal components.
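The eigendecomposition route described above can be sketched with NumPy. This is only an illustration on synthetic data, not the code used by the module; the variable names and the random matrix are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 4
X = rng.normal(size=(m, n))          # m observations, n variables (synthetic)
X = X - X.mean(axis=0)               # center each variable (column)

# With centered data, cov(X) = X^T X / (m - 1)
cov = np.cov(X, rowvar=False)
gram = X.T @ X

# eigh suits symmetric matrices; it returns eigenvalues in ascending order
evals, W = np.linalg.eigh(gram)
order = np.argsort(evals)[::-1]      # reorder from largest to smallest
evals, W = evals[order], W[:, order]

T = X @ W                            # principal components, shape (m, n)
```

The columns of `T` are mutually orthogonal, and their squared norms equal the eigenvalues in `evals`.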

Although PCA can be computed directly from the eigendecomposition of the covariance matrix, it is often more convenient to compute it through the singular value decomposition (SVD) of \(X\):

\[X = U \Sigma V^*\]

Where:

\(U\): m by n semi-unitary matrix of left-singular vectors.
\(\Sigma\): n by n diagonal matrix of singular values.
\(V\): n by n semi-unitary matrix of right-singular vectors.

It is straightforward to observe the following relation between the eigenvalue decomposition of the covariance matrix and the SVD of \(X\):

\[\begin{aligned}X^TX &= V \Sigma^T \Sigma V^* = V \hat{\Sigma}^2 V^*\\X^TX &= WDW^{-1} = V \hat{\Sigma}^2 V^*\end{aligned}\]

Where

\(W\) : matrix of eigenvectors.
\(\hat{\Sigma}^2 = \Sigma^T \Sigma\): diagonal matrix of eigenvalues.

Therefore, the principal components can be computed as \(T=XV=U \Sigma\).
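The relation between both decompositions can be checked numerically. The following is a minimal sketch on synthetic data, assuming a real-valued, centered matrix with more observations than variables; it uses `numpy.linalg.svd` with `full_matrices=False` to obtain the reduced factors.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                       # synthetic data
X = X - X.mean(axis=0)                             # center each variable

U, s, Vh = np.linalg.svd(X, full_matrices=False)   # U: 20x4, s: (4,), Vh: 4x4

T_from_V = X @ Vh.T        # T = X V
T_from_U = U * s           # T = U Sigma (broadcasting scales each column of U)

# eigenvalues of X^T X, reversed so that they are in descending order
evals = np.linalg.eigvalsh(X.T @ X)[::-1]
```

Both expressions for `T` agree up to rounding error, and `evals` matches the squared singular values `s**2`.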

Dimensionality reduction

A truncated matrix \(T_L\) can be computed by retaining the L largest singular values and their corresponding singular vectors:

\[T_L = U_L \Sigma_L = X V_L\]
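As an illustration, the truncation can be written with array slicing. The data and the choice of `L = 2` are arbitrary assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))            # synthetic data
X = X - X.mean(axis=0)                  # center each variable

U, s, Vh = np.linalg.svd(X, full_matrices=False)

L = 2                                   # number of components to retain
U_L, s_L, V_L = U[:, :L], s[:L], Vh[:L].T

T_L = U_L * s_L                         # U_L Sigma_L, shape (20, 2)
T_L_alt = X @ V_L                       # X V_L, same result

# fraction of the total variance kept by the first L components
kept = (s_L**2).sum() / (s**2).sum()
```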

After reduction, a new vector \(x\) of observations can be projected onto the retained principal components:

\[t = U_L^T x\]

Target transformation

The resulting scores \(t\) can be transformed back into the original space:

\[\hat{x} = U_L t = U_L U_L^T x\]

The latter is commonly referred to as target transformation.
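The projection and back-transformation can be sketched as follows. This is a NumPy illustration under stated assumptions, not the module's implementation: the rank-2 synthetic data and the helper `target_transform_sketch` are hypothetical and exist only for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, L = 50, 6, 2

# synthetic rank-2 data: every column of X mixes the same two basis vectors
basis = rng.normal(size=(m, L))
basis = basis - basis.mean(axis=0)     # zero-mean basis, so X is centered too
X = basis @ rng.normal(size=(L, n))

U, s, Vh = np.linalg.svd(X, full_matrices=False)
U_L = U[:, :L]                         # m x L orthonormal basis


def target_transform_sketch(U_L, x):
    """Hypothetical helper: project x onto U_L and map the scores back."""
    t = U_L.T @ x                      # scores, length L
    x_hat = U_L @ t                    # reconstruction, length m
    return t, x_hat


x_inside = basis @ np.array([0.3, -1.2])   # lies in the retained subspace
x_outside = rng.normal(size=m)             # generic vector, mostly outside it

t_in, xhat_in = target_transform_sketch(U_L, x_inside)
t_out, xhat_out = target_transform_sketch(U_L, x_outside)

res_in = np.linalg.norm(x_inside - xhat_in)      # residual of the first vector
res_out = np.linalg.norm(x_outside - xhat_out)   # residual of the second vector
```

A vector that lies in the retained subspace is reproduced almost exactly, while a generic vector leaves a large residual. This contrast is what makes target transformation useful for testing whether a candidate vector is compatible with the model.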

The pca module offers the following classes and functions to perform principal component analysis based on the SVD approach:

Class	Description
`PCAModel`	Container of results from PCA.

Function	Description
`pca()`	Performs principal component analysis on a collection.
`target_transform()`	Performs target transformation on a collection.

class araucaria.stats.pca.PCAModel(matrix, name=None, ncomps=None, cumvar=None)

PCA Model class.

This class stores the results of principal component analysis on an array.

Parameters

matrix (ndarray) – m by n array containing m observations and n variables.
name (str) – Name for the collection. The default is None.
ncomps (int) – Number of components to preserve. Default is None, which will preserve all components.
cumvar (float) – Cumulative variance to preserve. Default is None, which will preserve all components.

Raises

AssertionError – If ncomps is not a positive integer.
AssertionError – If ncomps is larger than the number of variables in matrix.
AssertionError – If cumvar is negative or greater than 1.

matrix

Array containing m observations in rows and n variables in columns.

Type: ndarray

U

m by n array with the semi-unitary matrix of left-singular vectors.

Type: ndarray

s

1-D array with the n singular values.

Type: ndarray

Vh

n by n array with the transpose of the semi-unitary matrix of right-singular vectors.

Type: ndarray

variance

Array with the explained variance of each component.

Type: ndarray

components

m by n array with principal components.

Type: ndarray

Notes

The following methods are currently implemented:

Method	Description
`transform()`	Projects observations into principal components.
`inverse_transform()`	Converts principal component values into observations.

Important

Observations in matrix must be centered in order to perform PCA.
Either ncomps or cumvar can be set to reduce the dimensionality of the dataset. If ncomps is provided, it takes precedence over cumvar.
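The precedence rule can be illustrated with a short sketch. The helper `select_ncomps` below is hypothetical and is not part of araucaria. It assumes that the explained variance of each component is the squared singular value divided by the sum of squares, and that cumvar selects the smallest number of components whose cumulative variance reaches the requested value.

```python
import numpy as np


def select_ncomps(s, ncomps=None, cumvar=None):
    """Hypothetical helper: number of components to keep for singular values s."""
    if ncomps is not None:             # ncomps takes precedence over cumvar
        return ncomps
    if cumvar is None:                 # neither given: keep all components
        return len(s)
    ratio = s**2 / np.sum(s**2)        # explained variance ratio per component
    # smallest count whose cumulative ratio reaches cumvar, capped at len(s)
    count = int(np.searchsorted(np.cumsum(ratio), cumvar)) + 1
    return min(count, len(s))


s = np.array([3.0, 2.0, 1.0])          # variance ratios: 9/14, 4/14, 1/14

keep_all = select_ncomps(s)
keep_cumvar = select_ncomps(s, cumvar=0.9)
keep_ncomps = select_ncomps(s, ncomps=1, cumvar=0.9)
```

Here the first two components explain 13/14 of the variance, so `cumvar=0.9` keeps two components, while passing `ncomps=1` overrides it.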

Example

>>> from numpy.random import randn
>>> from araucaria.stats import PCAModel
>>> from araucaria.utils import check_objattrs
>>> matrix = randn(10,10)
>>> model  = PCAModel(matrix)
>>> type(model)
<class 'araucaria.stats.pca.PCAModel'>

>>> # verifying attributes
>>> attrs = ['matrix', 'components', 'variance']
>>> check_objattrs(model, PCAModel, attrs)
[True, True, True]

transform(obs)

Projects observations into principal components.

Parameters: obs (ndarray) – Array with observed values.
Return type: ndarray
Returns: Array with scores on principal components.

inverse_transform(p)

Converts principal components into observations.

Parameters: p (ndarray) – Array with scores on principal components.
Return type: ndarray
Returns: Array with observed values.

araucaria.stats.pca.pca(collection, taglist=['all'], pca_region='xanes', pca_range=[-inf, inf], ncomps=None, cumvar=None, kweight=2)

Performs principal component analysis (PCA) on a collection.

Parameters

collection (Collection) – Collection with the groups for PCA.
taglist (List[str]) – List with keys to filter groups based on their tags attributes in the Collection. The default is ['all'].
pca_region (str) – XAFS region to perform PCA. Accepted values are 'dxanes', 'xanes', or 'exafs'. The default is 'xanes'.
pca_range (list) – Domain range in absolute values. Energy units are expected for 'dxanes' or 'xanes', while wavenumber (k) units are expected for 'exafs'. The default is [-inf, inf].
ncomps (Optional[int]) – Number of components to preserve from the PCA. Default is None, which will preserve all components.
cumvar (Optional[float]) – Cumulative variance to preserve from the PCA. Default is None, which will preserve all components.
kweight (int) – Exponent for weighting chi(k) by k^kweight. Only valid for pca_region='exafs'. The default is 2.
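To make the roles of pca_range and kweight concrete, the sketch below restricts a synthetic chi(k) signal to a wavenumber range and applies the k^kweight weighting. It is a plain NumPy illustration, not araucaria's implementation: the signal is invented, and treating the range limits as inclusive is an assumption.

```python
import numpy as np

k = np.linspace(0.0, 12.0, 121)            # wavenumber grid (synthetic)
chi = np.sin(2.0 * k) * np.exp(-0.1 * k)   # synthetic chi(k), not real data

kweight = 2
pca_range = [2.0, 10.0]

# keep only points inside pca_range (limits assumed inclusive)
mask = (k >= pca_range[0]) & (k <= pca_range[1])
k_sel = k[mask]

chi_w = k_sel**kweight * chi[mask]         # chi(k) weighted by k^kweight
```

Weighting by a power of k amplifies the high-k part of the signal, where the amplitude decays in this synthetic example.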

Return type

PCAModel

Returns

PCA model with the following attributes:

components : array with principal components.
variance : array with explained variance of each component.
groupnames : list with names of the groups included in the PCA.
energy : array with energy values. Returned only if pca_region='xanes' or pca_region='dxanes'.
k : array with wavenumber values. Returned only if pca_region='exafs'.
matrix : array with centered values for groups in pca_range.
pca_pars : dictionary with PCA parameters.

See also

PCAModel: Class to store results from principal component analysis.
fig_pca(): Plots the results of principal component analysis.

Important

Group datasets in collection will be centered before performing PCA.
Either ncomps or cumvar can be set to reduce the dimensionality of the dataset. If ncomps is provided, it takes precedence over cumvar.

Example

>>> from araucaria.testdata import get_testpath
>>> from araucaria.io import read_collection_hdf5
>>> from araucaria.xas import pre_edge
>>> from araucaria.stats import PCAModel, pca
>>> from araucaria.utils import check_objattrs
>>> fpath      = get_testpath('Fe_database.h5')
>>> collection = read_collection_hdf5(fpath)
>>> collection.apply(pre_edge)
>>> out        = pca(collection, pca_region='xanes')
>>> attrs      = ['energy', 'matrix', 'components', 'variance', 'groupnames', 'pca_pars']
>>> check_objattrs(out, PCAModel, attrs)
[True, True, True, True, True, True]

araucaria.stats.pca.target_transform(model, collection, taglist=['all'])

Performs target transformation on a collection.

Parameters

model (PCAModel) – PCA model to perform the projection and inverse transformation.
collection (Collection) – Collection with the groups for target transformation.
taglist (List[str]) – List with keys to filter groups based on their tags attributes in the Collection. The default is ['all'].

Return type

Dataset

Returns

Dataset with the following attributes:

groupnames : list with names of transformed groups.
energy : array with energy values. Returned only if pca_region='xanes' or pca_region='dxanes'.
k : array with wavenumber values. Returned only if pca_region='exafs'.
matrix : original array with mapped values.
tmatrix : array with target transformed groups.
scores : array with scores in the principal component basis.
chi2 : \(\chi^2\) values of the target transformed groups.

Raises

TypeError – If model is not a valid PCAModel instance.
KeyError – If attributes from pca() do not exist in model.