PCA module

Principal component analysis (PCA) is an exploratory data analysis technique that reduces the dimensionality of a dataset while preserving most of its variance. Dimensionality reduction is achieved by changing the basis of the dataset in such a way that the new vectors constitute an orthonormal basis.

Let's consider an m by n matrix \(X\) (m observations and n variables). Mathematically, PCA computes the eigenvalues and eigenvectors of the covariance matrix. If the observations in \(X\) are centered, then the covariance matrix is proportional to \(X^TX\):

\[\mathrm{eigen}(\mathrm{cov}(X)) \propto \mathrm{eigen}(X^TX)\]

Following the eigen-decomposition, we can rewrite \(X^TX\) as follows:

\[X^TX = WDW^{-1}\]

Where

  • \(W\) : matrix of eigenvectors.

  • \(D\) : diagonal matrix of eigenvalues.

Multiplying the matrix \(X\) by the matrix of eigenvectors \(W\) effectively projects the data onto an orthonormal basis. The columns of the resulting matrix \(T=XW\) are known as the principal components.
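
The eigendecomposition route can be sketched with NumPy. This is a minimal illustration on random data, not the module's implementation: the array sizes and variable names are arbitrary choices, and centering is done here by subtracting the column means.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 4                      # m observations, n variables (arbitrary sizes)
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)            # center each variable (column)

# Eigendecomposition of the symmetric matrix X^T X.
# eigh returns eigenvalues in ascending order, so sort them largest first.
evals, W = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]
evals, W = evals[order], W[:, order]

# Project the data onto the eigenvectors.
T = X @ W

# W has orthonormal columns, so W^{-1} = W^T.
print(np.allclose(W.T @ W, np.eye(n)))
# T^T T equals the diagonal matrix of eigenvalues D.
print(np.allclose(T.T @ T, np.diag(evals)))
# The sample covariance is X^T X / (m - 1), hence the proportionality.
print(np.allclose(np.cov(X, rowvar=False), X.T @ X / (m - 1)))
```

Because \(X^TX\) is symmetric, `eigh` is used rather than the general `eig`, which keeps the eigenvalues real and the eigenvectors orthonormal.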

Although PCA can be computed directly from the eigendecomposition of the covariance matrix, it is often convenient to compute it through the singular value decomposition (SVD) of \(X\):

\[X = U \Sigma V^*\]

Where:

  • \(U\): m by n semi-unitary matrix of left-singular vectors.

  • \(\Sigma\): n by n diagonal matrix of singular values.

  • \(V\): n by n semi-unitary matrix of right-singular vectors.

It is straightforward to observe the following relation between the eigenvalue decomposition of the covariance matrix and the SVD of \(X\):

\[ \begin{align}\begin{aligned}X^TX &= V \Sigma^T \Sigma V^* = V \hat{\Sigma}^2 V^*\\X^TX &= WDW^{-1} = V \hat{\Sigma}^2 V^*\end{aligned}\end{align} \]

Where

  • \(W\) : matrix of eigenvectors.

  • \(\hat{\Sigma}^2 = \Sigma^T \Sigma\): diagonal matrix of eigenvalues.

Therefore, the principal components can be computed as \(T=XV=U \Sigma\).
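
The relation between both decompositions can be checked numerically. The following is a small sketch on random centered data, with arbitrary sizes and names; it only illustrates the identities above and is not taken from the module's source.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X = X - X.mean(axis=0)            # center each variable (column)

# Thin SVD: U is 30x5, s holds the 5 singular values, Vh is V^* (5x5).
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X^T X, sorted from largest to smallest.
evals = np.linalg.eigvalsh(X.T @ X)[::-1]

# Squared singular values match the eigenvalues of X^T X.
print(np.allclose(s**2, evals))
# Both expressions for the principal components agree: X V = U Sigma.
print(np.allclose(X @ Vh.T, U * s))
```

Here `U * s` scales each column of `U` by the matching singular value, which is equivalent to the matrix product \(U \Sigma\).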

Dimensionality reduction

A truncated matrix \(T_L\) can be computed by retaining the L largest singular values and their corresponding singular vectors:

\[T_L = U_L \Sigma_L = X V_L\]
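
As an illustration, the truncation can be written with array slicing. This sketch assumes random centered data and an arbitrary choice of L = 2; the explained variance fraction is computed here as the squared singular values divided by their sum, which is a common convention and an assumption with respect to this module.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
X = X - X.mean(axis=0)            # center each variable (column)

U, s, Vh = np.linalg.svd(X, full_matrices=False)

L = 2                             # number of components to keep (arbitrary)
U_L, s_L, V_L = U[:, :L], s[:L], Vh[:L].T

T_L = U_L * s_L                   # U_L Sigma_L, shape (40, 2)
print(T_L.shape)
print(np.allclose(T_L, X @ V_L))  # the same scores are obtained from X V_L

# Fraction of the total variance carried by the L retained components.
explained = s**2 / np.sum(s**2)
print(explained[:L].sum())
```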

After reduction, a new vector \(x\) of observations can be projected onto the principal components:

\[t = U_L^T x\]

Target transformation

These principal components can be transformed back into the original space:

\[\hat{x} = U_L t = U_L U_L^T x\]

The latter is commonly referred to as target transformation.
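
The projection and back-transformation can be sketched as follows. Since \(U_L\) is m by L, the vector \(x\) must have m entries for the product \(U_L^T x\) to be defined. The helper name `target_transform_vec` is hypothetical and is not part of the module.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, L = 20, 5, 3                # arbitrary sizes
X = rng.normal(size=(m, n))
U, s, Vh = np.linalg.svd(X, full_matrices=False)
U_L = U[:, :L]                    # m x L, orthonormal columns

def target_transform_vec(U_L, x):
    """Hypothetical helper: return (t, x_hat) with t = U_L^T x and x_hat = U_L t."""
    t = U_L.T @ x
    return t, U_L @ t

# A vector lying in the span of U_L is reproduced exactly...
x_in = U_L @ np.array([1.0, -2.0, 0.5])
t_in, xhat_in = target_transform_vec(U_L, x_in)

# ...while an arbitrary vector is only approximated.
x_out = rng.normal(size=m)
t_out, xhat_out = target_transform_vec(U_L, x_out)
residual = x_out - xhat_out

print(np.allclose(xhat_in, x_in))
# The residual is orthogonal to the retained basis vectors.
print(np.allclose(U_L.T @ residual, 0.0))
```

The size of the residual indicates how well a candidate vector is described by the retained components.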

The pca module offers the following classes and functions to perform principal component analysis based on the SVD approach:

Class       Description
PCAModel    Container of results from PCA.

Function              Description
pca()                 Performs principal component analysis on a collection.
target_transform()    Performs target transformation on a collection.

class araucaria.stats.pca.PCAModel(matrix, name=None, ncomps=None, cumvar=None)

PCA Model class.

This class stores the results of principal component analysis on an array.

Parameters
  • matrix (ndarray) – m by n array containing m observations and n variables.

  • name (str) – Name for the collection. The default is None.

  • ncomps (int) – Number of components to preserve. Default is None, which will preserve all components.

  • cumvar (float) – Cumulative variance to preserve. Default is None, which will preserve all components.

matrix

Array containing m observations in rows and n variables in columns.

Type

ndarray

U

m by n array with the semi-unitary matrix of left-singular vectors.

Type

ndarray

s

1-D array with the n singular values.

Type

ndarray

Vh

n by n array with the transpose of the semi-unitary matrix of right-singular vectors.

Type

ndarray

variance

Array with the explained variance of each component.

Type

ndarray

components

m by n array with principal components.

Type

ndarray

Notes

The following methods are currently implemented:

Method                 Description
transform()            Projects observations into principal components.
inverse_transform()    Converts principal component values into observations.

Important

  • Observations in matrix must be centered in order to perform PCA.

  • Either ncomps or cumvar can be set to reduce the dimensionality of the dataset. If ncomps is provided, it takes precedence over cumvar.
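
The precedence rule can be illustrated with a short sketch. The helper `select_ncomps` is hypothetical, and the selection logic is an assumption: explained variance is taken as the squared singular values normalised by their sum, and cumvar is treated as a fraction between 0 and 1. The class may implement the selection differently.

```python
import numpy as np

def select_ncomps(s, ncomps=None, cumvar=None):
    """Hypothetical helper: number of components to keep given singular values s."""
    if ncomps is not None:          # ncomps takes precedence over cumvar
        return min(int(ncomps), len(s))
    if cumvar is not None:
        frac = np.cumsum(s**2) / np.sum(s**2)   # cumulative explained variance
        # smallest number of components whose cumulative variance reaches cumvar
        return min(int(np.searchsorted(frac, cumvar)) + 1, len(s))
    return len(s)                   # default: keep all components

s = np.array([4.0, 2.0, 1.0, 1.0])  # squared: 16, 4, 1, 1 (total 22)
print(select_ncomps(s))                         # 4: nothing set, keep everything
print(select_ncomps(s, cumvar=0.9))             # 2: 16/22 < 0.9 <= 20/22
print(select_ncomps(s, ncomps=3, cumvar=0.9))   # 3: ncomps wins over cumvar
```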

Example

>>> from numpy.random import randn
>>> from araucaria.stats import PCAModel
>>> from araucaria.utils import check_objattrs
>>> matrix = randn(10,10)
>>> model  = PCAModel(matrix)
>>> type(model)
<class 'araucaria.stats.pca.PCAModel'>
>>> # verifying attributes
>>> attrs = ['matrix', 'components', 'variance']
>>> check_objattrs(model, PCAModel, attrs)
[True, True, True]
transform(obs)

Projects observations into principal components.

Parameters

obs (ndarray) – Array with observed values.

Return type

ndarray

Returns

Array with scores on principal components.

inverse_transform(p)

Converts principal components into observations.

Parameters

p (ndarray) – Array with scores on principal components.

Return type

ndarray

Returns

Array with observed values.

araucaria.stats.pca.pca(collection, taglist=['all'], pca_region='xanes', pca_range=[-inf, inf], ncomps=None, cumvar=None, kweight=2)

Performs principal component analysis (PCA) on a collection.

Parameters
  • collection (Collection) – Collection with the groups for PCA.

  • taglist (List[str]) – List with keys to filter groups based on their tags attributes in the Collection. The default is ['all'].

  • pca_region (str) – XAFS region to perform PCA. Accepted values are 'dxanes', 'xanes', or 'exafs'. The default is 'xanes'.

  • pca_range (list) – Domain range in absolute values. Energy units are expected for 'dxanes' or 'xanes', while wavenumber (k) units are expected for 'exafs'. The default is [-inf, inf].

  • ncomps (Optional[int]) – Number of components to preserve from the PCA. Default is None, which will preserve all components.

  • cumvar (Optional[float]) – Cumulative variance to preserve from the PCA. Default is None, which will preserve all components.

  • kweight (int) – Exponent for weighting chi(k) by k^kweight. Only valid for pca_region='exafs'. The default is 2.
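
To make the roles of pca_range and kweight concrete, the sketch below applies both to a synthetic signal. The signal, the grid, and the inclusive treatment of the range limits are assumptions made for illustration; the function's internal handling may differ.

```python
import numpy as np

k = np.linspace(0.0, 12.0, 241)                  # synthetic wavenumber grid
chi = np.sin(4.0 * k) * np.exp(-0.1 * k)         # synthetic EXAFS-like oscillation

kweight = 2
pca_range = [2.0, 10.0]

# Weight the signal by k^kweight, then keep only points inside pca_range.
chi_kw = chi * k**kweight
mask = (k >= pca_range[0]) & (k <= pca_range[1])
k_sel, chi_sel = k[mask], chi_kw[mask]

print(k_sel.min(), k_sel.max(), chi_sel.shape)

# With the default range [-inf, inf] every point is retained.
full = (k >= -np.inf) & (k <= np.inf)
print(full.all())
```

The k-weighting amplifies the high-k part of the signal, which otherwise decays quickly and would contribute little to the variance.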

Return type

PCAModel

Returns

PCA model with the following attributes:

  • components : array with principal components.

  • variance : array with explained variance of each component.

  • groupnames : list with names of the groups included in the PCA.

  • energy : array with energy values. Returned only if pca_region='xanes' or pca_region='dxanes'.

  • k : array with wavenumber values. Returned only if pca_region='exafs'.

  • matrix : array with centered values for groups in pca_range.

  • pca_pars : dictionary with PCA parameters.

See also

PCAModel

Class to store results from principal component analysis.

fig_pca()

Plots the results of principal component analysis.

Important

  • Group datasets in collection will be centered before performing PCA.

  • Either ncomps or cumvar can be set to reduce the dimensionality of the dataset. If ncomps is provided, it takes precedence over cumvar.

Example

>>> from araucaria.testdata import get_testpath
>>> from araucaria.io import read_collection_hdf5
>>> from araucaria.xas import pre_edge
>>> from araucaria.stats import PCAModel, pca
>>> from araucaria.utils import check_objattrs
>>> fpath      = get_testpath('Fe_database.h5')
>>> collection = read_collection_hdf5(fpath)
>>> collection.apply(pre_edge)
>>> out        = pca(collection, pca_region='xanes')
>>> attrs      = ['energy', 'matrix', 'components', 'variance', 'groupnames', 'pca_pars']
>>> check_objattrs(out, PCAModel, attrs)
[True, True, True, True, True, True]
araucaria.stats.pca.target_transform(model, collection, taglist=['all'])

Performs target transformation on a collection.

Parameters
  • model (PCAModel) – PCA model to perform the projection and inverse transformation.

  • collection (Collection) – Collection with the groups for target transformation.

  • taglist (List[str]) – List with keys to filter groups based on their tags attributes in the Collection. The default is ['all'].

Return type

Dataset

Returns

Dataset with the following attributes:

  • groupnames : list with names of transformed groups.

  • energy : array with energy values. Returned only if pca_region='xanes' or pca_region='dxanes'.

  • k : array with wavenumber values. Returned only if pca_region='exafs'.

  • matrix : original array with mapped values.

  • tmatrix : array with target transformed groups.

  • scores : array with scores in the principal component basis.

  • chi2 : \(\chi^2\) values of the target transformed groups.
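
The relation between matrix, scores, tmatrix and chi2 can be sketched with NumPy. Two assumptions are made here and may not match the function: each group occupies one column of the matrix, and chi2 is taken as the plain sum of squared residuals per group. The orthonormal basis is a random stand-in for the retained left-singular vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
npts, ngroups, L = 100, 6, 2      # arbitrary sizes

# Stand-in for U_L: a matrix with orthonormal columns.
basis, _ = np.linalg.qr(rng.normal(size=(npts, L)))

matrix = rng.normal(size=(npts, ngroups))   # one group per column (assumption)
scores = basis.T @ matrix                   # coordinates in the component basis
tmatrix = basis @ scores                    # target transformed groups

# Assumed misfit measure: sum of squared residuals for each group.
chi2 = np.sum((matrix - tmatrix) ** 2, axis=0)
print(scores.shape, tmatrix.shape, chi2.shape)
```

Under this definition, a group that lies entirely in the span of the retained components has a chi2 of zero, and larger values indicate a poorer description by the model.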

Raises
  • TypeError – If model is not a valid PCAModel instance.

  • KeyError – If attributes from pca() do not exist in model.

See also

pca()

Performs principal component analysis on a collection.

fig_target_transform()

Plots the results of target transformation.

Example

>>> from araucaria.testdata import get_testpath
>>> from araucaria import Dataset
>>> from araucaria.io import read_collection_hdf5
>>> from araucaria.xas import pre_edge
>>> from araucaria.stats import pca, target_transform
>>> from araucaria.utils import check_objattrs
>>> fpath      = get_testpath('Fe_database.h5')
>>> collection = read_collection_hdf5(fpath)
>>> collection.apply(pre_edge)
>>> model      = pca(collection, pca_region='xanes', cumvar=0.9)
>>> data       = target_transform(model, collection)
>>> attrs      = ['groupnames', 'tmatrix', 'chi2', 'scores', 'energy']
>>> check_objattrs(data, Dataset, attrs)
[True, True, True, True, True]