19 Population Principal Components Analysis

Suppose we have m random variables X_1, X_2, \ldots, X_m. We wish to identify a set of weights w_1, w_2, \ldots, w_m that maximizes

{\operatorname{Var}}\left(w_1 X_1 + w_2 X_2 + \cdots + w_m X_m \right).

However, this quantity is unbounded: scaling the weights by a constant c scales the variance by c^2, so it can be made arbitrarily large. We therefore need to constrain the weights. It turns out that constraining them so that

\| {\boldsymbol{w}}\|_2^2 = \sum_{i=1}^m w_i^2 = 1

is both interpretable and mathematically tractable.

Therefore we wish to maximize

{\operatorname{Var}}\left(w_1 X_1 + w_2 X_2 + \cdots + w_m X_m \right)

subject to \| {\boldsymbol{w}}\|_2^2 = 1. Let {\boldsymbol{\Sigma}} be the m \times m population covariance matrix of the random variables X_1, X_2, \ldots, X_m. It follows that

{\operatorname{Var}}\left(w_1 X_1 + w_2 X_2 + \cdots + w_m X_m \right) = {\boldsymbol{w}}^T {\boldsymbol{\Sigma}}{\boldsymbol{w}}.
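As a quick numerical check of this identity, here is a minimal sketch in Python. The 3 \times 3 covariance matrix `Sigma` and the weight vector `w` below are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population covariance matrix for three variables (illustrative only)
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

# An arbitrary weight vector, rescaled so that ||w||_2^2 = 1
w = np.array([1.0, 1.0, 1.0])
w = w / np.linalg.norm(w)

# Theoretical variance of w_1 X_1 + ... + w_m X_m
print(w @ Sigma @ w)

# Monte Carlo check: simulate X ~ N(0, Sigma) and take the sample variance of w^T X
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
print((X @ w).var())
```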

Using a Lagrange multiplier, we wish to maximize

{\boldsymbol{w}}^T {\boldsymbol{\Sigma}}{\boldsymbol{w}}- \lambda({\boldsymbol{w}}^T {\boldsymbol{w}}- 1).

Differentiating with respect to {\boldsymbol{w}} and setting equal to {\boldsymbol{0}}, we get {\boldsymbol{\Sigma}}{\boldsymbol{w}}- \lambda {\boldsymbol{w}}= {\boldsymbol{0}}, or

{\boldsymbol{\Sigma}}{\boldsymbol{w}}= \lambda {\boldsymbol{w}}.

For any {\boldsymbol{w}} with \| {\boldsymbol{w}}\|_2^2 = 1 and \lambda satisfying this equation, note that

{\operatorname{Var}}\left(w_1 X_1 + w_2 X_2 + \cdots + w_m X_m \right) = {\boldsymbol{w}}^T {\boldsymbol{\Sigma}}{\boldsymbol{w}}= \lambda {\boldsymbol{w}}^T {\boldsymbol{w}}= \lambda,

so the variance of the linear combination equals the eigenvalue \lambda.
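As a hedged numerical sketch of this conclusion, the code below reuses the illustrative `Sigma` from above: the largest value of {\boldsymbol{w}}^T {\boldsymbol{\Sigma}}{\boldsymbol{w}} found over many random unit-norm weight vectors stays at or below the largest eigenvalue of {\boldsymbol{\Sigma}} (computed with `np.linalg.eigvalsh`, anticipating the eigendecomposition introduced next).

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative covariance matrix as in the previous sketch
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

# Largest eigenvalue of Sigma (eigvalsh returns eigenvalues in ascending order)
lam_max = np.linalg.eigvalsh(Sigma)[-1]

# Random search over unit-norm weight vectors
ws = rng.normal(size=(100_000, 3))
ws /= np.linalg.norm(ws, axis=1, keepdims=True)
variances = np.einsum("ij,jk,ik->i", ws, Sigma, ws)   # w^T Sigma w for each row w

print(lam_max)           # the constrained maximum
print(variances.max())   # best random unit vector: close to, but not above, lam_max
```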

The eigendecomposition of a matrix identifies all such solutions to {\boldsymbol{\Sigma}}{\boldsymbol{w}}= \lambda {\boldsymbol{w}}. Specifically, it calculates the decomposition

{\boldsymbol{\Sigma}}= {\boldsymbol{W}}{\boldsymbol{\Lambda}}{\boldsymbol{W}}^T

where {\boldsymbol{W}} is an m \times m orthogonal matrix and {\boldsymbol{\Lambda}} is a diagonal matrix with entries \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0. (The eigenvalues are nonnegative because a covariance matrix is positive semidefinite.)

The fact that {\boldsymbol{W}} is orthogonal means {\boldsymbol{W}}{\boldsymbol{W}}^T = {\boldsymbol{W}}^T {\boldsymbol{W}}= {\boldsymbol{I}}.

The following therefore hold:

  • For each column j of {\boldsymbol{W}}, say {\boldsymbol{w}}_j, it follows that {\boldsymbol{\Sigma}}{\boldsymbol{w}}_j = \lambda_j {\boldsymbol{w}}_j
  • \| {\boldsymbol{w}}_j \|^2_2 = 1 and {\boldsymbol{w}}_j^T {\boldsymbol{w}}_k = 0 for j \not= k
  • {\operatorname{Var}}({\boldsymbol{w}}_j^T {\boldsymbol{X}}) = \lambda_j, where {\boldsymbol{X}}= (X_1, X_2, \ldots, X_m)^T
  • {\operatorname{Var}}({\boldsymbol{w}}_1^T {\boldsymbol{X}}) \geq {\operatorname{Var}}({\boldsymbol{w}}_2^T {\boldsymbol{X}}) \geq \cdots \geq {\operatorname{Var}}({\boldsymbol{w}}_m^T {\boldsymbol{X}})
  • {\boldsymbol{\Sigma}}= \sum_{j=1}^m \lambda_j {\boldsymbol{w}}_j {\boldsymbol{w}}_j^T
  • For j \not= k, {\operatorname{Cov}}({\boldsymbol{w}}_j^T {\boldsymbol{X}}, {\boldsymbol{w}}_k^T {\boldsymbol{X}}) = {\boldsymbol{w}}_j^T {\boldsymbol{\Sigma}}{\boldsymbol{w}}_k = \lambda_k {\boldsymbol{w}}_j^T {\boldsymbol{w}}_k = 0
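These properties can be verified numerically. The sketch below again uses the illustrative `Sigma` from earlier; because `np.linalg.eigh` returns eigenvalues in ascending order, the columns are reordered to match the convention \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m.

```python
import numpy as np

# Same illustrative covariance matrix as before
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

# Eigendecomposition of the symmetric matrix Sigma; reorder so eigenvalues descend
evals, W = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]
lam, W = evals[order], W[:, order]
Lam = np.diag(lam)

# W is orthogonal: W W^T = W^T W = I
print(np.allclose(W @ W.T, np.eye(3)))

# Sigma w_j = lambda_j w_j for each column j
print(np.allclose(Sigma @ W, W @ Lam))

# Sigma = W Lam W^T = sum_j lambda_j w_j w_j^T
print(np.allclose(Sigma, W @ Lam @ W.T))

# Variances of the PCs are the eigenvalues, and distinct PCs are uncorrelated:
# Cov(W^T X) = W^T Sigma W = Lam, a diagonal matrix
print(np.allclose(W.T @ Sigma @ W, Lam))
```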

The jth population principal component (PC) of X_1, X_2, \ldots, X_m is

{\boldsymbol{w}}_j^T {\boldsymbol{X}}= w_{1j} X_1 + w_{2j} X_2 + \cdots + w_{mj} X_m

where {\boldsymbol{w}}_j = (w_{1j}, w_{2j}, \ldots, w_{mj})^T is column j of {\boldsymbol{W}} from the eigendecomposition

{\boldsymbol{\Sigma}}= {\boldsymbol{W}}{\boldsymbol{\Lambda}}{\boldsymbol{W}}^T.

The entries of the column {\boldsymbol{w}}_j are called the loadings of the jth principal component. The variance explained by the jth PC is \lambda_j, which is diagonal element j of {\boldsymbol{\Lambda}}.
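To make this concrete, the following sketch simulates draws of {\boldsymbol{X}} from the illustrative `Sigma` used above, forms the population PCs {\boldsymbol{w}}_j^T {\boldsymbol{X}} from the columns of {\boldsymbol{W}}, and compares their sample variances to the \lambda_j along with the proportion of total variance each PC explains. The Gaussian simulation is an assumption for illustration only; the PC definition itself does not require normality.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same illustrative covariance matrix as before
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

# Loadings W and eigenvalues, ordered so that lambda_1 >= lambda_2 >= lambda_3
evals, W = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]
lam, W = evals[order], W[:, order]

# Simulate draws of X ~ N(0, Sigma) and form the PCs w_j^T X (one column per PC)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
pcs = X @ W

# Sample variances of the PCs are close to the eigenvalues lambda_j
print(pcs.var(axis=0))
print(lam)

# Proportion of total variance explained by each PC
print(lam / lam.sum())
```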