77 HD Latent Variable Models

77.1 Definition

Latent variables (or hidden variables) are random variables that are part of the underlying probabilistic model of the data but are not observed.

In high-dimensional (HD) data, a latent variable may affect many of the observed variables simultaneously and thereby induce systematic variation across them. A topic of much interest is how to estimate such latent variables and incorporate them into further HD inference procedures.

77.2 Model

Suppose we have observed data \({\boldsymbol{Y}}_{m \times n}\) of \(m\) variables with \(n\) observations each. Suppose there are \(r\) latent variables contained in the \(r\) rows of \({\boldsymbol{Z}}_{r \times n}\) where

\[ {\operatorname{E}}\left[{\boldsymbol{Y}}_{m \times n} \,\middle|\, {\boldsymbol{Z}}_{r \times n} \right] = {\boldsymbol{\Phi}}_{m \times r} {\boldsymbol{Z}}_{r \times n}. \]

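We also assume that \(m \gg n > r\). Writing \({\boldsymbol{y}}_i\) for the \(i\)th row of \({\boldsymbol{Y}}\) and \({\boldsymbol{\phi}}_i\) for the \(i\)th row of \({\boldsymbol{\Phi}}\), the latent variables \({\boldsymbol{Z}}\) induce systematic variation in variable \({\boldsymbol{y}}_i\), parameterized by \({\boldsymbol{\phi}}_i\), for \(i = 1, 2, \ldots, m\).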
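To make the dimensions concrete, here is a minimal simulation sketch of this model. The Gaussian draws for \({\boldsymbol{Z}}\), \({\boldsymbol{\Phi}}\), and the noise, the default sizes, and the function name `simulate_latent_model` are all illustrative assumptions; the model above only specifies the conditional mean.

```python
import numpy as np

def simulate_latent_model(m=2000, n=20, r=3, noise_sd=1.0, seed=0):
    """Draw Y = Phi Z + E with independent Gaussian noise (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(r, n))        # r latent variables, one per row
    Phi = rng.normal(size=(m, r))      # row i holds phi_i, the coefficients of variable i
    E = rng.normal(scale=noise_sd, size=(m, n))  # mean-zero noise
    Y = Phi @ Z + E                    # hence E[Y | Z] = Phi Z
    return Y, Phi, Z

Y, Phi, Z = simulate_latent_model()
print(Y.shape, Phi.shape, Z.shape)
```

The defaults respect \(m \gg n > r\): there are many more variables (rows of \({\boldsymbol{Y}}\)) than observations (columns), and only a few latent variables.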

77.3 Estimation

Methods exist that recover the row space of \({\boldsymbol{Z}}\) with probability 1 as \(m \rightarrow \infty\) for a fixed \(n\), in the following two scenarios.
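A short calculation suggests why the row space of \({\boldsymbol{Z}}\), rather than \({\boldsymbol{Z}}\) itself, is the natural target. The matrix \({\boldsymbol{A}}\) below is notation introduced here for illustration: take any invertible \(r \times r\) matrix \({\boldsymbol{A}}\). Then

\[ {\boldsymbol{\Phi}} {\boldsymbol{Z}} = \left( {\boldsymbol{\Phi}} {\boldsymbol{A}}^{-1} \right) \left( {\boldsymbol{A}} {\boldsymbol{Z}} \right), \]

so the pairs \(({\boldsymbol{\Phi}}, {\boldsymbol{Z}})\) and \(({\boldsymbol{\Phi}} {\boldsymbol{A}}^{-1}, {\boldsymbol{A}} {\boldsymbol{Z}})\) give the same conditional mean of \({\boldsymbol{Y}}\). The conditional mean therefore determines \({\boldsymbol{Z}}\) at most up to such a transformation. Since the rows of \({\boldsymbol{A}} {\boldsymbol{Z}}\) span the same subspace of \(\mathbb{R}^n\) as the rows of \({\boldsymbol{Z}}\), the row space is unaffected by the choice of \({\boldsymbol{A}}\).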

Leek (2011) shows how to do this when \({\boldsymbol{y}}_i | {\boldsymbol{Z}}\sim \text{MVN}({\boldsymbol{\phi}}_i {\boldsymbol{Z}}, \sigma^2_i {\boldsymbol{I}})\), and the \({\boldsymbol{y}}_i | {\boldsymbol{Z}}\) are jointly independent.
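The following sketch illustrates the idea in the Gaussian scenario using a plain truncated SVD. It is not necessarily the exact estimator of Leek (2011); the function names `estimate_row_space` and `principal_cosines`, the simulated data, and the assumption that \(r\) is known are all illustrative. The heuristic is that, under this model, \(m^{-1} {\boldsymbol{Y}}^T {\boldsymbol{Y}}\) is approximately \(m^{-1} {\boldsymbol{Z}}^T {\boldsymbol{\Phi}}^T {\boldsymbol{\Phi}} {\boldsymbol{Z}}\) plus a multiple of the identity when \(m\) is large, so its top \(r\) eigenvectors should approximately span the row space of \({\boldsymbol{Z}}\).

```python
import numpy as np

def estimate_row_space(Y, r):
    """Return an n x r orthonormal basis for the estimated row space of Z."""
    # The right singular vectors of Y are the eigenvectors of Y^T Y.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:r].T

def principal_cosines(Z, basis):
    """Cosines of the principal angles between the row space of Z and span(basis)."""
    Q, _ = np.linalg.qr(Z.T)           # orthonormal basis of the true row space
    return np.linalg.svd(Q.T @ basis, compute_uv=False)

# Simulated data (assumed setup): one noise standard deviation per variable.
rng = np.random.default_rng(0)
m, n, r = 5000, 20, 3
Z = rng.normal(size=(r, n))
Phi = rng.normal(size=(m, r))
sigma = rng.uniform(0.5, 2.0, size=(m, 1))
Y = Phi @ Z + sigma * rng.normal(size=(m, n))

cosines = principal_cosines(Z, estimate_row_space(Y, r))
print(cosines)   # values near 1 mean the two subspaces nearly coincide
```

The comparison uses principal angles rather than entrywise differences because only the subspace, not any particular basis for it, is being estimated.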

Chen and Storey (2015) show how to do this when the \({\boldsymbol{y}}_i | {\boldsymbol{Z}}\) are distributed according to a single-parameter exponential family distribution with mean \({\boldsymbol{\phi}}_i {\boldsymbol{Z}}\), and the \({\boldsymbol{y}}_i | {\boldsymbol{Z}}\) are jointly independent.