# 77 HD Latent Variable Models

## 77.1 Definition

Latent variables (or hidden variables) are random variables that are part of the underlying probabilistic model of the data but are unobserved.

In high-dimensional (HD) data, latent variables may affect many of the observed variables simultaneously, inducing systematic variation across them. A topic of much interest is how to estimate these latent variables and incorporate them into further HD inference procedures.

## 77.2 Model

Suppose we have observed data $${\boldsymbol{Y}}_{m \times n}$$ of $$m$$ variables with $$n$$ observations each. Suppose there are $$r$$ latent variables contained in the $$r$$ rows of $${\boldsymbol{Z}}_{r \times n}$$ where

${\operatorname{E}}\left[{\boldsymbol{Y}}_{m \times n} \left. \right| {\boldsymbol{Z}}_{r \times n} \right] = {\boldsymbol{\Phi}}_{m \times r} {\boldsymbol{Z}}_{r \times n}.$

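We also assume that $$m \gg n > r$$. Let $${\boldsymbol{y}}_i$$ denote row $$i$$ of $${\boldsymbol{Y}}$$ (the $$n$$ observations of variable $$i$$) and $${\boldsymbol{\phi}}_i$$ denote row $$i$$ of $${\boldsymbol{\Phi}}$$. The latent variables $${\boldsymbol{Z}}$$ induce systematic variation in variable $${\boldsymbol{y}}_i$$, parameterized by $${\boldsymbol{\phi}}_i$$, for $$i = 1, 2, \ldots, m$$.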
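
To make the dimensions concrete, here is a minimal Python sketch that simulates data from this model. The specific sizes, the standard normal draws for $${\boldsymbol{\Phi}}$$ and $${\boldsymbol{Z}}$$, and the additive unit-variance Gaussian noise are assumptions chosen only for illustration; the model above specifies just the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 2000, 20, 2                  # m >> n > r (illustrative sizes)

Phi = rng.normal(size=(m, r))          # row i is phi_i
Z = rng.normal(size=(r, n))            # one latent variable per row
cond_mean = Phi @ Z                    # E[Y | Z], shape (m, n)

# Assumed noise model: independent N(0, 1) errors around the conditional mean.
Y = cond_mean + rng.normal(size=(m, n))

print(Y.shape, np.linalg.matrix_rank(cond_mean))
```

The conditional mean has rank $$r$$ even though $${\boldsymbol{Y}}$$ itself is full rank, which is the low-dimensional structure that estimation tries to recover.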

## 77.3 Estimation

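Because $${\boldsymbol{\Phi}} {\boldsymbol{Z}} = ({\boldsymbol{\Phi}} {\boldsymbol{A}}^{-1}) ({\boldsymbol{A}} {\boldsymbol{Z}})$$ for any invertible $$r \times r$$ matrix $${\boldsymbol{A}}$$, the latent variables are identifiable only up to such a transformation, so the natural estimation target is the row space of $${\boldsymbol{Z}}$$. In two scenarios, there exist methods whose estimate of this row space converges to the truth with probability 1 as $$m \rightarrow \infty$$ for a fixed $$n$$.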

Leek (2011) shows how to do this when $${\boldsymbol{y}}_i | {\boldsymbol{Z}}\sim \text{MVN}({\boldsymbol{\phi}}_i {\boldsymbol{Z}}, \sigma^2_i {\boldsymbol{I}})$$, and the $${\boldsymbol{y}}_i | {\boldsymbol{Z}}$$ are jointly independent.
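
As a rough illustration of the Gaussian scenario, the following Python sketch simulates data from the model and takes the top $$r$$ right singular vectors of $${\boldsymbol{Y}}$$ as a basis for the estimated row space. It is a simplified stand-in, not the exact procedure of Leek (2011). The dimensions, the range of the noise scales $$\sigma_i$$, the seed, and the helper `principal_cosines` are all assumptions made for the example, and $$r$$ is treated as known.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 5000, 20, 2                          # illustrative sizes, m >> n > r

Phi = rng.normal(size=(m, r))
Z = rng.normal(size=(r, n))
sigma = rng.uniform(0.5, 2.0, size=(m, 1))     # assumed row-specific sd sigma_i
Y = Phi @ Z + sigma * rng.normal(size=(m, n))  # y_i | Z ~ MVN(phi_i Z, sigma_i^2 I)

# Top-r right singular vectors of Y: an orthonormal basis of the estimate.
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
Z_hat = Vt[:r]                                 # shape (r, n)

def principal_cosines(A, B):
    """Cosines of the principal angles between the row spaces of A and B."""
    Qa, _ = np.linalg.qr(A.T)
    Qb, _ = np.linalg.qr(B.T)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

cosines = principal_cosines(Z, Z_hat)
print(cosines)
```

Cosines close to 1 indicate that the estimated and true row spaces nearly coincide. Only the span is compared; the individual rows of $${\boldsymbol{Z}}$$ are not recovered.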

Chen and Storey (2015) show how to do this when the $${\boldsymbol{y}}_i | {\boldsymbol{Z}}$$ are distributed according to a single-parameter exponential family distribution with mean $${\boldsymbol{\phi}}_i {\boldsymbol{Z}}$$, and the $${\boldsymbol{y}}_i | {\boldsymbol{Z}}$$ are jointly independent.
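
Outside the Gaussian setting the noise variance generally depends on the mean, so the diagonal of $${\boldsymbol{Y}}^T {\boldsymbol{Y}} / m$$ is inflated by a different amount in each column. The Python sketch below illustrates one way to adjust for this in the Poisson case, where the variance equals the mean: subtract the column means from the diagonal before taking the top $$r$$ eigenvectors. It is an illustration under an assumed Poisson model with made-up loadings and latent variables, not a reproduction of the estimator in Chen and Storey (2015), and $$r$$ is treated as known.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 10000, 20, 2                     # illustrative sizes, m >> n > r

# Assumed latent variables: an intercept row and a +/-1 group contrast.
Z = np.vstack([np.ones(n), np.repeat([-1.0, 1.0], n // 2)])

# Assumed loadings, chosen so every Poisson mean phi_i Z is at least 1.
Phi = np.column_stack([rng.uniform(5, 15, size=m), rng.uniform(-4, 4, size=m)])
Y = rng.poisson(Phi @ Z).astype(float)

# For Poisson data Var(y_ij) = E[y_ij], so E[y_ij^2] = mean^2 + mean.
# Subtracting each column's average removes the extra diagonal term.
G = Y.T @ Y / m - np.diag(Y.mean(axis=0))

# Top-r eigenvectors of the adjusted matrix span the estimated row space.
eigvals, eigvecs = np.linalg.eigh(G)       # eigenvalues in ascending order
Z_hat = eigvecs[:, -r:].T                  # shape (r, n), orthonormal rows

# Fraction of Z's squared norm captured by projecting onto the estimate.
Z_proj = Z @ Z_hat.T @ Z_hat
captured = np.linalg.norm(Z_proj) ** 2 / np.linalg.norm(Z) ** 2
print(captured)
```

A captured fraction near 1 means the rows of $${\boldsymbol{Z}}$$ lie almost entirely inside the estimated row space. After the adjustment, the expectation of `G` is $${\boldsymbol{Z}}^T {\boldsymbol{\Phi}}^T {\boldsymbol{\Phi}} {\boldsymbol{Z}} / m$$, whose nonzero eigenvectors lie in the row space of $${\boldsymbol{Z}}$$.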