79 Surrogate Variable Analysis

The surrogate variable analysis (SVA) model combines the many responses model with the latent variable model introduced above:

\[ {\boldsymbol{Y}}_{m \times n} = {\boldsymbol{B}}_{m \times d} {\boldsymbol{X}}_{d \times n} + {\boldsymbol{\Phi}}_{m \times r} {\boldsymbol{Z}}_{r \times n} + {\boldsymbol{E}}_{m \times n} \]

where \(m \gg n > d + r\).

Here, only \({\boldsymbol{Y}}\) and \({\boldsymbol{X}}\) are observed, so we must combine many regressors model fitting techniques with latent variable estimation.

The variables \({\boldsymbol{Z}}\) are called surrogate variables for what would be a complete model of all systematic variation.

79.1 Procedure

The main challenge is that the row spaces of \({\boldsymbol{X}}\) and \({\boldsymbol{Z}}\) may overlap. Even when \({\boldsymbol{X}}\) is the result of a randomized experiment, there will be a high probability that the row spaces of \({\boldsymbol{X}}\) and \({\boldsymbol{Z}}\) have some overlap.

Therefore, one cannot simply estimate \({\boldsymbol{Z}}\) by applying a latent variable esitmation method on the residuals \({\boldsymbol{Y}}- \hat{{\boldsymbol{B}}} {\boldsymbol{X}}\) or on the observed response data \({\boldsymbol{Y}}\). In the former case, we will only estimate \({\boldsymbol{Z}}\) in the space orthogonal to \(\hat{{\boldsymbol{B}}} {\boldsymbol{X}}\). In the latter case, the estimate of \({\boldsymbol{Z}}\) may modify the signal we can estimate in \({\boldsymbol{B}}{\boldsymbol{X}}\).

A recent method, takes an EM approach to esitmating \({\boldsymbol{Z}}\) in the model

\[ {\boldsymbol{Y}}_{m \times n} = {\boldsymbol{B}}_{m \times d} {\boldsymbol{X}}_{d \times n} + {\boldsymbol{\Phi}}_{m \times r} {\boldsymbol{Z}}_{r \times n} + {\boldsymbol{E}}_{m \times n}. \]

It is shown to be necessary to penalize the likelihood in the estimation of \({\boldsymbol{B}}\) — i.e., form shrinkage estimates of \({\boldsymbol{B}}\) — in order to properly balance the row spaces of \({\boldsymbol{X}}\) and \({\boldsymbol{Z}}\).

The regularized EM algorithm, called cross-dimensonal inference (CDI) iterates between

  1. Estimate \({\boldsymbol{Z}}\) from \({\boldsymbol{Y}}- \hat{{\boldsymbol{B}}}^{\text{Reg}} {\boldsymbol{X}}\)
  2. Estimate \({\boldsymbol{B}}\) from \({\boldsymbol{Y}}- \hat{{\boldsymbol{\Phi}}} \hat{{\boldsymbol{Z}}}\)

where \(\hat{{\boldsymbol{B}}}^{\text{Reg}}\) is a regularized or shrunken estimate of \({\boldsymbol{B}}\).

It can be shown that when the regularization can be represented by a prior distribution on \({\boldsymbol{B}}\) then this algorithm achieves the MAP.

79.2 Example: Kidney Expr by Age

In Storey et al. (2005), we considered a study where kidney samples were obtained on individuals across a range of ages. The goal was to identify genes with expression associated with age.

> library(edge)
> library(splines)
> load("./data/kidney.RData")
> age <- kidcov$age
> sex <- kidcov$sex
> dim(kidexpr)
[1] 34061    72
> cov <- data.frame(sex = sex, age = age)
> null_model <- ~sex
> full_model <- ~sex + ns(age, df = 3)
> de_obj <- build_models(data = kidexpr, cov = cov, 
+                        null.model = null_model, 
+                        full.model = full_model)
> de_lrt <- lrt(de_obj, nullDistn = "bootstrap", bs.its = 100, verbose=FALSE)
> qobj1 <- qvalueObj(de_lrt)
> hist(qobj1)

Now that we have completed a standard generalized LRT, let’s estimate \({\boldsymbol{Z}}\) (the surrogate variables) using the sva package as accessed via the edge package.

> dim(nullMatrix(de_obj))
[1] 72  2
> de_sva <- apply_sva(de_obj, n.sv=4, method="irw", B=10)
Number of significant surrogate variables is:  4 
Iteration (out of 10 ):1  2  3  4  5  6  7  8  9  10  
> dim(nullMatrix(de_sva))
[1] 72  6
> de_svalrt <- lrt(de_sva, nullDistn = "bootstrap", bs.its = 100, verbose=FALSE)
> qobj2 <- qvalueObj(de_svalrt)
> hist(qobj2)

> summary(qobj1)

Call:
qvalue(p = pval)

pi0:    0.8081212   

Cumulative number of significant calls:

          <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1    <1
p-value       27    161   798   1676  2906 5271 34061
q-value        0      0     2      4    10   27 34061
local FDR      0      0     2      2     5   18 34061
> summary(qobj2)

Call:
qvalue(p = pval)

pi0:    0.6925105   

Cumulative number of significant calls:

          <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1    <1
p-value       28    151  1001   2051  3549 6168 34061
q-value        0      0     3      4     6   51 34061
local FDR      0      0     2      2     3   28 34053

P-values from two analyses are fairly different.

> data.frame(lrt=-log10(qobj1$pval), sva=-log10(qobj2$pval)) %>% 
+   ggplot() + geom_point(aes(x=lrt, y=sva), alpha=0.3) + geom_abline()