66 Generalized Linear Models

66.1 Definition

The generalized linear model (GLM) builds from OLS and GLS to allow for the case where \(Y | {\boldsymbol{X}}\) is distributed according to an exponential family distribution. The estimated model is

\[ g\left({\operatorname{E}}[Y | {\boldsymbol{X}}]\right) = {\boldsymbol{X}}{\boldsymbol{\beta}}\]

where \(g(\cdot)\) is called the link function. This model is typically fit by numerical methods to calculate the maximum likelihood estimate of \({\boldsymbol{\beta}}\).
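For example, a Poisson GLM with the log link \(g(\mu) = \log \mu\) models \(\log {\operatorname{E}}[Y | {\boldsymbol{X}}] = {\boldsymbol{X}}{\boldsymbol{\beta}}\). A minimal sketch in R, on simulated data (an illustration, not from the notes):

> # fit a Poisson GLM with log link on simulated data
> set.seed(1)
> n <- 100
> x <- rnorm(n)
> y <- rpois(n, lambda = exp(0.5 + 0.8 * x))  # log E[Y|x] = 0.5 + 0.8 x
> fit <- glm(y ~ x, family = poisson(link = "log"))
> coef(fit)  # numerical MLE of beta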

66.2 Exponential Family Distributions

Recall that if \(Y\) follows an exponential family distribution (EFD), then it has a pdf of the form

\[f(y ; \boldsymbol{\theta}) = h(y) \exp \left\{ \sum_{k=1}^d \eta_k(\boldsymbol{\theta}) T_k(y) - A(\boldsymbol{\eta}) \right\} \]

where \(\boldsymbol{\theta}\) is a vector of parameters, \(\{T_k(y)\}\) are sufficient statistics, and \(A(\boldsymbol{\eta})\) is the cumulant generating function.

The functions \(\eta_k(\boldsymbol{\theta})\) for \(k=1, \ldots, d\) map the usual parameters \({\boldsymbol{\theta}}\) (often moments of the rv \(Y\)) to the natural parameters or canonical parameters.

\(\{T_k(y)\}\) are sufficient statistics for \(\{\eta_k\}\) due to the factorization theorem.

\(A(\boldsymbol{\eta})\) is sometimes called the log normalizer because

\[A(\boldsymbol{\eta}) = \log \int h(y) \exp \left\{ \sum_{k=1}^d \eta_k(\boldsymbol{\theta}) T_k(y) \right\} \, dy.\]

66.3 Natural Single Parameter EFD

A natural single parameter EFD is the special case where \(d=1\) and \(T(y) = y\):

\[f(y ; \eta) = h(y) \exp \left\{ \eta(\theta) y - A(\eta(\theta)) \right\} \]

where without loss of generality we can write \({\operatorname{E}}[Y] = \theta\).
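For example, the Bernoulli distribution is a natural single parameter EFD. Writing its pmf in exponential family form,

\[ f(y; p) = p^y (1-p)^{1-y} = \exp\left\{ y \log\frac{p}{1-p} + \log(1-p) \right\}, \]

we can read off \(\eta(p) = \log\frac{p}{1-p}\), \(A(\eta) = \log(1 + e^{\eta})\), and \(h(y) = 1\), with \(\theta = {\operatorname{E}}[Y] = p\).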

66.4 Dispersion EFDs

The family of distributions for which GLMs are most typically developed are dispersion EFDs. An example of a dispersion EFD that extends the natural single parameter EFD is

\[f(y ; \eta) = h(y, \phi) \exp \left\{ \frac{\eta(\theta) y - A(\eta(\theta))}{\phi} \right\} \]

where \(\phi\) is the dispersion parameter.

66.5 Example: Normal

Let \(Y \sim \mbox{Normal}(\mu, \sigma^2)\). Then:

\[ \theta = \mu, \eta(\mu) = \mu \]

\[ \phi = \sigma^2 \]

\[ A(\mu) = \frac{\mu^2}{2} \]

\[ h(y, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{y^2}{\sigma^2}} \]
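As a quick numerical check (not part of the original derivation), we can verify that this decomposition reproduces dnorm() at arbitrary values:

> # verify the Normal EFD decomposition numerically
> y <- 1.3; mu <- 0.7; sigma2 <- 2.5
> h <- exp(-y^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)
> efd <- h * exp((mu * y - mu^2 / 2) / sigma2)  # h(y, phi) exp{(eta y - A) / phi}
> all.equal(efd, dnorm(y, mean = mu, sd = sqrt(sigma2)))
[1] TRUE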

66.6 EFD for GLMs

There has been a very broad development of GLMs and their extensions. A common setting for introducing GLMs is the dispersion EFD with a general link function \(g(\cdot)\).

See the classic text Generalized Linear Models, by McCullagh and Nelder, for such a development.

66.7 Components of a GLM

  1. Random: The particular exponential family distribution. \[ Y \sim f(y ; \eta, \phi) \]

  2. Systematic: The determination of each \(\eta_i\) from \({\boldsymbol{X}}_i\) and \({\boldsymbol{\beta}}\). \[ \eta_i = {\boldsymbol{X}}_i {\boldsymbol{\beta}}\]

  3. Parametric Link: The connection between \(E[Y_i|{\boldsymbol{X}}_i]\) and \({\boldsymbol{X}}_i {\boldsymbol{\beta}}\). \[ g(E[Y_i|{\boldsymbol{X}}_i]) = {\boldsymbol{X}}_i {\boldsymbol{\beta}}\]
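For example, Bernoulli logistic regression has random component \(Y_i | {\boldsymbol{X}}_i \sim \mbox{Bernoulli}(p_i)\), systematic component \(\eta_i = {\boldsymbol{X}}_i {\boldsymbol{\beta}}\), and the (canonical) logit link

\[ g(p_i) = \log \frac{p_i}{1 - p_i} = {\boldsymbol{X}}_i {\boldsymbol{\beta}}, \]

since \({\operatorname{E}}[Y_i | {\boldsymbol{X}}_i] = p_i\).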

66.9 Calculating MLEs

Given the model \(g\left({\operatorname{E}}[Y | {\boldsymbol{X}}]\right) = {\boldsymbol{X}}{\boldsymbol{\beta}}\), the chosen EFD fully specifies the likelihood. The Newton-Raphson method or Fisher’s scoring method can then be used to find the MLE of \({\boldsymbol{\beta}}\).

66.10 Newton-Raphson

  1. Initialize \({\boldsymbol{\beta}}^{(0)}\). For \(t = 0, 1, 2, \ldots\)

  2. Calculate the score \(s({\boldsymbol{\beta}}^{(t)}) = \nabla \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}}\) and observed Fisher information \[H({\boldsymbol{\beta}}^{(t)}) = - \nabla \nabla^T \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}}\]. Note that the observed Fisher information is also the negative Hessian matrix.

  3. Update \({\boldsymbol{\beta}}^{(t+1)} = {\boldsymbol{\beta}}^{(t)} + H({\boldsymbol{\beta}}^{(t)})^{-1} s({\boldsymbol{\beta}}^{(t)})\).

  4. Iterate until convergence, and set \(\hat{{\boldsymbol{\beta}}} = {\boldsymbol{\beta}}^{(\infty)}\).
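To make the steps concrete, here is a minimal Newton-Raphson sketch for Poisson regression with the canonical log link, on simulated data (an illustration only). In this case the score is \({\boldsymbol{X}}^T({\boldsymbol{Y}}- {\boldsymbol{\mu}})\) and the observed information is \({\boldsymbol{X}}^T \operatorname{diag}({\boldsymbol{\mu}}) {\boldsymbol{X}}\), where \(\mu_i = e^{{\boldsymbol{X}}_i {\boldsymbol{\beta}}}\):

> # Newton-Raphson for Poisson regression with log link (a sketch)
> set.seed(1)
> n <- 500
> X <- cbind(1, rnorm(n))                  # design matrix with intercept
> y <- rpois(n, lambda = exp(X %*% c(1, 0.5)))
> beta <- rep(0, ncol(X))                  # initialize beta^(0)
> for (t in 1:25) {
+   mu <- as.vector(exp(X %*% beta))       # E[Y | X; beta^(t)]
+   score <- t(X) %*% (y - mu)             # score s(beta^(t))
+   H <- t(X) %*% (mu * X)                 # observed information X^T W X
+   beta <- beta + solve(H, score)         # Newton-Raphson update
+ }
> cbind(beta, coef(glm(y ~ X[, 2], family = poisson)))  # compare to glm()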

66.11 Fisher’s scoring

  1. Initialize \({\boldsymbol{\beta}}^{(0)}\). For \(t = 0, 1, 2, \ldots\)

  2. Calculate the score \(s({\boldsymbol{\beta}}^{(t)}) = \nabla \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}}\) and expected Fisher information \[I({\boldsymbol{\beta}}^{(t)}) = - {\operatorname{E}}\left[\nabla \nabla^T \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}} \right].\]

  3. Update \({\boldsymbol{\beta}}^{(t+1)} = {\boldsymbol{\beta}}^{(t)} + I({\boldsymbol{\beta}}^{(t)})^{-1} s({\boldsymbol{\beta}}^{(t)})\).

  4. Iterate until convergence, and set \(\hat{{\boldsymbol{\beta}}} = {\boldsymbol{\beta}}^{(\infty)}\).

When the canonical link function is used, the Newton-Raphson algorithm and Fisher’s scoring algorithm are equivalent.

Exercise: Prove this.

66.12 Iteratively Reweighted Least Squares

For the canonical link, Fisher’s scoring method can be written as an iteratively reweighted least squares (IRLS) algorithm, as shown earlier for logistic regression. Note that the Fisher information is

\[ I({\boldsymbol{\beta}}^{(t)}) = {\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}}\]

where \({\boldsymbol{W}}\) is an \(n \times n\) diagonal matrix with \((i, i)\) entry equal to \({\operatorname{Var}}(Y_i | {\boldsymbol{X}}_i; {\boldsymbol{\beta}}^{(t)})\).

The score function is

\[ s({\boldsymbol{\beta}}^{(t)}) = {\boldsymbol{X}}^T \left( {\boldsymbol{Y}}- {\boldsymbol{\mu}}^{(t)} \right) \]

where \({\boldsymbol{\mu}}^{(t)} = {\operatorname{E}}[{\boldsymbol{Y}}| {\boldsymbol{X}}; {\boldsymbol{\beta}}^{(t)}]\), i.e., \(\mu^{(t)}_i = g^{-1}({\boldsymbol{X}}_i {\boldsymbol{\beta}}^{(t)})\),

and the current coefficient value \({\boldsymbol{\beta}}^{(t)}\) can be written as

\[ {\boldsymbol{\beta}}^{(t)} = ({\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}}{\boldsymbol{\beta}}^{(t)}. \]

Putting this together we get

\[ {\boldsymbol{\beta}}^{(t)} + I({\boldsymbol{\beta}}^{(t)})^{-1} s({\boldsymbol{\beta}}^{(t)}) = ({\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{z}}^{(t)} \]

where

\[ {\boldsymbol{z}}^{(t)} = {\boldsymbol{X}}{\boldsymbol{\beta}}^{(t)} + {\boldsymbol{W}}^{-1} \left( {\boldsymbol{Y}}- {\boldsymbol{\mu}}^{(t)} \right). \]

This is a generalization of the iteratively reweighted least squares algorithm we showed earlier for logistic regression.
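Here is a minimal IRLS sketch for logistic regression on simulated data (an illustration only), forming the working response \({\boldsymbol{z}}^{(t)}\) and solving a weighted least squares problem at each step:

> # IRLS for logistic regression (canonical logit link), a sketch
> set.seed(2)
> n <- 500
> X <- cbind(1, rnorm(n))
> y <- rbinom(n, size = 1, prob = plogis(X %*% c(-1, 2)))
> beta <- rep(0, ncol(X))
> for (t in 1:25) {
+   mu <- as.vector(plogis(X %*% beta))    # fitted probabilities
+   w <- mu * (1 - mu)                     # diagonal of W: Var(Y_i | X_i)
+   z <- X %*% beta + (y - mu) / w         # working response z^(t)
+   beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # weighted LS step
+ }
> cbind(beta, coef(glm(y ~ X[, 2], family = binomial)))  # agrees with glm()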

66.13 Estimating Dispersion

For the simple dispersion model above, it is typically straightforward to calculate the MLE \(\hat{\phi}\) once \(\hat{{\boldsymbol{\beta}}}\) has been calculated.
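For example, in the Normal model the MLE is \(\hat{\phi} = \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - {\boldsymbol{X}}_i \hat{{\boldsymbol{\beta}}})^2\). Note that R's summary.glm() instead reports a moment-based estimate that divides by the residual degrees of freedom rather than \(n\). A quick sketch on simulated data:

> # MLE vs R's reported dispersion estimate for a Gaussian GLM
> set.seed(3)
> n <- 100
> x <- rnorm(n)
> y <- rnorm(n, mean = 1 + 2 * x, sd = 2)
> fit <- glm(y ~ x, family = gaussian)
> sum(residuals(fit)^2) / n    # MLE of phi = sigma^2
> summary(fit)$dispersion      # RSS / (n - p), as reported by R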

66.14 CLT Applied to the MLE

Given that \(\hat{{\boldsymbol{\beta}}}\) is a maximum likelihood estimate, the CLT implies that for large \(n\) its distribution is approximately

\[ \hat{{\boldsymbol{\beta}}} {\; \stackrel{.}{\sim}\;}\mbox{MVN}_{p}({\boldsymbol{\beta}}, \phi ({\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}})^{-1}). \]

66.15 Approximately Pivotal Statistics

The previous CLT result gives us the following two approximately pivotal statistics. The first statistic facilitates getting an overall measure of uncertainty on the estimate \(\hat{{\boldsymbol{\beta}}}\).

\[ \hat{\phi}^{-1} (\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}})^T ({\boldsymbol{X}}^T \hat{{\boldsymbol{W}}} {\boldsymbol{X}}) (\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}}) {\; \stackrel{.}{\sim}\;}\chi^2_p \]

This second pivotal statistic allows for performing a Wald test or forming a confidence interval on each coefficient, \(\beta_j\), for \(j=1, \ldots, p\).

\[ \frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\phi} [({\boldsymbol{X}}^T \hat{{\boldsymbol{W}}} {\boldsymbol{X}})^{-1}]_{jj}}} {\; \stackrel{.}{\sim}\;}\mbox{Normal}(0,1) \]
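These statistics can be computed directly from a fitted glm object. For instance, for the logistic regression fit myfit from the example in Section 66.18, the Wald statistics and 95% confidence intervals reported by summary() can be reproduced as follows (a sketch; vcov() returns \(\hat{\phi} ({\boldsymbol{X}}^T \hat{{\boldsymbol{W}}} {\boldsymbol{X}})^{-1}\)):

> se <- sqrt(diag(vcov(myfit)))   # standard errors of the coefficients
> z <- coef(myfit) / se           # Wald statistics for H0: beta_j = 0
> cbind(est = coef(myfit), se = se, z = z,
+       lower = coef(myfit) - 1.96 * se, upper = coef(myfit) + 1.96 * se)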

66.16 Deviance

Let \(\hat{\boldsymbol{\eta}}\) be the estimated natural parameters from a GLM. For example, \(\hat{\boldsymbol{\eta}} = {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}}\) when the canonical link function is used.

Let \(\hat{\boldsymbol{\eta}}_n\) be the estimate under the saturated model, where each \(Y_i\) is used directly to estimate \(\eta_i\) without model constraints. For example, in the Bernoulli logistic regression model the saturated fitted means equal the observed outcomes, \(\hat{p}_i = Y_i\).

The deviance for the model is defined to be

\[ D\left(\hat{\boldsymbol{\eta}}\right) = 2 \ell(\hat{\boldsymbol{\eta}}_n; {\boldsymbol{X}}, {\boldsymbol{Y}}) - 2 \ell(\hat{\boldsymbol{\eta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \]
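For Bernoulli response data the saturated model fits each observation perfectly, so \(\ell(\hat{\boldsymbol{\eta}}_n; {\boldsymbol{X}}, {\boldsymbol{Y}}) = 0\) and the deviance reduces to \(-2 \ell(\hat{\boldsymbol{\eta}}; {\boldsymbol{X}}, {\boldsymbol{Y}})\). This can be checked on any binary-response fit, such as myfit below:

> deviance(myfit)                 # residual deviance
> -2 * as.numeric(logLik(myfit))  # equal for a Bernoulli (0/1) response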

66.17 Generalized LRT

Let \({\boldsymbol{X}}_0\) be a subset of \(p_0\) columns of \({\boldsymbol{X}}\) and let \({\boldsymbol{X}}_1\) be a subset of \(p_1\) columns, where \(1 \leq p_0 < p_1 \leq p\). Also, assume that the columns of \({\boldsymbol{X}}_0\) are a subset of \({\boldsymbol{X}}_1\).

Without loss of generality, suppose that \({\boldsymbol{\beta}}_0 = (\beta_1, \beta_2, \ldots, \beta_{p_0})^T\) and \({\boldsymbol{\beta}}_1 = (\beta_1, \beta_2, \ldots, \beta_{p_1})^T\).

Suppose we wish to test \(H_0: (\beta_{p_0+1}, \beta_{p_0 + 2}, \ldots, \beta_{p_1}) = {\boldsymbol{0}}\) vs. \(H_1: (\beta_{p_0+1}, \beta_{p_0 + 2}, \ldots, \beta_{p_1}) \not= {\boldsymbol{0}}\).

We can form \(\hat{\boldsymbol{\eta}}_0 = {\boldsymbol{X}}_0 \hat{{\boldsymbol{\beta}}}_0\) from the GLM model \(g\left({\operatorname{E}}[Y | {\boldsymbol{X}}_0]\right) = {\boldsymbol{X}}_0 {\boldsymbol{\beta}}_0\). We can analogously form \(\hat{\boldsymbol{\eta}}_1 = {\boldsymbol{X}}_1 \hat{{\boldsymbol{\beta}}}_1\) from the GLM model \(g\left({\operatorname{E}}[Y | {\boldsymbol{X}}_1]\right) = {\boldsymbol{X}}_1 {\boldsymbol{\beta}}_1\).

The generalized LRT statistic \(2 \log \lambda\) can then be formed from the difference of the two deviances

\[ 2 \log \lambda({\boldsymbol{X}}, {\boldsymbol{Y}}) = 2 \log \frac{L(\hat{\boldsymbol{\eta}}_1; {\boldsymbol{X}}, {\boldsymbol{Y}})}{L(\hat{\boldsymbol{\eta}}_0; {\boldsymbol{X}}, {\boldsymbol{Y}})} = D\left(\hat{\boldsymbol{\eta}}_0\right) - D\left(\hat{\boldsymbol{\eta}}_1\right) \]

where the asymptotic null distribution is \(\chi^2_{p_1-p_0}\).
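In R this test can be carried out from the two deviances directly; a sketch for hypothetical nested fits fit0 (reduced model) and fit1 (full model):

> # generalized LRT for nested GLMs fit0 and fit1 (hypothetical fits)
> stat <- deviance(fit0) - deviance(fit1)
> df <- fit0$df.residual - fit1$df.residual  # equals p1 - p0
> pchisq(stat, df = df, lower.tail = FALSE)  # p-value
> # equivalently: anova(fit0, fit1, test = "LRT")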

66.18 Example: Grad School Admissions

Let’s revisit a logistic regression example now that we know how the various statistics are calculated.

> mydata <- 
+   read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
> dim(mydata)
> head(mydata)

Fit the model with basic output. Note the argument family = "binomial".

> mydata$rank <- factor(mydata$rank, levels=c(1, 2, 3, 4))
> myfit <- glm(admit ~ gre + gpa + rank, 
+              data = mydata, family = "binomial")
> myfit

Call:  glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = mydata)

Coefficients:
(Intercept)          gre          gpa        rank2        rank3  
  -3.989979     0.002264     0.804038    -0.675443    -1.340204  
      rank4  
  -1.551464  

Degrees of Freedom: 399 Total (i.e. Null);  394 Residual
Null Deviance:      500 
Residual Deviance: 458.5    AIC: 470.5

This shows the fitted coefficient values, which are on the link function scale (the logit, aka log odds, here). The summary() output below additionally reports a Wald test for each coefficient.

> summary(myfit)

Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4

Here we perform a generalized LRT on each variable, with terms added sequentially: each row tests the improvement from adding that term to the model containing the terms above it. Note that rank is tested as a single factor variable as opposed to each dummy variable separately.

> anova(myfit, test="LRT")
Analysis of Deviance Table

Model: binomial, link: logit

Response: admit

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                   399     499.98              
gre   1  13.9204       398     486.06 0.0001907 ***
gpa   1   5.7122       397     480.34 0.0168478 *  
rank  3  21.8265       394     458.52 7.088e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, we can visualize the fitted probabilities against the predictors. This assumes the ggplot2 package is loaded.

> library(ggplot2)
> mydata <- data.frame(mydata, probs = myfit$fitted.values)
> ggplot(mydata) + geom_point(aes(x=gpa, y=probs, color=rank)) +
+   geom_jitter(aes(x=gpa, y=admit), width=0, height=0.01, alpha=0.3)

> ggplot(mydata) + geom_point(aes(x=gre, y=probs, color=rank)) +
+   geom_jitter(aes(x=gre, y=admit), width=0, height=0.01, alpha=0.3)

> ggplot(mydata) + geom_boxplot(aes(x=rank, y=probs)) +
+   geom_jitter(aes(x=rank, y=probs), width=0.1, height=0.01, alpha=0.3)

66.19 glm() Function

The glm() function has many different options available to the user.

glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, contrasts = NULL, ...)

To see the different link functions available, type:

help(family)