68 Generalized Additive Models
68.1 Ordinary Least Squares
Recall that OLS estimates the model
$$
\begin{aligned}
Y_i & = \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + E_i \\
& = \boldsymbol{X}_i \boldsymbol{\beta} + E_i
\end{aligned}
$$
where $\operatorname{E}[\boldsymbol{E} \mid \boldsymbol{X}] = \boldsymbol{0}$ and $\operatorname{Cov}(\boldsymbol{E} \mid \boldsymbol{X}) = \sigma^2 \boldsymbol{I}$.
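As a quick illustration (a minimal simulated-data sketch, not part of the original notes), lm() computes this least squares estimate, matching the closed form $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$:

set.seed(1)
X <- cbind(x1 = rnorm(50), x2 = rnorm(50))    # simulated design matrix
y <- drop(X %*% c(2, -1)) + rnorm(50)         # true beta = (2, -1), no intercept
fit <- lm(y ~ 0 + X)                          # OLS without intercept, as in the model above
betahat <- drop(solve(t(X) %*% X, t(X) %*% y))  # closed-form least squares
cbind(coef(fit), betahat)                     # the two estimates agree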
68.2 Additive Models
The additive model (which could also be called “ordinary nonparametric additive regression”) is of the form
$$
\begin{aligned}
Y_i & = s_1(X_{i1}) + s_2(X_{i2}) + \ldots + s_p(X_{ip}) + E_i \\
& = \sum_{j=1}^p s_j(X_{ij}) + E_i
\end{aligned}
$$
where the $s_j(\cdot)$ for $j = 1, \ldots, p$ are a set of nonparametric (or flexible) functions. Again, we assume that $\operatorname{E}[\boldsymbol{E} \mid \boldsymbol{X}] = \boldsymbol{0}$ and $\operatorname{Cov}(\boldsymbol{E} \mid \boldsymbol{X}) = \sigma^2 \boldsymbol{I}$.
68.3 Backfitting
The additive model can be fit through a technique called backfitting.
- Initialize $s^{(0)}_j(x)$ for $j = 1, \ldots, p$.
- For $t = 1, 2, \ldots$, fit each $s_j^{(t)}(x)$ using $y_i - \sum_{k \neq j} s_k^{(t-1)}(x_{ik})$ as the response variable.
- Repeat until convergence.
Note that some extra steps have to be taken to deal with the intercept; see the sketch below.
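Here is a minimal backfitting sketch for $p = 2$ smoothers, using smooth.spline() as the smoother and handling the intercept by centering each fitted smoother. The function name, the centering scheme, and the fixed iteration count (standing in for a convergence check) are illustrative assumptions, not taken from the original notes:

backfit <- function(x1, x2, y, iters = 20) {
  alpha <- mean(y)            # intercept, absorbed by centering the smoothers
  s1 <- rep(0, length(y))     # initialize s1^(0) = 0
  s2 <- rep(0, length(y))     # initialize s2^(0) = 0
  for (t in seq_len(iters)) {
    # fit each smoother to the partial residuals, then center it
    s1 <- predict(smooth.spline(x1, y - alpha - s2), x1)$y
    s1 <- s1 - mean(s1)
    s2 <- predict(smooth.spline(x2, y - alpha - s1), x2)$y
    s2 <- s2 - mean(s2)
  }
  list(alpha = alpha, s1 = s1, s2 = s2)
}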
68.4 GAM Definition
Suppose $Y \mid \boldsymbol{X}$ is distributed according to an exponential family distribution. The extension of additive models to this family of response distributions is called generalized additive models (GAMs). The model is of the form
$$
g\left(\operatorname{E}[Y_i \mid \boldsymbol{X}_i]\right) = \sum_{j=1}^p s_j(X_{ij})
$$
where $g(\cdot)$ is the link function and the $s_j(\cdot)$ are flexible and/or nonparametric functions.
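For example, with a binary response $Y_i \mid \boldsymbol{X}_i \sim \operatorname{Bernoulli}(p_i)$, the logit link gives the logistic GAM used in the example below:

$$
\log\left(\frac{p_i}{1 - p_i}\right) = \sum_{j=1}^p s_j(X_{ij}), \qquad p_i = \operatorname{E}[Y_i \mid \boldsymbol{X}_i].
$$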
68.5 Overview of Fitting GAMs
Fitting GAMs involves putting together the following three tools; a rough sketch combining them appears after this list.
- We know how to fit a GLM via iteratively reweighted least squares (IRLS).
- We know how to fit a smoother of a single explanatory variable via a least squares solution, as seen for the natural cubic spline (NCS).
- We know how to combine additive smoothers by backfitting.
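Putting these together yields the local scoring algorithm. Below is a rough sketch for a logistic GAM with a single smoother, so the backfitting pass reduces to one smoother update; with several smoothers, that step would be a full weighted backfitting pass. This is illustrative only (it is not mgcv's actual algorithm) and assumes vectors x1 and y like those simulated in the example below:

eta <- rep(0, length(y))                    # current additive predictor
for (t in 1:25) {
  mu <- plogis(eta)                         # inverse logit link
  z  <- eta + (y - mu) / (mu * (1 - mu))    # IRLS working response
  w  <- mu * (1 - mu)                       # IRLS weights
  # the smoother step; with multiple smoothers, run weighted backfitting here
  eta <- predict(smooth.spline(x1, z, w = w), x1)$y
}
phat <- plogis(eta)                         # fitted probabilities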
68.6 GAMs in R
Three common ways to fit GAMs in R:
- Utilize glm() on explanatory variables constructed from ns() or bs()
- The gam library
- The mgcv library
68.7 Example
> set.seed(508)
> x1 <- seq(1, 10, length.out=50)
> n <- length(x1)
> x2 <- rnorm(n)
> f <- 4*log(x1) + sin(x1) - 7 + 0.5*x2
> p <- exp(f)/(1+exp(f))
> summary(p)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001842 0.074171 0.310674 0.436162 0.860387 0.944761
> y <- rbinom(n, size=1, prob=p)
> mean(y)
[1] 0.42
> df <- data.frame(x1=x1, x2=x2, y=y)
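Before the mgcv fit below, note that the gam library (the second option listed above) could fit the same model with similar syntax. A hedged sketch, not run in these notes; also note that gam and mgcv each define their own s(), so avoid loading both in the same session:

library(gam)
mygam2 <- gam(y ~ s(x1) + s(x2), family = binomial, data = df)
summary(mygam2)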
Here, we use the gam() function from the mgcv library. It automates choosing the smoothness of the splines.
> library(mgcv)
> mygam <- gam(y ~ s(x1) + s(x2), family = binomial(), data=df)
> library(broom)
> tidy(mygam)
# A tibble: 2 x 5
term edf ref.df statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 s(x1) 1.87 2.37 12.7 0.00531
2 s(x2) 1.00 1.00 1.16 0.281
> summary(mygam)
Family: binomial
Link function: logit
Formula:
y ~ s(x1) + s(x2)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1380 0.6723 -1.693 0.0905 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(x1) 1.87 2.375 12.743 0.00531 **
s(x2) 1.00 1.000 1.163 0.28084
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.488 Deviance explained = 47%
UBRE = -0.12392 Scale est. = 1 n = 50
True probabilities vs. estimated probabilities.
> plot(p, mygam$fitted.values, pch=19); abline(0,1)
Smoother estimated for x1.
> plot(mygam, select=1)
Smoother estimated for x2.
> plot(mygam, select=2)
Here, we use the glm() function and include as an explanatory variable an NCS built from the ns() function from the splines library. We include a df argument in the ns() call.
> library(splines)
> myglm <- glm(y ~ ns(x1, df=2) + x2, family = binomial(), data=df)
> tidy(myglm)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -10.9 5.31 -2.06 0.0396
2 ns(x1, df = 2)1 21.4 10.1 2.11 0.0348
3 ns(x1, df = 2)2 6.33 2.11 3.00 0.00272
4 x2 0.734 0.609 1.21 0.228
The spline basis evaluated at the x1 values.
> ns(x1, df=2)
1 2
[1,] 0.00000000 0.00000000
[2,] 0.03114456 -0.02075171
[3,] 0.06220870 -0.04138180
[4,] 0.09311200 -0.06176867
[5,] 0.12377405 -0.08179071
[6,] 0.15411442 -0.10132630
[7,] 0.18405270 -0.12025384
[8,] 0.21350847 -0.13845171
[9,] 0.24240131 -0.15579831
[10,] 0.27065081 -0.17217201
[11,] 0.29817654 -0.18745121
[12,] 0.32489808 -0.20151430
[13,] 0.35073503 -0.21423967
[14,] 0.37560695 -0.22550571
[15,] 0.39943343 -0.23519080
[16,] 0.42213406 -0.24317334
[17,] 0.44362840 -0.24933170
[18,] 0.46383606 -0.25354429
[19,] 0.48267660 -0.25568949
[20,] 0.50006961 -0.25564569
[21,] 0.51593467 -0.25329128
[22,] 0.53019136 -0.24850464
[23,] 0.54275927 -0.24116417
[24,] 0.55355797 -0.23114825
[25,] 0.56250705 -0.21833528
[26,] 0.56952943 -0.20260871
[27,] 0.57462513 -0.18396854
[28,] 0.57787120 -0.16253131
[29,] 0.57934806 -0.13841863
[30,] 0.57913614 -0.11175212
[31,] 0.57731586 -0.08265339
[32,] 0.57396762 -0.05124405
[33,] 0.56917185 -0.01764570
[34,] 0.56300897 0.01802003
[35,] 0.55555939 0.05563154
[36,] 0.54690354 0.09506722
[37,] 0.53712183 0.13620546
[38,] 0.52629468 0.17892464
[39,] 0.51450251 0.22310315
[40,] 0.50182573 0.26861939
[41,] 0.48834478 0.31535174
[42,] 0.47414005 0.36317859
[43,] 0.45929198 0.41197833
[44,] 0.44388099 0.46162934
[45,] 0.42798748 0.51201003
[46,] 0.41169188 0.56299877
[47,] 0.39507460 0.61447395
[48,] 0.37821607 0.66631397
[49,] 0.36119670 0.71839720
[50,] 0.34409692 0.77060206
attr(,"degree")
[1] 3
attr(,"knots")
50%
5.5
attr(,"Boundary.knots")
[1] 1 10
attr(,"intercept")
[1] FALSE
attr(,"class")
[1] "ns" "basis" "matrix"
Plot of basis function values vs. x1.
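The plot itself is not reproduced here; one way to draw it (a guess at the plotting call, not taken from the source):

matplot(x1, ns(x1, df = 2), type = "l", lty = 1,
        xlab = "x1", ylab = "basis function value")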
> summary(myglm)
Call:
glm(formula = y ~ ns(x1, df = 2) + x2, family = binomial(), data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0214 -0.3730 -0.0162 0.5762 1.7616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.9229 5.3079 -2.058 0.03960 *
ns(x1, df = 2)1 21.3848 10.1318 2.111 0.03480 *
ns(x1, df = 2)2 6.3266 2.1103 2.998 0.00272 **
x2 0.7342 0.6089 1.206 0.22795
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 68.029 on 49 degrees of freedom
Residual deviance: 35.682 on 46 degrees of freedom
AIC: 43.682
Number of Fisher Scoring iterations: 7
> anova(myglm, test="LRT")
Analysis of Deviance Table
Model: binomial, link: logit
Response: y
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 49 68.029
ns(x1, df = 2) 2 30.755 47 37.274 2.097e-07 ***
x2 1 1.592 46 35.682 0.207
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
True probabilities vs. estimated probabilities.
> plot(p, myglm$fitted.values, pch=19); abline(0,1)