69 Bootstrap for Statistical Models

69.1 Homoskedastic Models

Let’s first discuss how one can utilize the bootstrap on any of the three homoskedastic models:

  • Simple linear regression
  • Ordinary least squares
  • Additive models

69.2 Residuals

In each of these scenarios we sample data \(({\boldsymbol{X}}_1, Y_1), ({\boldsymbol{X}}_2, Y_2), \ldots, ({\boldsymbol{X}}_n, Y_n)\). Let's suppose we calculate fitted values \(\hat{Y}_i\) and that they are unbiased:

\[ {\operatorname{E}}[\hat{Y}_i | {\boldsymbol{X}}] = {\operatorname{E}}[Y_i | {\boldsymbol{X}}]. \]

We can calculate residuals \(\hat{E}_i = Y_i - \hat{Y}_i\) for \(i=1, 2, \ldots, n\).
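As a concrete illustration, here is a minimal numpy sketch of computing fitted values and residuals from an OLS fit; the data are simulated, so all variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical simulated data: an intercept plus one predictor
n = 100
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])        # design matrix with intercept column
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# OLS fit: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                        # fitted values Y_hat_i
e_hat = y - y_hat                           # residuals E_hat_i = Y_i - Y_hat_i
```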

69.3 Studentized Residuals

One complication is that the residuals are correlated and do not have equal variances. For example, in OLS we showed that

\[ {\operatorname{Cov}}(\hat{{\boldsymbol{E}}}) = \sigma^2 ({\boldsymbol{I}}- {\boldsymbol{P}}) \]

where \({\boldsymbol{P}}= {\boldsymbol{X}}({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T\).

To correct for this induced heteroskedasticity, note that \({\operatorname{Var}}(\hat{E}_i) = \sigma^2 (1 - P_{ii})\). We studentize the residuals by calculating

\[ R_i = \frac{\hat{E}_i}{\sqrt{1-P_{ii}}} \]

which gives \({\operatorname{Var}}(R_i) = \sigma^2\) for each \(i\), so the studentized residuals have a common variance (they remain weakly correlated through the off-diagonal entries of \({\boldsymbol{I}}- {\boldsymbol{P}}\)).
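Continuing the numpy sketch above, the leverages \(P_{ii}\) and studentized residuals follow directly from this formula (for large \(n\) one would avoid forming \({\boldsymbol{P}}\) explicitly, but doing so keeps the correspondence with the notation transparent).

```python
# Hat matrix P = X (X^T X)^{-1} X^T and its diagonal (the leverages P_ii)
P = X @ np.linalg.solve(X.T @ X, X.T)
p_diag = np.diag(P)

# Studentized residuals R_i = E_hat_i / sqrt(1 - P_ii)
r = e_hat / np.sqrt(1.0 - p_diag)
```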

69.4 Confidence Intervals

The following is a bootstrap procedure for calculating a confidence interval on some statistic \(\hat{\theta}\) calculated from a homoskedastic model fit. An example is a coefficient \(\hat{\beta}_j\) from an OLS fit.

  1. Fit the model to obtain fitted values \(\hat{Y}_i\), studentized residuals \(R_i\), and the statistic of interest \(\hat{\theta}\).
    For \(b = 1, 2, \ldots, B\), repeat steps 2-5:
  2. Sample \(n\) observations with replacement from \(\{R_i\}_{i=1}^n\) to obtain bootstrap residuals \(R_1^{*}, R_2^{*}, \ldots, R_n^{*}\).
  3. Form new response variables \(Y_i^{*} = \hat{Y}_i + R_i^{*}\).
  4. Fit the model to the bootstrap responses \(Y_i^{*}\) to obtain \(\hat{Y}^{*}_i\) and all other fitted parameters.
  5. Calculate the statistic of interest \(\hat{\theta}^{*(b)}\).

The bootstrap statistics \(\hat{\theta}^{*(1)}, \hat{\theta}^{*(2)}, \ldots, \hat{\theta}^{*(B)}\) are then utilized through one of the techniques discussed earlier (percentile, pivotal, studentized pivotal) to calculate a bootstrap CI.
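Continuing the numpy sketch, this procedure can be written as a short resampling loop. Here the statistic of interest is taken to be the slope \(\hat{\beta}_1\) and the percentile method is used for the final interval; the choices of \(B\) and the confidence level are arbitrary.

```python
B = 2000
theta_hat = beta_hat[1]                     # statistic of interest: the slope
boot_theta = np.empty(B)

for b in range(B):
    # Step 2: resample studentized residuals with replacement
    r_star = rng.choice(r, size=n, replace=True)
    # Step 3: form bootstrap responses
    y_star = y_hat + r_star
    # Steps 4-5: refit the model and record the bootstrap statistic
    beta_star = np.linalg.solve(X.T @ X, X.T @ y_star)
    boot_theta[b] = beta_star[1]

# Percentile confidence interval (one of the CI methods mentioned above)
ci_lower, ci_upper = np.quantile(boot_theta, [0.025, 0.975])
```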

69.5 Hypothesis Testing

Suppose we are testing the hypothesis \(H_0: {\operatorname{E}}[Y | {\boldsymbol{X}}] = f_0({\boldsymbol{X}})\) vs \(H_1: {\operatorname{E}}[Y | {\boldsymbol{X}}] = f_1({\boldsymbol{X}})\). Suppose it is possible to form unbiased estimates of \(f_0({\boldsymbol{X}})\) and \(f_1({\boldsymbol{X}})\) given \({\boldsymbol{X}}\), and that \(f_0\) is a restricted version of \(f_1\).

Suppose also we have a statistic \(T(\hat{f}_0, \hat{f}_1)\) for performing this test so that the larger the statistic, the more evidence there is against the null hypothesis in favor of the alternative.

The big picture strategy is to bootstrap studentized residuals from the unconstrained (alternative hypothesis) fitted model and then add those to the constrained (null hypothesis) fitted model to generate bootstrap null data sets.

  1. Fit the models to obtain fitted values \(\hat{f}_0({\boldsymbol{X}}_i)\) and \(\hat{f}_1({\boldsymbol{X}}_i)\), studentized residuals \(R_i\) from the fit \(\hat{f}_1({\boldsymbol{X}}_i)\), and the observed statistic \(T(\hat{f}_0, \hat{f}_1)\).
    For \(b = 1, 2, \ldots, B\), repeat steps 2-5:
  2. Sample \(n\) observations with replacement from \(\{R_i\}_{i=1}^n\) to obtain bootstrap residuals \(R_1^{*}, R_2^{*}, \ldots, R_n^{*}\).
  3. Form new response variables \(Y_i^{*} = \hat{f}_0({\boldsymbol{X}}_i) + R_i^{*}\).
  4. Fit both models to the response variables \(Y_i^{*}\) to obtain \(\hat{f}^{*}_0\) and \(\hat{f}^{*}_1\).
  5. Calculate the statistic \(T(\hat{f}^{*(b)}_0, \hat{f}^{*(b)}_1)\).

The p-value is then calculated as

\[ \frac{\sum_{b=1}^B 1\left(T(\hat{f}^{*(b)}_0, \hat{f}^{*(b)}_1) \geq T(\hat{f}_0, \hat{f}_1) \right) }{B} \]
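To make the strategy concrete, here is a continuation of the numpy sketch for a hypothetical test of whether the slope is zero: \(f_0\) is the intercept-only model, \(f_1\) is the full model, and \(T\) is taken to be the reduction in residual sum of squares (one of many reasonable choices of test statistic).

```python
X0 = np.ones((n, 1))     # restricted design under H_0 (intercept only)
X1 = X                   # full design under H_1

def fitted(Xd, yv):
    """Fitted values from an OLS fit of yv on design matrix Xd."""
    return Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T @ yv)

def T_stat(yv):
    """Reduction in RSS from the null fit to the alternative fit."""
    rss0 = np.sum((yv - fitted(X0, yv)) ** 2)
    rss1 = np.sum((yv - fitted(X1, yv)) ** 2)
    return rss0 - rss1

T_obs = T_stat(y)
f0_hat = fitted(X0, y)   # null-model fitted values

T_boot = np.empty(B)
for b in range(B):
    # Resample studentized residuals from the unconstrained (alternative) fit
    r_star = rng.choice(r, size=n, replace=True)
    # Generate a bootstrap data set under the null fitted model
    y_star = f0_hat + r_star
    T_boot[b] = T_stat(y_star)

p_value = np.mean(T_boot >= T_obs)
```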

69.6 Parametric Bootstrap

For more complex scenarios, such as GLMs, GAMs, and heteroskedastic models, it is typically more straightforward to utilize a parametric bootstrap, in which bootstrap responses are simulated directly from the fitted parametric model rather than formed from resampled residuals.
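As one illustration of the idea (not a prescription from these notes), here is a minimal sketch of a parametric bootstrap for a Poisson GLM using numpy and statsmodels; the data, model, and constants are all hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical Poisson regression data
n = 200
x = rng.uniform(0, 2, size=n)
X = sm.add_constant(x)
y = rng.poisson(np.exp(0.3 + 0.8 * x))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues                  # estimated means mu_hat_i

# Parametric bootstrap: simulate Y* from the fitted model, then refit
B = 1000
boot_slope = np.empty(B)
for b in range(B):
    y_star = rng.poisson(mu_hat)           # Y_i* ~ Poisson(mu_hat_i)
    refit = sm.GLM(y_star, X, family=sm.families.Poisson()).fit()
    boot_slope[b] = refit.params[1]

# e.g., a percentile CI for the slope coefficient
ci = np.quantile(boot_slope, [0.025, 0.975])
```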