# 58 Ordinary Least Squares

Ordinary least squares (OLS) estimates the model

\begin{aligned} Y_i & = \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + E_i \\ & = {\boldsymbol{X}}_i {\boldsymbol{\beta}}+ E_i \end{aligned}

where $${\rm E}[E_i] = 0$$, $${\rm Var}(E_i) = \sigma^2$$, and $${\operatorname{Cov}}(E_i, E_j) = 0$$ for all $$1 \leq i, j \leq n$$ and $$i \not= j$$.

Note that typically $$X_{i1} = 1$$ for all $$i$$ so that $$\beta_1 X_{i1} = \beta_1$$ serves as the intercept.

## 58.1 OLS Solution

The estimates of $$\beta_1, \beta_2, \ldots, \beta_p$$ are found by identifying the values that minimize:

\begin{aligned} \sum_{i=1}^n \left[ Y_i - (\beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip}) \right]^2 \\ = ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}}) \end{aligned}

The solution can be expressed in terms of matrix operations:

$\hat{{\boldsymbol{\beta}}} = ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{Y}}.$
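
As an illustrative sketch (the simulated data and all variable names here are assumptions, not part of the notes), the OLS solution can be computed directly from this formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, p = 3 coefficients, with X_{i1} = 1
# serving as the intercept column, as in the model above.
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.7, size=n)

# beta_hat = (X^T X)^{-1} X^T Y; solving the normal equations
# is numerically preferable to explicitly forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)
```

In practice one would call a library routine such as `np.linalg.lstsq`, which solves the same minimization via a more stable factorization.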

## 58.2 Sample Variance

Let the predicted values of the model be

$\hat{{\boldsymbol{Y}}} = {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}} = {\boldsymbol{X}}({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{Y}}.$

We estimate $$\sigma^2$$ by the OLS sample variance

$S^2 = \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{n-p}.$
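
Continuing the simulated example from above (the data-generating values are assumptions for illustration), the fitted values and $$S^2$$ are:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 3.0]) + rng.normal(scale=2.0, size=n)  # sigma^2 = 4

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat           # predicted values X beta_hat
residuals = Y - Y_hat

# OLS sample variance: divide by n - p (not n) so that S^2 is unbiased.
S2 = np.sum(residuals**2) / (n - p)
print(S2)
```

Note that the residuals are orthogonal to the columns of $${\boldsymbol{X}}$$, which is why $$n - p$$ degrees of freedom remain.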

## 58.3 Sample Covariance

The $$p$$-vector $$\hat{{\boldsymbol{\beta}}}$$ has covariance matrix

${\operatorname{Cov}}(\hat{{\boldsymbol{\beta}}} | {\boldsymbol{X}}) = ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2.$

Its estimated covariance matrix is

$\widehat{{\operatorname{Cov}}}(\hat{{\boldsymbol{\beta}}}) = ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} S^2.$

## 58.4 Expected Values

Under the assumption that $${\rm E}[E_i] = 0$$, $${\rm Var}(E_i) = \sigma^2$$, and $${\operatorname{Cov}}(E_i, E_j) = 0$$ for all $$1 \leq i, j \leq n$$ and $$i \not= j$$, we have the following:

${\operatorname{E}}\left[ \left. \hat{{\boldsymbol{\beta}}} \right| {\boldsymbol{X}}\right] = {\boldsymbol{\beta}}$

${\operatorname{E}}\left[ \left. S^2 \right| {\boldsymbol{X}}\right] = \sigma^2$

${\operatorname{E}}\left[\left. ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} S^2 \right| {\boldsymbol{X}}\right] = {\operatorname{Cov}}\left(\left. \hat{{\boldsymbol{\beta}}} \right| {\boldsymbol{X}}\right)$

${\operatorname{Cov}}\left(\left. \hat{\beta}_j, Y_i - \hat{Y}_i \right| {\boldsymbol{X}}\right) = 0.$
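
These unbiasedness properties can be checked by simulation; the following is a minimal sketch under an assumed simulated design (not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 40, 2, 2.25
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design
beta = np.array([0.5, -1.0])

beta_hats, S2s = [], []
for _ in range(5000):
    Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    beta_hats.append(b)
    S2s.append(np.sum((Y - X @ b) ** 2) / (n - p))

# Monte Carlo averages should be close to beta and sigma^2.
print(np.mean(beta_hats, axis=0), np.mean(S2s))
```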

## 58.5 Standard Error

The standard error of $$\hat{\beta}_j$$ is the square root of the $$(j, j)$$ diagonal entry of $$({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2$$

${\operatorname{se}}(\hat{\beta}_j) = \sqrt{\left[({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2\right]_{jj}}$

and the estimated standard error is

$\hat{{\operatorname{se}}}(\hat{\beta}_j) = \sqrt{\left[({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} S^2\right]_{jj}}$
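
A short sketch of computing the estimated covariance matrix and standard errors (the simulated data here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
S2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

# Estimated covariance matrix (X^T X)^{-1} S^2; the estimated standard
# errors are the square roots of its diagonal entries.
cov_hat = np.linalg.inv(X.T @ X) * S2
se_hat = np.sqrt(np.diag(cov_hat))
print(se_hat)
```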

## 58.6 Proportion of Variance Explained

The proportion of variance explained is defined just as in the simple linear regression scenario:

$R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}.$
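
A brief computational check (simulated data assumed for illustration); when the model includes an intercept column, this ratio also equals one minus the residual sum of squares over the total sum of squares:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.0, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat

# R^2: variance of fitted values about Ybar over total variance about Ybar.
R2 = np.sum((Y_hat - Y.mean()) ** 2) / np.sum((Y - Y.mean()) ** 2)
print(R2)
```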

## 58.7 Normal Errors

Suppose we assume $$E_1, E_2, \ldots, E_n {\; \stackrel{\text{iid}}{\sim}\;}\mbox{Normal}(0, \sigma^2)$$. Then

$\ell\left({\boldsymbol{\beta}}, \sigma^2 ; {\boldsymbol{Y}}, {\boldsymbol{X}}\right) \propto -n\log(\sigma^2) -\frac{1}{\sigma^2} ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}}).$

Since minimizing $$({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})$$ maximizes the likelihood with respect to $${\boldsymbol{\beta}}$$, this implies $$\hat{{\boldsymbol{\beta}}}$$ is the MLE for $${\boldsymbol{\beta}}$$.

It can also be calculated that $$\frac{n-p}{n} S^2$$ is the MLE for $$\sigma^2$$.

## 58.8 Sampling Distribution

When $$E_1, E_2, \ldots, E_n {\; \stackrel{\text{iid}}{\sim}\;}\mbox{Normal}(0, \sigma^2)$$, it follows that, conditional on $${\boldsymbol{X}}$$:

$\hat{{\boldsymbol{\beta}}} \sim \mbox{MVN}_p\left({\boldsymbol{\beta}}, ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2 \right)$

\begin{aligned} \frac{(n-p) S^2}{\sigma^2} & \sim \chi^2_{n-p} \\ \frac{\hat{\beta}_j - \beta_j}{\hat{{\operatorname{se}}}(\hat{\beta}_j)} & \sim t_{n-p} \end{aligned}
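
The chi-squared claim can be checked by simulation without any distribution-fitting libraries, since a $$\chi^2_{n-p}$$ variable has mean $$n-p$$ and variance $$2(n-p)$$. The simulated design below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 30, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, -2.0])

draws = []
for _ in range(5000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    S2 = np.sum((Y - X @ b) ** 2) / (n - p)
    draws.append((n - p) * S2 / sigma**2)

draws = np.array(draws)
# Should be close to n - p = 28 and 2(n - p) = 56, respectively.
print(draws.mean(), draws.var())
```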

## 58.9 CLT

Under the assumption that $${\rm E}[E_i] = 0$$, $${\rm Var}(E_i) = \sigma^2$$, and $${\operatorname{Cov}}(E_i, E_j) = 0$$ for $$i \not= j$$, and assuming further that $$n^{-1} {\boldsymbol{X}}^T {\boldsymbol{X}} \rightarrow {\boldsymbol{\Sigma}}$$ for some invertible matrix $${\boldsymbol{\Sigma}}$$, it follows that as $$n \rightarrow \infty$$,

$\sqrt{n} \left(\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}}\right) \stackrel{D}{\longrightarrow} \mbox{MVN}_p\left( \boldsymbol{0}, {\boldsymbol{\Sigma}}^{-1} \sigma^2 \right).$

## 58.10 Gauss-Markov Theorem

Under the assumption that $${\rm E}[E_i] = 0$$, $${\rm Var}(E_i) = \sigma^2$$, and $${\operatorname{Cov}}(E_i, E_j) = 0$$ for $$i \not= j$$, the Gauss-Markov theorem shows that the least squares estimator is the best linear unbiased estimator (BLUE): among all linear unbiased estimators, it has the smallest variance.

Specifically, suppose that $$\tilde{{\boldsymbol{\beta}}}$$ is a linear estimator (calculated from a linear operator on $${\boldsymbol{Y}}$$) where $${\operatorname{E}}[\tilde{{\boldsymbol{\beta}}} | {\boldsymbol{X}}] = {\boldsymbol{\beta}}$$. Then

${\operatorname{E}}\left[ \left. ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}}) \right| {\boldsymbol{X}}\right] \leq {\operatorname{E}}\left[ \left. ({\boldsymbol{Y}}- {\boldsymbol{X}}\tilde{{\boldsymbol{\beta}}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}\tilde{{\boldsymbol{\beta}}}) \right| {\boldsymbol{X}}\right].$
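
One way to see the theorem in action is to compare OLS against another linear unbiased estimator. A weighted least squares estimator with arbitrary positive weights is still linear and unbiased here, but under homoskedastic errors it has larger variance. The simulated design and weights below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
# Arbitrary positive weights: the resulting estimator is still linear
# and unbiased, but it is not OLS.
W = np.diag(rng.uniform(0.2, 5.0, size=n))

ols, wls = [], []
for _ in range(4000):
    Y = X @ beta + rng.normal(size=n)   # homoskedastic errors, sigma = 1
    ols.append(np.linalg.solve(X.T @ X, X.T @ Y))
    wls.append(np.linalg.solve(X.T @ W @ X, X.T @ W @ Y))

ols, wls = np.array(ols), np.array(wls)
# Both are unbiased, but OLS has the smaller sampling variance.
print(ols.var(axis=0).sum(), wls.var(axis=0).sum())
```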