58 Ordinary Least Squares
Ordinary least squares (OLS) estimates the model
\[ \begin{aligned} Y_i & = \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + E_i \\ & = {\boldsymbol{X}}_i {\boldsymbol{\beta}}+ E_i \end{aligned} \]
where \({\rm E}[E_i] = 0\), \({\rm Var}(E_i) = \sigma^2\), and \({\operatorname{Cov}}(E_i, E_j) = 0\) for all \(1 \leq i, j \leq n\) and \(i \not= j\).
Note that typically \(X_{i1} = 1\) for all \(i\) so that \(\beta_1 X_{i1} = \beta_1\) serves as the intercept.
58.1 OLS Solution
The estimates of \(\beta_1, \beta_2, \ldots, \beta_p\) are found by identifying the values that minimize:
\[ \begin{aligned} & \sum_{i=1}^n \left[ Y_i - (\beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip}) \right]^2 \\ & = ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}}) \end{aligned} \]
The solution is expressed in terms of matrix algebra computations:
\[ \hat{{\boldsymbol{\beta}}} = ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{Y}}. \]
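As a concrete sketch (simulated data and hypothetical variable names, not from the text), the estimate can be computed in NumPy by solving the normal equations \(({\boldsymbol{X}}^T {\boldsymbol{X}}) {\boldsymbol{\beta}}= {\boldsymbol{X}}^T {\boldsymbol{Y}}\) rather than forming the inverse explicitly:

```python
import numpy as np

# Simulated example data: n observations, p = 3 coefficients,
# with a leading column of ones so that beta_1 is the intercept.
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=1.5, size=n)

# OLS solution: solve (X^T X) beta = X^T Y instead of inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent result from NumPy's least-squares solver (SVD-based):
beta_hat_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Solving the linear system (or calling a least-squares routine directly) is generally more numerically stable than computing \(({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1}\) and multiplying.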
58.2 Sample Variance
Let the predicted values of the model be
\[ \hat{{\boldsymbol{Y}}} = {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}} = {\boldsymbol{X}}({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{Y}}. \]
We estimate \(\sigma^2\) by the OLS sample variance
\[ S^2 = \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{n-p}. \]
58.3 Sample Covariance
The \(p\)-vector \(\hat{{\boldsymbol{\beta}}}\) has covariance matrix
\[ {\operatorname{Cov}}(\hat{{\boldsymbol{\beta}}} | {\boldsymbol{X}}) = ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2. \]
Its estimated covariance matrix is
\[ \widehat{{\operatorname{Cov}}}(\hat{{\boldsymbol{\beta}}}) = ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} S^2. \]
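A short sketch of these quantities on the same kind of simulated data (hypothetical names; the setup is repeated so the snippet runs on its own):

```python
import numpy as np

# Simulated data and OLS fit as in the sketch in Section 58.1.
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Fitted values and the OLS sample variance S^2 with n - p degrees of freedom.
Y_hat = X @ beta_hat
S2 = np.sum((Y - Y_hat) ** 2) / (n - p)

# Estimated covariance matrix of beta_hat: (X^T X)^{-1} S^2.
cov_hat = np.linalg.inv(X.T @ X) * S2
```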
58.4 Expected Values
Under the assumption that \({\rm E}[E_i] = 0\), \({\rm Var}(E_i) = \sigma^2\), and \({\operatorname{Cov}}(E_i, E_j) = 0\) for all \(1 \leq i, j \leq n\) and \(i \not= j\), we have the following:
\[ {\operatorname{E}}\left[ \left. \hat{{\boldsymbol{\beta}}} \right| {\boldsymbol{X}}\right] = {\boldsymbol{\beta}}\]
\[ {\operatorname{E}}\left[ \left. S^2 \right| {\boldsymbol{X}}\right] = \sigma^2 \]
\[ {\operatorname{E}}\left[\left. ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} S^2 \right| {\boldsymbol{X}}\right] = {\operatorname{Cov}}\left(\left. \hat{{\boldsymbol{\beta}}} \right| {\boldsymbol{X}}\right) \]
\[ {\operatorname{Cov}}\left(\hat{\beta}_j, Y_i - \hat{Y}_i\right) = 0. \]
58.5 Standard Error
The standard error of \(\hat{\beta}_j\) is the square root of the \((j, j)\) diagonal entry of \(({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2\)
\[ {\operatorname{se}}(\hat{\beta}_j) = \sqrt{\left[({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2\right]_{jj}} \]
and the estimated standard error is
\[ \hat{{\operatorname{se}}}(\hat{\beta}_j) = \sqrt{\left[({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} S^2\right]_{jj}} \]
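In code, the estimated standard errors are simply the square roots of the diagonal of \(({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} S^2\); a self-contained sketch on the simulated data (hypothetical names):

```python
import numpy as np

# Simulated data and OLS fit as in the sketch in Section 58.1.
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Estimated standard errors: sqrt of the diagonal of (X^T X)^{-1} S^2.
S2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p)
se_hat = np.sqrt(np.diag(np.linalg.inv(X.T @ X) * S2))
```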
58.6 Proportion of Variance Explained
The proportion of variance explained is defined just as in the simple linear regression scenario:
\[ R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}. \]
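Computed on the simulated example (a sketch with hypothetical names), \(R^2\) is the regression sum of squares divided by the total sum of squares:

```python
import numpy as np

# Simulated data and OLS fit as in the sketch in Section 58.1.
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)
Y_hat = X @ np.linalg.solve(X.T @ X, X.T @ Y)

# R^2: variation captured by the fitted values over total variation in Y.
R2 = np.sum((Y_hat - Y.mean()) ** 2) / np.sum((Y - Y.mean()) ** 2)
```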
58.7 Normal Errors
Suppose we assume \(E_1, E_2, \ldots, E_n {\; \stackrel{\text{iid}}{\sim}\;}\mbox{Normal}(0, \sigma^2)\). Then
\[ \ell\left({\boldsymbol{\beta}}, \sigma^2 ; {\boldsymbol{Y}}, {\boldsymbol{X}}\right) \propto -n\log(\sigma^2) -\frac{1}{\sigma^2} ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}}). \]
Since minimizing \(({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}{\boldsymbol{\beta}})\) maximizes the likelihood with respect to \({\boldsymbol{\beta}}\), this implies \(\hat{{\boldsymbol{\beta}}}\) is the MLE for \({\boldsymbol{\beta}}\).
It can also be calculated that \(\frac{n-p}{n} S^2\) is the MLE for \(\sigma^2\).
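To see where this comes from (a standard calculation filling in the step), plug \(\hat{{\boldsymbol{\beta}}}\) into the log-likelihood above and differentiate with respect to \(\sigma^2\):
\[ \frac{\partial}{\partial \sigma^2} \left[ -n\log(\sigma^2) -\frac{1}{\sigma^2} ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}}) \right] = -\frac{n}{\sigma^2} + \frac{({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}})}{\sigma^4} = 0, \]
which gives
\[ \hat{\sigma}^2_{\text{MLE}} = \frac{({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}})}{n} = \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{n} = \frac{n-p}{n} S^2. \]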
58.8 Sampling Distribution
When \(E_1, E_2, \ldots, E_n {\; \stackrel{\text{iid}}{\sim}\;}\mbox{Normal}(0, \sigma^2)\), it follows that, conditional on \({\boldsymbol{X}}\):
\[ \hat{{\boldsymbol{\beta}}} \sim \mbox{MVN}_p\left({\boldsymbol{\beta}}, ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2 \right) \]
\[ \begin{aligned} \frac{(n-p) S^2}{\sigma^2} & \sim \chi^2_{n-p} \\ \frac{\hat{\beta}_j - \beta_j}{\hat{{\operatorname{se}}}(\hat{\beta}_j)} & \sim t_{n-p} \end{aligned} \]
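The \(t\) statistic above is what standard software reports for each coefficient. A minimal sketch of the computation (simulated data, hypothetical names), using `scipy.stats.t` for the two-sided p-value of \(H_0: \beta_j = 0\):

```python
import numpy as np
from scipy import stats

# Simulated data and OLS fit as in the sketch in Section 58.1.
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Estimated standard errors and t statistics for H0: beta_j = 0.
S2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p)
se_hat = np.sqrt(np.diag(np.linalg.inv(X.T @ X) * S2))
t_stat = beta_hat / se_hat

# Two-sided p-values from the t distribution with n - p degrees of freedom.
p_values = 2 * stats.t.sf(np.abs(t_stat), df=n - p)
```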
58.9 CLT
Under the assumption that \({\rm E}[E_i] = 0\), \({\rm Var}(E_i) = \sigma^2\), and \({\operatorname{Cov}}(E_i, E_j) = 0\) for \(i \not= j\), and assuming that \(\frac{1}{n} {\boldsymbol{X}}^T {\boldsymbol{X}}\) converges to a positive definite matrix \({\boldsymbol{\Sigma}}_X\), it follows that as \(n \rightarrow \infty\),
\[ \sqrt{n} \left(\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}}\right) \stackrel{D}{\longrightarrow} \mbox{MVN}_p\left( \boldsymbol{0}, \sigma^2 {\boldsymbol{\Sigma}}_X^{-1} \right). \]
In practice, this means that for large \(n\) the distribution of \(\hat{{\boldsymbol{\beta}}}\) is approximately \(\mbox{MVN}_p\left({\boldsymbol{\beta}}, ({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2\right)\) even when the errors are not normal.
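A small Monte Carlo sketch (simulated data with non-normal errors, hypothetical names) illustrates this: with centered exponential errors, the sampling covariance of \(\hat{{\boldsymbol{\beta}}}\) is close to \(({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} \sigma^2\) once \(n\) is moderately large.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, B = 200, 3, 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 1.5
XtX_inv = np.linalg.inv(X.T @ X)

# Repeatedly draw Y with centered, non-normal errors and re-fit OLS.
beta_hats = np.empty((B, p))
for b in range(B):
    E = sigma * (rng.exponential(size=n) - 1.0)  # mean 0, standard deviation sigma
    Y = X @ beta + E
    beta_hats[b] = np.linalg.solve(X.T @ X, X.T @ Y)

# Empirical covariance of beta_hat vs. the theoretical (X^T X)^{-1} sigma^2.
print(np.cov(beta_hats, rowvar=False))
print(XtX_inv * sigma ** 2)
```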
58.10 Gauss-Markov Theorem
Under the assumption that \({\rm E}[E_i] = 0\), \({\rm Var}(E_i) = \sigma^2\), and \({\operatorname{Cov}}(E_i, E_j) = 0\) for \(i \not= j\), the Gauss-Markov theorem shows that the least squares estimator is the best linear unbiased estimator (BLUE): among all linear unbiased estimators of \({\boldsymbol{\beta}}\), it has the smallest variance.
Specifically, suppose that \(\tilde{{\boldsymbol{\beta}}}\) is a linear estimator (i.e., a linear transformation of \({\boldsymbol{Y}}\)) with \({\operatorname{E}}[\tilde{{\boldsymbol{\beta}}} | {\boldsymbol{X}}] = {\boldsymbol{\beta}}\). Then
\[ {\operatorname{E}}\left[ \left. ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}}) \right| {\boldsymbol{X}}\right] \leq {\operatorname{E}}\left[ \left. ({\boldsymbol{Y}}- {\boldsymbol{X}}\tilde{{\boldsymbol{\beta}}})^T ({\boldsymbol{Y}}- {\boldsymbol{X}}\tilde{{\boldsymbol{\beta}}}) \right| {\boldsymbol{X}}\right]. \]
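As an illustrative sketch (not from the original text), one can compare OLS against another linear unbiased estimator, for example a weighted fit \(\tilde{{\boldsymbol{\beta}}} = ({\boldsymbol{X}}^T {\boldsymbol{W}} {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{W}} {\boldsymbol{Y}}\) with arbitrary fixed positive weights \({\boldsymbol{W}}\): under the homoskedastic model above it remains unbiased, but by Gauss-Markov its variance is no smaller than that of OLS.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, B = 100, 3, 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
W = np.diag(rng.uniform(0.5, 2.0, size=n))  # arbitrary fixed positive weights

ols = np.empty((B, p))
alt = np.empty((B, p))
rss_ols = np.empty(B)
rss_alt = np.empty(B)
for b in range(B):
    Y = X @ beta + rng.normal(scale=1.5, size=n)       # homoskedastic errors
    bh = np.linalg.solve(X.T @ X, X.T @ Y)              # OLS estimate
    bt = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)      # alternative linear unbiased estimate
    ols[b], alt[b] = bh, bt
    rss_ols[b] = np.sum((Y - X @ bh) ** 2)
    rss_alt[b] = np.sum((Y - X @ bt) ** 2)

# Both estimators average out near beta, but (up to Monte Carlo error) OLS has
# no larger per-coefficient variance and no larger average residual sum of squares.
print(ols.var(axis=0), alt.var(axis=0))
print(rss_ols.mean(), rss_alt.mean())
```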