56 Simple Linear Regression

56.1 Definition

For random variables \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\), simple linear regression estimates the model

\[ Y_i = \beta_1 + \beta_2 X_i + E_i \]

where \({\operatorname{E}}[E_i] = 0\), \({\operatorname{Var}}(E_i) = \sigma^2\), and \({\operatorname{Cov}}(E_i, E_j) = 0\) for all \(1 \leq i, j \leq n\) with \(i \not= j\).
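
To make the model concrete, here is a minimal R sketch that simulates data from it; the parameter values, sample size, and distribution of the \(X_i\) are arbitrary choices for illustration.

> # simulate n = 100 pairs from Y_i = beta1 + beta2 * X_i + E_i
> # (illustrative values: beta1 = 10, beta2 = 2, sigma = 3)
> set.seed(1)
> n <- 100
> x <- rnorm(n, mean=60, sd=5)
> e <- rnorm(n, mean=0, sd=3)   # E[E_i] = 0, Var(E_i) = sigma^2, uncorrelated
> y <- 10 + 2*x + e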

56.2 Rationale

  • Least squares linear regression is one of the simplest and most useful methods for building a model that explains the variation of one variable in terms of other variables.

  • It is simple to fit, it satisfies optimality criteria (by the Gauss-Markov theorem, it is the best linear unbiased estimator under the error assumptions above), and it is straightforward to check the assumptions on the data so that statistical inference can be performed.

56.3 Setup

  • Suppose that we have observed \(n\) pairs of data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\).

  • Least squares linear regression models variation of the response variable \(y\) in terms of the explanatory variable \(x\) in the form of \(\beta_1 + \beta_2 x\), where \(\beta_1\) and \(\beta_2\) are chosen to satisfy a least squares optimization.

56.4 Line Minimizing Squared Error

The least squares regression line is formed from the values of \(\beta_1\) and \(\beta_2\) that minimize:

\[\sum_{i=1}^n \left( y_i - \beta_1 - \beta_2 x_i \right)^2.\]

For a given set of data, there is a unique solution to this minimization as long as there are at least two distinct values among \(x_1, x_2, \ldots, x_n\) (equivalently, the sample standard deviation \(s_x\) is positive).

Let \(\hat{\beta}_1\) and \(\hat{\beta}_2\) be the values that minimize this sum of squares.
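
Setting the partial derivatives of this sum with respect to \(\beta_1\) and \(\beta_2\) equal to zero gives the normal equations:

\[\sum_{i=1}^n \left( y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i \right) = 0, \qquad \sum_{i=1}^n x_i \left( y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i \right) = 0.\]

Solving this pair of linear equations yields the closed-form solution below.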

56.5 Least Squares Solution

In terms of the sample means \(\overline{x}\) and \(\overline{y}\), sample standard deviations \(s_x\) and \(s_y\), and sample correlation \(r_{xy}\), these values are:

\[\hat{\beta}_2 = r_{xy} \frac{s_y}{s_x}\]

\[\hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x}\]

These values have a useful interpretation: the slope \(\hat{\beta}_2\) rescales the correlation between the two variables by the ratio of their standard deviations, and the intercept \(\hat{\beta}_1\) ensures that the fitted line passes through the point of averages \((\overline{x}, \overline{y})\).

56.6 Visualizing Least Squares Line

The example below plots observed data together with the fitted least squares line.

56.7 Example: Height and Weight

> ggplot(data=htwt, mapping=aes(x=height, y=weight)) + 
+   geom_point(size=2, alpha=0.5) +
+   geom_smooth(method="lm", se=FALSE, formula=y~x)
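
Here geom_smooth(method="lm", se=FALSE) fits the least squares line internally via R's lm() function; the next section computes the same line directly from the formulas above.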

56.8 Calculate the Line Directly

> beta2 <- cor(htwt$height, htwt$weight) * 
+                sd(htwt$weight) / sd(htwt$height)
> beta2
[1] 1.150092
> 
> beta1 <- mean(htwt$weight) - beta2 * mean(htwt$height)
> beta1
[1] -130.9104
> 
> yhat <- beta1 + beta2 * htwt$height
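
As a sanity check, these hand-computed values should match the coefficients returned by R's built-in lm() function:

> fit <- lm(weight ~ height, data=htwt)
> coef(fit)   # (Intercept) and height slope; should equal beta1 and beta2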

56.9 Plot the Line

> df <- data.frame(htwt, yhat=yhat)
> ggplot(data=df) + geom_point(aes(x=height, y=weight), size=2, alpha=0.5) +
+   geom_line(aes(x=height, y=yhat), color="blue", size=1.2)
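
An alternative that avoids pre-computing the fitted values is geom_abline(), which draws the same line from the intercept and slope directly:

> ggplot(data=htwt, mapping=aes(x=height, y=weight)) +
+   geom_point(size=2, alpha=0.5) +
+   geom_abline(intercept=beta1, slope=beta2, color="blue", size=1.2)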

56.10 Observed Data, Fits, and Residuals

We observe data \((x_1, y_1), \ldots, (x_n, y_n)\), modeled as realizations of \((X_i, Y_i)\) from the generative model \(Y_i = \beta_1 + \beta_2 X_i + E_i\). Note that we only observe \(X_i\) and \(Y_i\); the errors \(E_i\) and the true parameters \(\beta_1\) and \(\beta_2\) are unobserved.

We calculate fitted values and observed residuals:

\[\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_i\]

\[\hat{e}_i = y_i - \hat{y}_i\]

By construction, the residuals satisfy \(\sum_{i=1}^n \hat{e}_i = 0\); this follows from the first normal equation, since the fitted model includes an intercept.
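
We can check this numerically on the height and weight data, reusing the yhat fitted values computed earlier:

> ehat <- htwt$weight - yhat   # observed residuals
> sum(ehat)                    # zero, up to floating point error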

56.11 Proportion of Variation Explained

The proportion of variance explained by the fitted model is called \(R^2\) or \(r^2\). It is calculated as:

\[r^2 = \frac{s^2_{\hat{y}}}{s^2_{y}}\]

where \(s^2_{\hat{y}}\) is the sample variance of the fitted values and \(s^2_{y}\) is the sample variance of the observed responses.
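
In R this can be computed from the fitted values; for simple linear regression it also equals the squared sample correlation \(r_{xy}^2\), which provides a quick check:

> var(yhat) / var(htwt$weight)      # proportion of variance explained
> cor(htwt$height, htwt$weight)^2   # same value: r_xy squared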