# 56 Simple Linear Regression

## 56.1 Definition

For random variables $$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$$, simple linear regression estimates the model

$Y_i = \beta_1 + \beta_2 X_i + E_i$

where $${\operatorname{E}}[E_i] = 0$$, $${\operatorname{Var}}(E_i) = \sigma^2$$ for all $$1 \leq i \leq n$$, and $${\operatorname{Cov}}(E_i, E_j) = 0$$ for all $$i \neq j$$.
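To make the definition concrete, the following is a minimal sketch that simulates data from this model. The parameter values, the uniform distribution for $$X_i$$, and the Normal distribution for $$E_i$$ are all illustrative assumptions; the model itself only specifies the mean, variance, and covariance of the errors.

```r
# Simulate n pairs from Y_i = beta_1 + beta_2 X_i + E_i.
# All numeric values below are illustrative assumptions.
set.seed(1)
n     <- 100
beta1 <- 2      # intercept (assumed)
beta2 <- 0.5    # slope (assumed)
sigma <- 1.5    # error standard deviation (assumed)

x <- runif(n, min = 0, max = 10)

# Independent draws give Cov(E_i, E_j) = 0 for i != j; the Normal shape
# is an extra assumption that the model does not require.
e <- rnorm(n, mean = 0, sd = sigma)

y <- beta1 + beta2 * x + e
```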

## 56.2 Rationale

• Least squares linear regression is one of the simplest and most useful methods for building a model that explains the variation of one variable in terms of other variables.

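• It is simple to fit, it satisfies some optimality criteria (for example, under the assumptions above the least squares estimator is the best linear unbiased estimator), and it is straightforward to check assumptions on the data so that statistical inference can be performed.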

## 56.3 Setup

• Suppose that we have observed $$n$$ pairs of data $$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$.

• Least squares linear regression models variation of the response variable $$y$$ in terms of the explanatory variable $$x$$ in the form of $$\beta_1 + \beta_2 x$$, where $$\beta_1$$ and $$\beta_2$$ are chosen to satisfy a least squares optimization.

## 56.4 Line Minimizing Squared Error

The least squares regression line is formed from the values of $$\beta_1$$ and $$\beta_2$$ that minimize:

$\sum_{i=1}^n \left( y_i - \beta_1 - \beta_2 x_i \right)^2.$

For a given set of data, there is a unique solution to this minimization as long as there are at least two distinct values among $$x_1, x_2, \ldots, x_n$$.

Let $$\hat{\beta}_1$$ and $$\hat{\beta}_2$$ be the values that minimize this sum of squares.
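As a rough illustration of this minimization, the sketch below minimizes the sum of squares numerically with `optim()` and compares the result with `lm()`. The data are simulated and the names `sse`, `opt`, and `x_const` are made up for this example. The last lines show what happens when the uniqueness condition fails.

```r
set.seed(2)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

# Sum of squared errors as a function of b = c(beta_1, beta_2)
sse <- function(b) sum((y - b[1] - b[2] * x)^2)

# Generic numerical minimizer, started from (0, 0)
opt <- optim(par = c(0, 0), fn = sse, method = "BFGS")
opt$par

# lm() solves the same least squares problem directly
coef(lm(y ~ x))

# With only one distinct x value the minimizer is not unique;
# lm() signals this by returning NA for the slope
x_const <- rep(3, 50)
coef(lm(y ~ x_const))
```

The two sets of coefficients should agree up to the tolerance of the numerical optimizer.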

## 56.5 Least Squares Solution

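In terms of the sample means $$\overline{x}$$ and $$\overline{y}$$, the sample standard deviations $$s_x$$ and $$s_y$$, and the sample correlation $$r_{xy}$$, these values are: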

$\hat{\beta}_2 = r_{xy} \frac{s_y}{s_x}$

$\hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x}$

These values have a useful interpretation.
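A sketch of where the formulas come from, and one common way to read them, follows. The symbol $$s_{xy}$$ (the sample covariance of $$x$$ and $$y$$) is notation introduced here for the derivation. The standardized form in the last line is one standard reading, offered as an illustration rather than as the interpretation the text goes on to give.

```latex
% Setting the partial derivatives of the sum of squares to zero
% gives the two normal equations:
\sum_{i=1}^n \left( y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i \right) = 0,
\qquad
\sum_{i=1}^n x_i \left( y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i \right) = 0.

% The first equation gives the intercept:
\hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x}

% Substituting into the second equation gives the slope:
\hat{\beta}_2
  = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}
         {\sum_{i=1}^n (x_i - \overline{x})^2}
  = \frac{s_{xy}}{s_x^2}
  = r_{xy} \frac{s_y}{s_x}

% Equivalent standardized form of the fitted line:
\frac{\hat{y} - \overline{y}}{s_y} = r_{xy} \, \frac{x - \overline{x}}{s_x}
```

Read this way, the fitted line passes through $$(\overline{x}, \overline{y})$$, and an $$x$$ value one standard deviation above $$\overline{x}$$ is fitted with a $$y$$ value $$r_{xy}$$ standard deviations above $$\overline{y}$$.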

## 56.7 Example: Height and Weight

> ggplot(data=htwt, mapping=aes(x=height, y=weight)) +
+   geom_point(size=2, alpha=0.5) +
+   geom_smooth(method="lm", se=FALSE, formula=y~x)

## 56.8 Calculate the Line Directly

> beta2 <- cor(htwt$height, htwt$weight) *
+                sd(htwt$weight) / sd(htwt$height)
> beta2
[1] 1.150092
>
> beta1 <- mean(htwt$weight) - beta2 * mean(htwt$height)
> beta1
[1] -130.9104
>
> yhat <- beta1 + beta2 * htwt$height
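The direct calculation can be cross-checked against `lm()`. The sketch below does this on a simulated stand-in data frame, here called `htwt_sim`, because the `htwt` data are not reproduced in this section. The simulated values are hypothetical and are not meant to match the output above.

```r
# Hypothetical stand-in for htwt: simulated heights and weights
set.seed(3)
htwt_sim <- data.frame(height = rnorm(200, mean = 170, sd = 10))
htwt_sim$weight <- -100 + htwt_sim$height + rnorm(200, sd = 12)

# Direct calculation from the correlation and standard deviations
b2 <- cor(htwt_sim$height, htwt_sim$weight) *
        sd(htwt_sim$weight) / sd(htwt_sim$height)
b1 <- mean(htwt_sim$weight) - b2 * mean(htwt_sim$height)

# lm() fit of the same line
fit <- lm(weight ~ height, data = htwt_sim)

# TRUE if both routes agree up to floating point tolerance
all.equal(unname(coef(fit)), c(b1, b2))
```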

## 56.9 Plot the Line

> df <- data.frame(htwt, yhat=yhat)
> ggplot(data=df) + geom_point(aes(x=height, y=weight), size=2, alpha=0.5) +
+   geom_line(aes(x=height, y=yhat), color="blue", size=1.2)

## 56.10 Observed Data, Fits, and Residuals

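We observe data $$(x_1, y_1), \ldots, (x_n, y_n)$$. Note that we only observe realizations of $$X_i$$ and $$Y_i$$ from the generative model $$Y_i = \beta_1 + \beta_2 X_i + E_i$$; the parameters $$\beta_1$$ and $$\beta_2$$ and the errors $$E_i$$ are not observed.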

We calculate fitted values and observed residuals:

$\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_i$

$\hat{e}_i = y_i - \hat{y}_i$

By construction, the residuals sum to zero: $$\sum_{i=1}^n \hat{e}_i = 0$$. This holds because the fitted line includes an intercept.
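The following sketch computes fitted values and residuals on simulated data (the data and the names `b1`, `b2`, and `ehat` are assumptions for illustration) and checks the zero-sum property numerically. It also checks that the residuals are orthogonal to the $$x_i$$, which is another consequence of the least squares fit.

```r
set.seed(4)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)

# Least squares estimates from the closed-form solution
b2 <- cor(x, y) * sd(y) / sd(x)
b1 <- mean(y) - b2 * mean(x)

yhat <- b1 + b2 * x   # fitted values
ehat <- y - yhat      # observed residuals

sum(ehat)       # zero up to rounding error
sum(x * ehat)   # also zero up to rounding error
```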

## 56.11 Proportion of Variation Explained

The proportion of variation in $$y$$ explained by the fitted model is called $$R^2$$ or $$r^2$$. It is the ratio of the sample variance of the fitted values to the sample variance of the observed responses:

$r^2 = \frac{s^2_{\hat{y}}}{s^2_{y}}$
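As a quick check of this formula, the sketch below computes the variance ratio on simulated data (an assumption for illustration) and compares it with two other quantities: the squared sample correlation $$r_{xy}^2$$, which it equals in simple linear regression, and the value reported by `summary()` on an `lm` fit.

```r
set.seed(5)
x <- runif(40, 0, 10)
y <- 3 - 0.8 * x + rnorm(40, sd = 2)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

r2_ratio <- var(yhat) / var(y)        # s^2_yhat / s^2_y
r2_cor   <- cor(x, y)^2               # squared sample correlation
r2_lm    <- summary(fit)$r.squared    # value reported by summary.lm

# All three should agree up to floating point tolerance
c(r2_ratio, r2_cor, r2_lm)
```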