62 Variable Transformations

62.1 Rationale

To obtain reliable model fits and valid inference from linear models, the model assumptions described earlier must be satisfied.

When they are not, it is sometimes necessary to transform the response variable and/or some of the explanatory variables.

The choice of transformation should be guided by data visualization and exploration.

62.2 Power and Log Transformations

It is often useful to explore power and log transforms of the variables, e.g., \(\log(y)\) or \(y^\lambda\) for some \(\lambda\) (and likewise \(\log(x)\) or \(x^\lambda\)).

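The Box-Cox family of power transformations is a standard reference point for choosing \(\lambda\); it includes the log transform as the limiting case \(\lambda \to 0\).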
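
To make the family concrete: for a positive response \(y\), the Box-Cox transform is usually written as \((y^\lambda - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log(y)\) for \(\lambda = 0\). The following is a minimal sketch of that definition; the helper name `box_cox` and the example values are illustrative assumptions, not part of the original notes.

```r
# Box-Cox power transformation (hypothetical helper, not from the text).
# Requires y > 0. The lambda = 0 case is defined as log(y), which is the
# limit of (y^lambda - 1) / lambda as lambda -> 0.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- c(0.5, 1, 2, 4, 8)   # made-up positive values
box_cox(y, 1)      # y - 1: a shift only, the shape is unchanged
box_cox(y, 0.5)    # square-root scale
box_cox(y, 0)      # natural log
box_cox(y, 1e-6)   # nearly identical to log(y)
```

In practice, `MASS::boxcox()` can profile the likelihood over a grid of \(\lambda\) values for a fitted `lm` object, which is one way to pick \(\lambda\) from the data rather than by eye.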

62.3 `diamonds` Data

> library(ggplot2)  # needed for ggplot() in the sections that follow
> data("diamonds", package="ggplot2")
> head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

62.4 Nonlinear Relationship

> ggplot(data = diamonds) +
+   geom_point(mapping=aes(x=carat, y=price, color=clarity), alpha=0.3)

62.5 Regression with Nonlinear Relationship

> diam_fit <- lm(price ~ carat + clarity, data=diamonds)
> anova(diam_fit)
Analysis of Variance Table

Response: price
             Df     Sum Sq    Mean Sq  F value    Pr(>F)    
carat         1 7.2913e+11 7.2913e+11 435639.9 < 2.2e-16 ***
clarity       7 3.9082e+10 5.5831e+09   3335.8 < 2.2e-16 ***
Residuals 53931 9.0264e+10 1.6737e+06                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

62.6 Residual Distribution

> plot(diam_fit, which=1)

62.7 Normal Residuals Check

> plot(diam_fit, which=2)

62.8 Log-Transformation

> ggplot(data = diamonds) +
+   geom_point(aes(x=carat, y=price, color=clarity), alpha=0.3) +
+   scale_y_log10(breaks=c(1000,5000,10000)) + 
+   scale_x_log10(breaks=1:5)

62.9 OLS on Log-Transformed Data

> library(dplyr)  # provides mutate() and %>%
> diamonds <- mutate(diamonds, log_price = log10(price), 
+                    log_carat = log10(carat))
> ldiam_fit <- lm(log_price ~ log_carat + clarity, data=diamonds)
> anova(ldiam_fit)
Analysis of Variance Table

Response: log_price
             Df Sum Sq Mean Sq   F value    Pr(>F)    
log_carat     1 9771.9  9771.9 1452922.6 < 2.2e-16 ***
clarity       7  339.1    48.4    7203.3 < 2.2e-16 ***
Residuals 53931  362.7     0.0                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
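
It can help to write out what this log-log fit assumes on the original scale. The symbols \(\beta_0\), \(\beta_1\), \(\gamma_{\text{clarity}}\) (a clarity-specific offset), and \(\varepsilon\) are notation introduced here for illustration. The fitted model has the form

\[
\log_{10}(\text{price}) = \beta_0 + \beta_1 \log_{10}(\text{carat}) + \gamma_{\text{clarity}} + \varepsilon ,
\]

and raising 10 to the power of both sides gives

\[
\text{price} = 10^{\beta_0 + \gamma_{\text{clarity}}} \cdot \text{carat}^{\beta_1} \cdot 10^{\varepsilon} .
\]

So on the original scale, price follows a power law in carat with exponent \(\beta_1\), clarity acts as a multiplicative factor, and the errors are multiplicative rather than additive. Multiplying carat by a factor \(c\) multiplies the typical price by \(c^{\beta_1}\).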

62.10 Residual Distribution

> plot(ldiam_fit, which=1)

62.11 Normal Residuals Check

> plot(ldiam_fit, which=2)
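
For reference, `plot()` on an `lm` object with `which=1` draws residuals against fitted values, and `which=2` draws a normal Q-Q plot of the standardized residuals. The sketch below rebuilds both plots by hand for a raw-scale fit and a log-log fit, and shows how to move predictions back to the original scale. It uses simulated power-law data rather than `diamonds`; the names `sim`, `raw_fit`, `log_fit`, `new_stone` and the constants 3000, 1.7, and 0.1 are arbitrary assumptions.

```r
set.seed(1)
n     <- 500
carat <- runif(n, 0.2, 3)                           # simulated sizes
price <- 3000 * carat^1.7 * 10^rnorm(n, sd = 0.1)   # multiplicative noise
sim   <- data.frame(carat, price)

raw_fit <- lm(price ~ carat, data = sim)
log_fit <- lm(log10(price) ~ log10(carat), data = sim)

# Hand-built versions of the two diagnostic plots, for each fit
op <- par(mfrow = c(2, 2))
plot(fitted(raw_fit), resid(raw_fit), main = "Raw: residuals vs fitted")
abline(h = 0, lty = 2)
qqnorm(rstandard(raw_fit), main = "Raw: normal Q-Q")
qqline(rstandard(raw_fit))
plot(fitted(log_fit), resid(log_fit), main = "Log-log: residuals vs fitted")
abline(h = 0, lty = 2)
qqnorm(rstandard(log_fit), main = "Log-log: normal Q-Q")
qqline(rstandard(log_fit))
par(op)

coef(log_fit)   # the slope should be near the simulated exponent 1.7

# Predictions are on the log10 scale; undo the transform with 10^
new_stone <- data.frame(carat = 1)
pred_price <- 10^predict(log_fit, newdata = new_stone)
pred_price
```

Note that back-transforming with `10^` estimates a typical (median-like) value on the original scale, not the mean, because the errors are symmetric on the log scale only.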

62.12 Tree Pollen Study

Suppose we have a study in which tree pollen measurements are averaged each week and recorded for 10 years, giving 520 weekly observations. The data shown here are simulated:

> pollen_study
# A tibble: 520 x 3
    week  year pollen
   <int> <int>  <dbl>
 1     1  2001  1842.
 2     2  2001  1966.
 3     3  2001  2381.
 4     4  2001  2141.
 5     5  2001  2210.
 6     6  2001  2585.
 7     7  2001  2392.
 8     8  2001  2105.
 9     9  2001  2278.
10    10  2001  2384.
# … with 510 more rows

62.13 Tree Pollen Count by Week

> ggplot(pollen_study) + geom_point(aes(x=week, y=pollen))

62.14 A Clever Transformation

The relationship between pollen and week becomes linear if we transform `week` into the number of weeks from the peak week, taken here to be week 20:

> pollen_study <- pollen_study %>%  
+                        mutate(week_new = abs(week-20))

Note that this transformation is very different from a log or power transformation: it comes from the structure of the data rather than from a standard family.

62.15 `week` Transformed

> ggplot(pollen_study) + geom_point(aes(x=week_new, y=pollen))
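
The effect of this transformation on a linear fit can be checked numerically. The sketch below is not the simulation used to produce `pollen_study`, which is not shown in these notes; it builds a hypothetical data set `sim_pollen` with an assumed peak at week 20, an assumed decline of 50 units per week away from the peak, and an assumed noise level, then estimates the peak week from the data instead of hard-coding it.

```r
set.seed(2)
# Hypothetical simulation: 52 weeks x 10 years, peak at week 20,
# pollen falling by 50 units for each week away from the peak.
sim_pollen <- expand.grid(week = 1:52, year = 2001:2010)
sim_pollen$pollen <- 3000 - 50 * abs(sim_pollen$week - 20) +
  rnorm(nrow(sim_pollen), sd = 20)

# Estimate the peak week as the week with the highest average pollen
weekly_mean <- tapply(sim_pollen$pollen, sim_pollen$week, mean)
peak_week   <- as.integer(names(weekly_mean)[which.max(weekly_mean)])

# Distance (in weeks) from the estimated peak
sim_pollen$week_new <- abs(sim_pollen$week - peak_week)

fit_week     <- lm(pollen ~ week,     data = sim_pollen)
fit_week_new <- lm(pollen ~ week_new, data = sim_pollen)

summary(fit_week)$r.squared       # straight line through a tent shape
summary(fit_week_new)$r.squared   # linear in distance from the peak
coef(fit_week_new)                # slope near the simulated -50
```

Under these assumptions the fit on `week_new` explains far more of the variation than the fit on `week`, because the tent-shaped pattern in `week` is a straight line in `week_new`.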