# 12 Exploratory Data Analysis

## 12.1 What is EDA?

Exploratory data analysis (EDA) is the process of analyzing data to uncover its key features.

John Tukey pioneered this framework and wrote the seminal book on the topic, Exploratory Data Analysis (1977).

EDA involves calculating numerical summaries of data, visualizing data in a variety of ways, and considering interesting data points.

Some exploratory data analysis should always be performed before fitting any model to the data.

## 12.3 Components of EDA

EDA involves calculating quantities and visualizing data for:

• Basic sanity checks
• Checking for missing data
• Characterizing the distributional properties of the data
• Characterizing relationships among variables and observations
• Dimension reduction
• Model formulation
• Hypothesis generation

… and there are possibly many more activities one can do.
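As a rough illustration of the first few components in the list, here is a minimal sketch using only base R and the built-in mtcars data. The variable names `n_missing` and `r_wt_mpg` are our own choices for illustration, and the particular checks shown are just one reasonable starting point, not a prescribed recipe.

```r
# mtcars ships with base R in the datasets package
data("mtcars", package = "datasets")

# Basic sanity checks: dimensions and column types
dim(mtcars)
str(mtcars)

# Checking for missing data: number of NA values in each column
n_missing <- colSums(is.na(mtcars))
n_missing

# Distributional properties: min, quartiles, median, mean, and max per column
summary(mtcars)

# Relationships among variables: correlation between weight and fuel economy
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
r_wt_mpg
```

Each of these calls prints a summary to the console, which is usually enough to spot obvious problems (unexpected types, missing values, implausible ranges) before moving on to plots.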

## 12.4 Data Sets

For the majority of this chapter, we will use some simple data sets to demonstrate the ideas.

### 12.4.1 Data mtcars

Load the mtcars data set:

> library("tidyverse") # why load tidyverse?
> data("mtcars", package="datasets")
> mtcars <- as_tibble(mtcars)
> head(mtcars)
# A tibble: 6 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
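One detail worth knowing about the conversion: in the original mtcars data frame the car models are stored as row names, and `as_tibble()` drops row names by default. The sketch below shows one way to keep them, assuming you want them as an ordinary column. The column name `model` and the object name `mtcars_named` are illustrative choices, not part of the data set.

```r
library("tibble")

# Keep the row names (car models) as a regular column.
# "model" is an illustrative column name chosen for this example.
mtcars_named <- as_tibble(datasets::mtcars, rownames = "model")

head(mtcars_named, 3)
```

Using `datasets::mtcars` here reads the original data frame directly, so the `mtcars` tibble created above is left unchanged.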

### 12.4.2 Data mpg

Load the mpg data set:

> data("mpg", package="ggplot2")
> head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(… f        18    29 p     comp…
2 audi         a4      1.8  1999     4 manua… f        21    29 p     comp…
3 audi         a4      2    2008     4 manua… f        20    31 p     comp…
4 audi         a4      2    2008     4 auto(… f        21    30 p     comp…
5 audi         a4      2.8  1999     6 auto(… f        16    26 p     comp…
6 audi         a4      2.8  1999     6 manua… f        18    26 p     comp…

### 12.4.3 Data diamonds

Load the diamonds data set:

> data("diamonds", package="ggplot2")
> head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
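The `<ord>` tag in the output marks `cut`, `color`, and `clarity` as ordered factors, meaning their levels have a built-in ranking. A quick sketch of how one might inspect this, assuming ggplot2 is installed:

```r
library("ggplot2")  # provides the diamonds data set

# Confirm that cut is an ordered factor and list its levels in rank order
is.ordered(diamonds$cut)
levels(diamonds$cut)

# Counts are reported in level order rather than alphabetical order
table(diamonds$cut)
```

This ordering matters later for summaries and plots, since categories are displayed in level order.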

### 12.4.4 Data gapminder

Load the gapminder data set:

> library("gapminder")
> data("gapminder", package="gapminder")
> gapminder <- as_tibble(gapminder)
6 Afghanistan Asia       1977    38.4 14880372      786.