# 4 Exploratory Data Analysis

## 4.1 What is EDA?

Exploratory data analysis (EDA) is the process of analyzing data to uncover their key features.

John Tukey pioneered this framework and wrote a seminal book on the topic, *Exploratory Data Analysis* (1977).

EDA involves calculating numerical summaries of data, visualizing data in a variety of ways, and considering interesting data points.

Before fitting any model to data, some exploratory data analysis should always be performed.

Data science seems to place much more emphasis on EDA than traditional statistics does.

## 4.3 Components of EDA

EDA involves calculating quantities and visualizing data for:

• Checking the n’s
• Checking for missing data
• Characterizing the distributional properties of the data
• Characterizing relationships among variables and observations
• Dimension reduction
• Model formulation
• Hypothesis generation

… and there are possibly many more activities one can do.
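A few of these checks can be sketched in base R. The snippet below is a minimal illustration on the built-in mtcars data (introduced in the next section), not a complete EDA workflow. The variable names (`n_obs`, `n_missing`, and so on) are chosen here for illustration only.

```r
# Sketch only: mtcars ships with base R's datasets package.
data("mtcars", package = "datasets")

# Checking the n's: number of observations and number of variables
n_obs  <- nrow(mtcars)
n_vars <- ncol(mtcars)

# Checking for missing data: count of NA values in each variable
n_missing <- colSums(is.na(mtcars))

# Distributional properties: min, quartiles, median, mean, max of one variable
mpg_summary <- summary(mtcars$mpg)

# Relationships among variables: Pearson correlation between two variables
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)

print(c(n_obs = n_obs, n_vars = n_vars))
print(n_missing)
print(mpg_summary)
print(mpg_wt_cor)
```

Numerical summaries like these are usually paired with plots, since a single number such as a correlation can hide structure that a scatterplot would reveal.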

## 4.4 Data Sets

For the majority of this chapter, we will use some simple data sets to demonstrate the ideas.

### 4.4.1 Data mtcars

Load the mtcars data set:

> library("tidyverse") # why load tidyverse?
> data("mtcars", package="datasets")
> mtcars <- as_tibble(mtcars)
> head(mtcars)
# A tibble: 6 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
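One thing worth checking after this conversion: by default, `as_tibble()` drops row names, and in mtcars the row names hold the car models. The following is a minimal sketch of keeping them, assuming the tibble package is installed (it is one of the packages attached by tidyverse). The column name `model` and the object name `mtcars_tbl` are hypothetical choices made for this example.

```r
library("tibble")  # as_tibble() is exported by tibble

data("mtcars", package = "datasets")

# In the original data frame, the car models are stored as row names
car_names <- rownames(mtcars)

# rownames = "model" moves the row names into a regular column
# ("model" is an illustrative name, not something the data set defines)
mtcars_tbl <- as_tibble(mtcars, rownames = "model")

print(head(mtcars_tbl))
```

Whether to keep the model names depends on the analysis; they are not needed for numerical summaries but are useful for labeling interesting data points.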

### 4.4.2 Data mpg

Load the mpg data set:

> data("mpg", package="ggplot2")
> head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(… f        18    29 p     comp…
2 audi         a4      1.8  1999     4 manua… f        21    29 p     comp…
3 audi         a4      2    2008     4 manua… f        20    31 p     comp…
4 audi         a4      2    2008     4 auto(… f        21    30 p     comp…
5 audi         a4      2.8  1999     6 auto(… f        16    26 p     comp…
6 audi         a4      2.8  1999     6 manua… f        18    26 p     comp…
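Because mpg mixes character and numeric columns, "checking the n's" here can mean counting observations within each category. A minimal sketch using base R's `table()`, assuming the ggplot2 package is installed; the object names are illustrative:

```r
data("mpg", package = "ggplot2")

# Number of vehicles in each class; useNA = "ifany" would expose missing labels
class_counts <- table(mpg$class, useNA = "ifany")

# Cross-tabulation of drive train by number of cylinders
drv_by_cyl <- table(mpg$drv, mpg$cyl)

print(class_counts)
print(drv_by_cyl)
```

Sparse or empty cells in a cross-tabulation like this are worth noting early, since they limit which comparisons the data can support.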

### 4.4.3 Data diamonds

Load the diamonds data set:

> data("diamonds", package="ggplot2")
> head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
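The `<ord>` tags in the output above indicate ordered factors, so the category order carries meaning. A short sketch of inspecting that order, assuming ggplot2 is installed; the object names are illustrative:

```r
data("diamonds", package = "ggplot2")

# Confirm that cut is an ordered factor and list its levels from lowest to highest
cut_is_ordered <- is.ordered(diamonds$cut)
cut_levels <- levels(diamonds$cut)

# Count of diamonds at each cut level, reported in level order
cut_counts <- table(diamonds$cut)

print(cut_is_ordered)
print(cut_levels)
print(cut_counts)
```

Knowing the level order matters for later plots and summaries, which will arrange categories by level rather than alphabetically.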

### 4.4.4 Data gapminder

Load the gapminder data set:

> library("gapminder")
> data("gapminder", package="gapminder")
> gapminder <- as_tibble(gapminder)
6 Afghanistan Asia       1977    38.4 14880372      786.
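Since gapminder records each country repeatedly over time, a natural first check is whether every country has the same number of observations. A minimal sketch, assuming the gapminder package is installed; the object names are illustrative:

```r
data("gapminder", package = "gapminder")

# Number of rows recorded for each country
obs_per_country <- table(gapminder$country)

# How many distinct countries and which span of years the data cover
n_countries <- length(obs_per_country)
year_range <- range(gapminder$year)

# If every country has the same count, this table has a single entry
print(table(obs_per_country))
print(n_countries)
print(year_range)
```

A balanced layout like this simplifies comparisons across countries, because every country is observed in the same years.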