12 Exploratory Data Analysis
12.1 What is EDA?
Exploratory data analysis (EDA) is the process of analyzing data to uncover their key features.
John Tukey pioneered this framework, writing a seminal book on the topic (called Exploratory Data Analysis).
EDA involves calculating numerical summaries of data, visualizing data in a variety of ways, and considering interesting data points.
Before fitting any model to the data, some exploratory data analysis should always be performed.
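One way to see why exploration should come before modeling is Anscombe's quartet, which ships with R as the `anscombe` data frame in the datasets package. The sketch below is only an illustration (the helper object `summaries` is a name introduced here, not something from the chapter): it computes the correlation and least-squares line for each of the four x/y pairs, then plots them.

```r
# Anscombe's quartet is part of base R (package "datasets").
data("anscombe", package = "datasets")

# For each of the four x/y pairs, compute the correlation and the
# intercept and slope of a simple linear regression of y on x.
summaries <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(cor       = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
})
colnames(summaries) <- paste0("pair", 1:4)
print(round(summaries, 2))

# Scatterplots of the same four pairs, drawn in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```

The numerical summaries are nearly identical across the four pairs, while the scatterplots look very different from one another. A model fit without looking at the data would treat all four cases the same.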
12.2 Descriptive Statistics Examples
- Facebook’s Visualizing Friendships (side note: a discussion)
- Hans Rosling: Debunking third-world myths with the best stats you’ve ever seen
- Flowing Data’s A Day in the Life of Americans
12.3 Components of EDA
EDA involves calculating quantities and visualizing data for:
- Basic sanity checks
- Checking for missing data
- Characterizing the distributional properties of the data
- Characterizing relationships among variables and observations
- Dimension reduction
- Model formulation
- Hypothesis generation
… and there are possibly many more activities one can do.
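The first four items in the list can be sketched in a few lines of base R. This is a minimal, non-authoritative example using the `mtcars` data frame that ships with R. The choice of variables (`mpg` and `wt`) and the object names `na_counts` and `r_mpg_wt` are assumptions made for illustration.

```r
data("mtcars", package = "datasets")

# Basic sanity checks: number of rows and columns, and column types.
print(dim(mtcars))
str(mtcars)

# Checking for missing data: count the NA values in each column.
na_counts <- colSums(is.na(mtcars))
print(na_counts)

# Distributional properties: numerical summary and histogram of mpg.
print(summary(mtcars$mpg))
hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon")

# Relationships among variables: correlation and scatterplot of
# fuel economy against weight.
r_mpg_wt <- cor(mtcars$mpg, mtcars$wt)
print(r_mpg_wt)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

Each step pairs a calculated quantity with a plot where one is useful, mirroring the "calculating quantities and visualizing data" description above. The remaining items (dimension reduction, model formulation, hypothesis generation) build on these checks.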
12.4 Data Sets
For the majority of this chapter, we will use some simple data sets to demonstrate the ideas.
12.4.1 Data mtcars
Load the mtcars data set:
> library("tidyverse") # why load tidyverse?
> data("mtcars", package="datasets")
> mtcars <- as_tibble(mtcars)
> head(mtcars)
# A tibble: 6 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
12.4.2 Data mpg
Load the mpg data set:
> data("mpg", package="ggplot2")
> head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(… f        18    29 p     comp…
2 audi         a4      1.8  1999     4 manua… f        21    29 p     comp…
3 audi         a4      2    2008     4 manua… f        20    31 p     comp…
4 audi         a4      2    2008     4 auto(… f        21    30 p     comp…
5 audi         a4      2.8  1999     6 auto(… f        16    26 p     comp…
6 audi         a4      2.8  1999     6 manua… f        18    26 p     comp…
12.4.3 Data diamonds
Load the diamonds data set:
> data("diamonds", package="ggplot2")
> head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
12.4.4 Data gapminder
Load the gapminder data set:
> library("gapminder")
> data("gapminder", package="gapminder")
> gapminder <- as_tibble(gapminder)
> head(gapminder)
# A tibble: 6 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.