6 Tidy Data
6.1 Motivation
“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham
From R for Data Science.
6.2 Definition
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
From Wickham (2014), “Tidy Data”, Journal of Statistical Software
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
From Wickham (2014), “Tidy Data”, Journal of Statistical Software
6.3 Example: Titanic Data
According to the `Titanic` data from the `datasets` package: 367 males survived, 1364 males perished, 344 females survived, and 126 females perished.
How should we organize these data?
6.3.1 Intuitive Format
| | Survived | Perished |
|---|---|---|
| Male | 367 | 1364 |
| Female | 344 | 126 |
6.3.2 Tidy Format
fate | sex | number |
---|---|---|
perished | male | 1364 |
perished | female | 126 |
survived | male | 367 |
survived | female | 344 |
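The move from the intuitive layout to the tidy layout can be sketched in a few lines. The example below is a minimal illustration in plain Python rather than R. The nested dictionary `wide` and the helper `to_tidy` are hypothetical names introduced here, not part of the `datasets` package. The counts are the four given above.

```python
# The "intuitive" layout: one entry per sex, one count per fate.
# `wide` is a hand-typed copy of the counts above, not loaded from R.
wide = {
    "male": {"survived": 367, "perished": 1364},
    "female": {"survived": 344, "perished": 126},
}


def to_tidy(wide_table, fates=("perished", "survived")):
    """Return one record per (fate, sex) pair: one row per observation."""
    rows = []
    for fate in fates:
        for sex, counts in wide_table.items():
            rows.append({"fate": fate, "sex": sex, "number": counts[fate]})
    return rows


tidy = to_tidy(wide)
for row in tidy:
    print(row["fate"], row["sex"], row["number"])
```

Each record now carries the same three keys (`fate`, `sex`, `number`), so every variable is a column and every observation is a row. That is the structure shown in the tidy table.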
6.4 Rules of Thumb
- Something is a value if it represents different forms of a common object and it changes throughout the dataset.
- Something is a value if the data can be arranged so that it appears across rows within a column and this makes sense.
For example, `fate` and `sex` do not satisfy these criteria in the `Titanic` data, but `perished`/`survived` and `female`/`male` do.
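One way to apply these rules of thumb is to list the distinct entries that appear down each column of the tidy table. The sketch below uses plain Python. The list `tidy` is a hand-typed copy of the tidy table, and `values_by_variable` is a hypothetical helper introduced only for this illustration.

```python
# Hand-typed copy of the tidy Titanic table (one dict per row).
tidy = [
    {"fate": "perished", "sex": "male", "number": 1364},
    {"fate": "perished", "sex": "female", "number": 126},
    {"fate": "survived", "sex": "male", "number": 367},
    {"fate": "survived", "sex": "female", "number": 344},
]


def values_by_variable(records):
    """Map each variable (column name) to the distinct values in its column."""
    distinct = {}
    for record in records:
        for variable, value in record.items():
            seen = distinct.setdefault(variable, [])
            if value not in seen:
                seen.append(value)
    return distinct


levels = values_by_variable(tidy)
print(levels["fate"])  # ['perished', 'survived']
print(levels["sex"])   # ['male', 'female']
```

The dictionary keys (`fate`, `sex`, `number`) are the variables. The entries that change from row to row within a column (`perished`/`survived`, `male`/`female`) are the values.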