6 Tidy Data

6.1 Motivation

“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

From R for Data Science.

6.2 Definition

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

From Wickham (2014), “Tidy Data”, Journal of Statistical Software

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

From Wickham (2014), “Tidy Data”, Journal of Statistical Software

6.3 Example: Titanic Data

According to the Titanic data from the datasets package: 367 males survived, 1364 males perished, 344 females survived, and 126 females perished.

How should we organize these data?

6.3.1 Intuitive Format

Sex       Survived   Perished
Male           367       1364
Female         344        126

6.3.2 Tidy Format

fate       sex      number
perished   male       1364
perished   female      126
survived   male        367
survived   female      344
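
The move from the intuitive layout to the tidy one can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the nested dictionary `wide` and the helper `to_tidy` are hypothetical names introduced here, not part of any package, and the counts are the ones quoted above.

```python
# Intuitive (wide) layout: one entry per sex, one inner key per fate.
# Counts are the Titanic figures quoted in this section.
wide = {
    "male": {"survived": 367, "perished": 1364},
    "female": {"survived": 344, "perished": 126},
}


def to_tidy(table):
    """Return one record per (fate, sex) combination (hypothetical helper)."""
    rows = []
    for sex, counts in table.items():
        for fate, number in counts.items():
            rows.append({"fate": fate, "sex": sex, "number": number})
    # Order by fate, then put males first, to match the tidy table above.
    return sorted(rows, key=lambda r: (r["fate"], r["sex"] != "male"))


tidy = to_tidy(wide)
for row in tidy:
    print(row["fate"], row["sex"], row["number"])
```

Each record now holds exactly one observation (a fate-by-sex group), and each key (fate, sex, number) plays the role of a column holding one variable.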

6.4 Rules of Thumb

  1. Something is a value if it represents different forms of a common object and it changes throughout the dataset.
  2. Something is a value if the data can be arranged so that it appears across rows within a column and this makes sense.

For example, in the Titanic data, perished/survived and female/male satisfy these criteria, so they are values. Fate and sex do not: they are the variables (columns) that hold those values.
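
One way to apply these rules of thumb is to list what appears down each column of the tidy Titanic table. The sketch below is a plain-Python illustration, assuming the table is stored as a list of records. The names `tidy` and `distinct_values` are hypothetical and introduced only for this example.

```python
# Tidy Titanic table as a list of records (one record per row).
tidy = [
    {"fate": "perished", "sex": "male", "number": 1364},
    {"fate": "perished", "sex": "female", "number": 126},
    {"fate": "survived", "sex": "male", "number": 367},
    {"fate": "survived", "sex": "female", "number": 344},
]


def distinct_values(rows, variable):
    """List the distinct values found in one column, in order of appearance."""
    seen = []
    for row in rows:
        if row[variable] not in seen:
            seen.append(row[variable])
    return seen


fate_values = distinct_values(tidy, "fate")
sex_values = distinct_values(tidy, "sex")

# Rule 1: each candidate value is one form of a common object (a fate, a sex)
# and changes throughout the dataset, so its column has more than one entry.
fate_varies = len(fate_values) > 1
sex_varies = len(sex_values) > 1

print("fate:", fate_values)
print("sex:", sex_values)
```

The entries that appear across rows (perished, survived, male, female) are the values, while the column names fate and sex never appear inside a row.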