6 Tidy Data

6.1 Motivation

“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

From R for Data Science.

6.2 Definition

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

From Wickham (2014), “Tidy Data”, Journal of Statistical Software

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

From Wickham (2014), “Tidy Data”, Journal of Statistical Software

6.3 Example: Titanic Data

According to the Titanic dataset in R's datasets package, 367 males survived, 1364 males perished, 344 females survived, and 126 females perished.

How should we organize these data?

6.3.1 Intuitive Format

	Survived	Perished
Male	367	1364
Female	344	126

6.3.2 Tidy Format

fate	sex	number
perished	male	1364
perished	female	126
survived	male	367
survived	female	344
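
The reshaping from the intuitive layout to the tidy layout can be sketched in a few lines. This is a minimal illustration in plain Python, not the course's own code; the names `wide` and `to_tidy` are hypothetical, and the counts are the four figures quoted above.

```python
# Intuitive ("wide") layout: one entry per sex, one count per fate.
# The counts are the four Titanic figures quoted in the text.
wide = {
    "male": {"survived": 367, "perished": 1364},
    "female": {"survived": 344, "perished": 126},
}


def to_tidy(table):
    """Return one record per (fate, sex) combination (hypothetical helper)."""
    rows = []
    for sex, counts in table.items():
        for fate, number in counts.items():
            # Each record holds one observation: a fate, a sex, and a count.
            rows.append({"fate": fate, "sex": sex, "number": number})
    # Order rows as in the tidy table: by fate, with males first within each fate.
    return sorted(rows, key=lambda r: (r["fate"], r["sex"] != "male"))


tidy = to_tidy(wide)
for row in tidy:
    print(row["fate"], row["sex"], row["number"], sep="\t")
```

In the tidy result, what used to be column headers (survived, perished) and row labels (male, female) are stored as ordinary values in the fate and sex columns.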

6.4 Rules of Thumb

Something is a value if it represents one of several forms of a common object and it changes throughout the dataset.
Something is a value if the data can be arranged so that it appears across rows within a column, and this arrangement makes sense.

For example, in the Titanic data, fate and sex do not satisfy these criteria (they are variables), but perished/survived and female/male do (they are values).
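
One way to check these rules of thumb is to list the distinct entries found in each column of the tidy table. The sketch below uses plain Python and assumes the tidy Titanic rows from Section 6.3.2; the names `tidy`, `variables`, and `values` are hypothetical.

```python
# Tidy Titanic table from Section 6.3.2, one dictionary per row.
tidy = [
    {"fate": "perished", "sex": "male", "number": 1364},
    {"fate": "perished", "sex": "female", "number": 126},
    {"fate": "survived", "sex": "male", "number": 367},
    {"fate": "survived", "sex": "female", "number": 344},
]

# Column names play the role of variables.
variables = list(tidy[0].keys())

# The distinct entries within each column are that variable's values.
values = {var: sorted({row[var] for row in tidy}) for var in variables}

for var in variables:
    print(var, values[var])
```

The names fate and sex appear only once, as column names, while perished/survived and female/male change from row to row within their columns, which is what the rules of thumb describe.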