# 6 Tidy Data

## 6.1 Motivation

“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

From *R for Data Science*.

## 6.2 Definition

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

From Wickham H (2014), “Tidy Data”, *Journal of Statistical Software*.

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

From Wickham H (2014), “Tidy Data”, *Journal of Statistical Software*.

## 6.3 Example: Titanic Data

According to the `Titanic` data from R's `datasets` package, 367 males survived, 1364 males perished, 344 females survived, and 126 females perished.

How should we organize these data?

### 6.3.1 Intuitive Format

|        | Survived | Perished |
|--------|----------|----------|
| Male   | 367      | 1364     |
| Female | 344      | 126      |

### 6.3.2 Tidy Format

| fate     | sex    | number |
|----------|--------|--------|
| perished | male   | 1364   |
| perished | female | 126    |
| survived | male   | 367    |
| survived | female | 344    |
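
The move from the intuitive layout to the tidy one can be sketched in a few lines of code. The example below is a minimal illustration in plain Python (an assumption of this sketch, since the data come from R); the names `wide`, `to_tidy`, and `tidy` are hypothetical and chosen only for this example. It turns each cell of the two-way table into one row with the columns fate, sex, and number.

```python
# Intuitive (wide) layout: one entry per sex, one count per fate.
# Counts are the ones quoted in the text for the Titanic data.
wide = {
    "male": {"survived": 367, "perished": 1364},
    "female": {"survived": 344, "perished": 126},
}


def to_tidy(wide_table):
    """Return one {fate, sex, number} record per cell of the wide table."""
    rows = []
    for sex, counts in wide_table.items():
        for fate, number in counts.items():
            rows.append({"fate": fate, "sex": sex, "number": number})
    # Order rows by fate, then male before female, to mirror the tidy table.
    return sorted(rows, key=lambda r: (r["fate"], r["sex"] != "male"))


tidy = to_tidy(wide)
for row in tidy:
    print(row["fate"], row["sex"], row["number"])
```

Each record now holds exactly one observation (a fate and sex combination) and each key is one variable, so filtering or summing by fate or by sex only needs to look at a single column.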

## 6.4 Rules of Thumb

1. Something is a value if it represents different forms of a common object and it changes throughout the dataset.
2. Something is a value if the data can be arranged so that it appears across rows within a column and this makes sense.

For example, in the Titanic data, fate and sex do not satisfy these criteria (they are variables), but perished/survived and female/male do (they are values).
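
One informal way to apply these rules of thumb is to list what actually appears down each column of the tidy table. The sketch below is again plain Python with hypothetical names (`tidy`, `values_by_variable`, `levels`), not a method taken from the source. It collects the distinct entries found in each column, in order of first appearance.

```python
# Tidy Titanic table from the example, one record per row.
tidy = [
    {"fate": "perished", "sex": "male", "number": 1364},
    {"fate": "perished", "sex": "female", "number": 126},
    {"fate": "survived", "sex": "male", "number": 367},
    {"fate": "survived", "sex": "female", "number": 344},
]


def values_by_variable(rows):
    """Map each column name to the distinct values seen in that column."""
    seen = {}
    for row in rows:
        for variable, value in row.items():
            column = seen.setdefault(variable, [])
            if value not in column:
                column.append(value)
    return seen


levels = values_by_variable(tidy)
print(levels["fate"])  # distinct entries found in the fate column
print(levels["sex"])   # distinct entries found in the sex column
```

The entries that change from row to row within a column (perished, survived, male, female) are the values, while fate and sex appear only as column names.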