7 Tidyverse

7.1 Idea

When data are in tidy format, one can design functions around this format that perform data wrangling and analysis operations consistently and intuitively. The collection of packages containing these functions is called the “tidyverse.”

Note: The idea of tidy data was first proposed by Hadley Wickham, who also created several of the core packages, so the tidyverse used to be called (semi-seriously) the “hadleyverse.”

7.2 Packages

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/

7.3 Primary Packages

  • dplyr: data manipulation
  • ggplot2: data visualization
  • purrr: functional programming
  • readr: data import
  • tibble: modernization of data frames
  • tidyr: data tidying

Loading tidyverse:

> library(tidyverse)

7.4 Tidying Data

7.4.1 tidyr Package

This package provides a variety of functions that allow one to tidy data.

Importantly, it addresses two common ways in which data are untidy.

  1. gather(): Gathers a variable distributed across two or more columns into a single column.
  2. spread(): Spreads a column containing two or more variables into one column per variable.

7.4.2 Untidy Titanic Data

These data do not satisfy the definition of tidy data because the values of a single variable (fate: survived or perished) appear as column names.

> df <- tibble(sex=c("male", "female"), 
+              survived=c(367, 344),
+              perished=c(1364, 126))
> df
# A tibble: 2 x 3
  sex    survived perished
  <chr>     <dbl>    <dbl>
1 male        367     1364
2 female      344      126

7.4.3 gather()

We apply the gather() function to collect the survived and perished columns into two new columns: fate, which holds the former column names, and number, which holds the counts.

> df <- gather(df, survived, perished, 
+                key="fate", value="number")
> df
# A tibble: 4 x 3
  sex    fate     number
  <chr>  <chr>     <dbl>
1 male   survived    367
2 female survived    344
3 male   perished   1364
4 female perished    126
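
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The following is a minimal sketch of the same reshaping, assuming tidyr 1.0.0 or later is installed. The names df_untidy and df_long are introduced here only for illustration.

```r
library(tibble)
library(tidyr)

# Rebuild the untidy Titanic counts from Section 7.4.2
df_untidy <- tibble(sex = c("male", "female"),
                    survived = c(367, 344),
                    perished = c(1364, 126))

# Move the two count columns into a name column (fate) and a value column (number)
df_long <- pivot_longer(df_untidy,
                        cols = c(survived, perished),
                        names_to = "fate",
                        values_to = "number")
df_long
```

pivot_longer() walks through the input row by row, so the rows come out in a different order than with gather(), but the content is the same.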

7.4.4 spread()

This example shows that spread() performs the inverse operation of gather(). It is not an appropriate use of spread(), because it reverts the data to the untidy format.

> spread(df, key=fate, value=number)
# A tibble: 2 x 3
  sex    perished survived
  <chr>     <dbl>    <dbl>
1 female      126      344
2 male       1364      367
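
The inverse relationship can be checked directly. The sketch below gathers the Titanic counts and spreads them again, then compares the result with the starting table. The object names wide, long, and back are illustrative only.

```r
library(tibble)
library(tidyr)

wide <- tibble(sex = c("male", "female"),
               survived = c(367, 344),
               perished = c(1364, 126))

# Gather into long format, then spread back to wide format
long <- gather(wide, survived, perished, key = "fate", value = "number")
back <- spread(long, key = fate, value = number)

# spread() may return rows and columns in a different order,
# so restore the original order before comparing
back <- back[match(wide$sex, back$sex), names(wide)]
identical(back$survived, wide$survived) && identical(back$perished, wide$perished)
```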

7.4.5 Tidy with spread()

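Median home value and median income per city are two variables stored in a single column (dollars), with the median_value column indicating which variable each row holds. We therefore use spread() to give each variable its own column.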

> df
# A tibble: 4 x 3
  city    median_value dollars
  <chr>   <chr>          <dbl>
1 Boston  home          527300
2 Boston  income         71738
3 Raleigh home          215700
4 Raleigh income         65778
> spread(df, key=median_value, value=dollars)
# A tibble: 2 x 3
  city      home income
  <chr>    <dbl>  <dbl>
1 Boston  527300  71738
2 Raleigh 215700  65778
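
The tibble df above is printed without the code that builds it. A minimal sketch that reconstructs it from the printed values and tidies it follows. The names housing and housing_wide are illustrative, and the pivot_wider() call assumes tidyr 1.0.0 or later.

```r
library(tibble)
library(tidyr)

# Values copied from the printed tibble above
housing <- tibble(city = c("Boston", "Boston", "Raleigh", "Raleigh"),
                  median_value = c("home", "income", "home", "income"),
                  dollars = c(527300, 71738, 215700, 65778))

# One column per variable named in median_value
spread(housing, key = median_value, value = dollars)

# The same reshaping with pivot_wider(), which supersedes spread()
housing_wide <- pivot_wider(housing,
                            names_from = median_value,
                            values_from = dollars)
housing_wide
```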

7.5 Reshaping Data

7.5.1 Wide vs. Long Format

Tidy data are in “wide format” in that they have a column for each variable and there is one observed unit per row.

However, sometimes it’s useful to transform to “long format.” The simplest long format data have two columns. The first column contains the variable names and the second column contains the values of those variables. There are “wider” long format data that have additional columns that identify connections between observations.

Wide format data are useful for some analyses and long format data for others.

7.5.2 reshape2 Package

The reshape2 package has three important functions: melt(), dcast(), and acast(). They allow one to move between wide and long data formats.

> library("reshape2")
> library("datasets")
> data(airquality, package="datasets")
> names(airquality)
[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"    
> dim(airquality)
[1] 153   6
> airquality <- as_tibble(airquality)

7.5.3 Air Quality Data Set

> head(airquality)
# A tibble: 6 x 6
  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    41     190   7.4    67     5     1
2    36     118   8      72     5     2
3    12     149  12.6    74     5     3
4    18     313  11.5    62     5     4
5    NA      NA  14.3    56     5     5
6    28      NA  14.9    66     5     6
> tail(airquality)
# A tibble: 6 x 6
  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    14      20  16.6    63     9    25
2    30     193   6.9    70     9    26
3    NA     145  13.2    77     9    27
4    14     191  14.3    75     9    28
5    18     131   8      76     9    29
6    20     223  11.5    68     9    30

7.5.4 Melt

Melting can be thought of as melting a piece of solid metal (wide data), so it drips into long format.

> aql <- melt(airquality)
No id variables; using all as measure variables
> head(aql)
  variable value
1    Ozone    41
2    Ozone    36
3    Ozone    12
4    Ozone    18
5    Ozone    NA
6    Ozone    28
> tail(aql)
    variable value
913      Day    25
914      Day    26
915      Day    27
916      Day    28
917      Day    29
918      Day    30
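
The row numbers in the tail() output reflect how an unguided melt works: every cell of the wide table becomes its own row. A short sketch that checks this on the built-in data (aql_all is an illustrative name):

```r
library(reshape2)
data(airquality, package = "datasets")

# With no id variables, all six columns are treated as measured variables
aql_all <- melt(airquality)

# One row per original cell: 153 rows times 6 columns
nrow(aql_all) == nrow(airquality) * ncol(airquality)

# The former column names become the levels of the variable column
levels(aql_all$variable)
```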

7.5.5 Guided Melt

In the previous example, we lose the fact that a set of measurements occurred on a particular day and month, so we can do a guided melt to keep this information.

> aql <- melt(airquality, id.vars = c("Month", "Day"))
> head(aql)
  Month Day variable value
1     5   1    Ozone    41
2     5   2    Ozone    36
3     5   3    Ozone    12
4     5   4    Ozone    18
5     5   5    Ozone    NA
6     5   6    Ozone    28
> tail(aql)
    Month Day variable value
607     9  25     Temp    63
608     9  26     Temp    70
609     9  27     Temp    77
610     9  28     Temp    75
611     9  29     Temp    76
612     9  30     Temp    68
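
melt() also accepts arguments that control the names of the new columns and the handling of missing values. The sketch below uses variable.name, value.name, and na.rm. The column names "measurement" and "reading" and the object name aql_named are arbitrary choices for illustration.

```r
library(reshape2)
data(airquality, package = "datasets")

# Guided melt with custom column names, dropping rows whose value is NA
aql_named <- melt(airquality,
                  id.vars = c("Month", "Day"),
                  variable.name = "measurement",
                  value.name = "reading",
                  na.rm = TRUE)
head(aql_named)

# Each of the four measured columns contributes one row per non-missing value
nrow(aql_named)
```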

7.5.6 Casting

Casting allows us to go from long format to wide format data. It can be visualized as pouring molten metal (long format) into a cast to create a solid piece of metal (wide format).

Casting is more difficult because choices have to be made to determine how the wide format will be organized. It often takes some thought and experimentation for new users.

Let’s do an example with dcast(), which casts long format data into a data frame (acast() instead returns a vector, matrix, or array).

7.5.7 dcast()

> aqw <- dcast(aql, Month + Day ~ variable)
> head(aqw)
  Month Day Ozone Solar.R Wind Temp
1     5   1    41     190  7.4   67
2     5   2    36     118  8.0   72
3     5   3    12     149 12.6   74
4     5   4    18     313 11.5   62
5     5   5    NA      NA 14.3   56
6     5   6    28      NA 14.9   66
> tail(aqw)
    Month Day Ozone Solar.R Wind Temp
148     9  25    14      20 16.6   63
149     9  26    30     193  6.9   70
150     9  27    NA     145 13.2   77
151     9  28    14     191 14.3   75
152     9  29    18     131  8.0   76
153     9  30    20     223 11.5   68
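
One of the choices casting requires is what to do when several long format rows map to the same cell of the wide table. In the example above, Month + Day identifies each row uniquely, so no aggregation is needed. As a sketch of the other case, casting by Month alone puts about 30 daily values in each cell, and fun.aggregate tells dcast() how to combine them. The object names aql_obs and monthly are illustrative.

```r
library(reshape2)
data(airquality, package = "datasets")

# Long format keyed by Month and Day, with missing values removed
aql_obs <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)

# One row per month; each cell is the mean of that month's daily values
monthly <- dcast(aql_obs, Month ~ variable, fun.aggregate = mean)
monthly
```

If fun.aggregate is omitted in a case like this, dcast() falls back to counting the values in each cell, which is rarely what is wanted.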