7.3 Primary Packages
dplyr: data manipulation
ggplot2: data visualization
purrr: functional programming
readr: data import
tibble: modernization of data frames
tidyr: data tidying
Loading tidyverse:
> library(tidyverse)
When the data are in tidy format, one can design functions around this format to consistently and intuitively perform data wrangling and analysis operations. The packages that provide these functions are collectively called the “tidyverse.”
Note: The idea of tidy data was first proposed by Hadley Wickham, who also created several of the core packages, so the collection used to be called (semi-seriously) the “hadleyverse.”
The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.
tidyr Package
This package provides a variety of functions that allow one to tidy data.
Importantly, it addresses two common ways in which data arrive untidy.
gather(): Gathers a variable distributed across two or more columns into a single column.
spread(): Spreads a column containing two or more variables into one column per variable.
The following data frame does not satisfy the definition of tidy data because a variable’s observations are distributed as column names.
> df <- tibble(sex=c("male", "female"),
+ survived=c(367, 344),
+ perished=c(1364, 126))
> df
# A tibble: 2 x 3
sex survived perished
<chr> <dbl> <dbl>
1 male 367 1364
2 female 344 126
gather()
We apply the gather() function to make a single column containing the survived and perished observations.
> df <- gather(df, survived, perished,
+ key="fate", value="number")
> df
# A tibble: 4 x 3
sex fate number
<chr> <chr> <dbl>
1 male survived 367
2 female survived 344
3 male perished 1364
4 female perished 126
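In tidyr 1.0.0 and later, gather() is documented as superseded by pivot_longer(). As a minimal sketch, assuming a recent tidyr is installed, the same reshaping could be written as shown here. The names df_wide and df_long are illustrative and are not part of the original example.

```r
library(tibble)
library(tidyr)

# Rebuild the untidy table from the example above (illustrative name: df_wide).
df_wide <- tibble(sex = c("male", "female"),
                  survived = c(367, 344),
                  perished = c(1364, 126))

# Collect the survived and perished columns into a fate/number pair of columns.
df_long <- pivot_longer(df_wide,
                        cols = c(survived, perished),
                        names_to = "fate",
                        values_to = "number")
df_long
```

One difference to be aware of: pivot_longer() builds the result row by row, so both rows for male come first, whereas gather() stacks the result column by column.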
spread()
This example is here to show that spread() performs the opposite operation of gather(). It isn’t used appropriately here because we revert the data back to untidy format.
> spread(df, key=fate, value=number)
# A tibble: 2 x 3
sex perished survived
<chr> <dbl> <dbl>
1 female 126 344
2 male 1364 367
spread()
In this data frame, median cost of home and median income per city are two variables included in a single column. This means we need to use spread().
> df
# A tibble: 4 x 3
city median_value dollars
<chr> <chr> <dbl>
1 Boston home 527300
2 Boston income 71738
3 Raleigh home 215700
4 Raleigh income 65778
> spread(df, key=median_value, value=dollars)
# A tibble: 2 x 3
city home income
<chr> <dbl> <dbl>
1 Boston 527300 71738
2 Raleigh 215700 65778
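The construction of this data frame is not shown above, so the sketch below rebuilds it from the printed values under the illustrative name df_city. It then uses pivot_wider(), which recent versions of tidyr document as the successor to spread(), to produce the same one-column-per-variable layout.

```r
library(tibble)
library(tidyr)

# Rebuilt from the printed output above; df_city is an illustrative name.
df_city <- tibble(city = c("Boston", "Boston", "Raleigh", "Raleigh"),
                  median_value = c("home", "income", "home", "income"),
                  dollars = c(527300, 71738, 215700, 65778))

# Each distinct value in median_value becomes its own column, filled from dollars.
df_city_wide <- pivot_wider(df_city,
                            names_from = median_value,
                            values_from = dollars)
df_city_wide
```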
Tidy data are in “wide format” in that they have a column for each variable and there is one observed unit per row.
However, sometimes it’s useful to transform to “long format.” The simplest long format data have two columns. The first column contains the variable names and the second column contains the values for the variables. There are “wider” long format data that have additional columns that identify connections between observations.
Wide format data are useful for some analyses and long format data for others.
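As a rough illustration of that trade-off, the sketch below uses R's built-in airquality data set. A correlation between two variables is most direct in wide format, while a summary computed separately for each variable is most direct in long format. The object names are illustrative.

```r
library(tidyr)

# Wide format: each variable is its own column, so pairwise analyses are direct.
wind_temp_cor <- cor(airquality$Wind, airquality$Temp)

# Long format: one row per (day, variable) pair, so grouped summaries are direct.
aq_long <- pivot_longer(airquality,
                        cols = c(Ozone, Solar.R, Wind, Temp),
                        names_to = "variable",
                        values_to = "value")

# Mean of each variable, ignoring missing values.
variable_means <- tapply(aq_long$value, aq_long$variable, mean, na.rm = TRUE)
variable_means
```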
reshape2 Package
The reshape2 package has three important functions: melt, dcast, and acast. It allows one to move between wide and long tidy data formats.
> library("reshape2")
> library("datasets")
> data(airquality, package="datasets")
> names(airquality)
[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
> dim(airquality)
[1] 153 6
> airquality <- as_tibble(airquality)
> head(airquality)
# A tibble: 6 x 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> tail(airquality)
# A tibble: 6 x 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 14 20 16.6 63 9 25
2 30 193 6.9 70 9 26
3 NA 145 13.2 77 9 27
4 14 191 14.3 75 9 28
5 18 131 8 76 9 29
6 20 223 11.5 68 9 30
Melting can be thought of as melting a piece of solid metal (wide data), so it drips into long format.
> aql <- melt(airquality)
No id variables; using all as measure variables
> head(aql)
variable value
1 Ozone 41
2 Ozone 36
3 Ozone 12
4 Ozone 18
5 Ozone NA
6 Ozone 28
> tail(aql)
variable value
913 Day 25
914 Day 26
915 Day 27
916 Day 28
917 Day 29
918 Day 30
In the previous example, we lose the fact that a set of measurements occurred on a particular day and month, so we can do a guided melt to keep this information.
> aql <- melt(airquality, id.vars = c("Month", "Day"))
> head(aql)
Month Day variable value
1 5 1 Ozone 41
2 5 2 Ozone 36
3 5 3 Ozone 12
4 5 4 Ozone 18
5 5 5 Ozone NA
6 5 6 Ozone 28
> tail(aql)
Month Day variable value
607 9 25 Temp 63
608 9 26 Temp 70
609 9 27 Temp 77
610 9 28 Temp 75
611 9 29 Temp 76
612 9 30 Temp 68
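The guided melt can be customized further. As a sketch, the call below uses the variable.name, value.name, and na.rm arguments of melt() to rename the two generated columns and to drop rows whose value is missing. The column names measurement and reading are illustrative choices.

```r
library(reshape2)

# Guided melt with custom column names; rows with a missing value are dropped.
aql_named <- melt(airquality,
                  id.vars = c("Month", "Day"),
                  variable.name = "measurement",
                  value.name = "reading",
                  na.rm = TRUE)
head(aql_named)

# Number of missing values among the four measured variables in the wide data.
n_missing <- sum(is.na(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]))
```

With na.rm = TRUE the long data have 153 * 4 - n_missing rows, rather than the 612 rows shown above.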
Casting allows us to go from long format to wide format data. It can be visualized as pouring molten metal (long format) into a cast to create a solid piece of metal (wide format).
Casting is more difficult because choices have to be made to determine how the wide format will be organized. It often takes some thought and experimentation for new users.
Let’s do an example with dcast, which is casting for data frames.
dcast()
> aqw <- dcast(aql, Month + Day ~ variable)
> head(aqw)
Month Day Ozone Solar.R Wind Temp
1 5 1 41 190 7.4 67
2 5 2 36 118 8.0 72
3 5 3 12 149 12.6 74
4 5 4 18 313 11.5 62
5 5 5 NA NA 14.3 56
6 5 6 28 NA 14.9 66
> tail(aqw)
Month Day Ozone Solar.R Wind Temp
148 9 25 14 20 16.6 63
149 9 26 30 193 6.9 70
150 9 27 NA 145 13.2 77
151 9 28 14 191 14.3 75
152 9 29 18 131 8.0 76
153 9 30 20 223 11.5 68
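One of the choices casting requires is what to do when the formula does not identify a single value per cell. As a sketch of this, casting by Month alone leaves about 30 daily values per cell, so an aggregation function has to be supplied. The same formula passed to acast() returns a matrix instead of a data frame. The object names below are illustrative.

```r
library(reshape2)

# Long format with Month and Day kept as identifiers, as in the guided melt.
aql <- melt(airquality, id.vars = c("Month", "Day"))

# Data frame of monthly means; na.rm = TRUE is passed on to mean().
monthly_df <- dcast(aql, Month ~ variable,
                    fun.aggregate = mean, na.rm = TRUE,
                    value.var = "value")

# The same summary as a matrix: months as row names, variables as column names.
monthly_mat <- acast(aql, Month ~ variable,
                     fun.aggregate = mean, na.rm = TRUE,
                     value.var = "value")

monthly_df
monthly_mat
```

If no aggregation function is given in this situation, dcast() falls back to counting the values in each cell and prints a message saying so, which is rarely what is wanted.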