9 Relational Data
9.1 Multiple Data Sets
In many data analyses you will have multiple tables of related data that must be combined in order to carry out your analysis.
The dplyr
package includes a number of tools to facilitate this.
9.2 Toy Example
Here are two data frames that are related through a common variable called key
.
> x <- tibble(key = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
> y <- tibble(key = c(1, 2, 4), y_val = c("y1", "y2", "y4"))
> x
# A tibble: 3 x 2
key x_val
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
> y
# A tibble: 3 x 2
key y_val
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y4
9.3 Verbs
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data.
- Mutating joins add new variables to one data frame from matching observations in another.
- Filtering joins filter observations from one data frame based on whether or not they match an observation in the other table.
- Set operations treat observations as if they were set elements.
From R for Data Science
9.4 inner_join()
An inner-join matches pairs of observations when their keys are equal.
> inner_join(x, y, key="key")
Joining, by = "key"
# A tibble: 2 x 3
key x_val y_val
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
9.5 left_join()
A left-join keeps all observations in the first argument, x
.
> left_join(x, y, key="key")
Joining, by = "key"
# A tibble: 3 x 3
key x_val y_val
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
> x %>% left_join(y, key="key")
Joining, by = "key"
# A tibble: 3 x 3
key x_val y_val
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
9.6 right_join()
A right-join keeps all observations in the second argument, y
.
> right_join(x, y)
Joining, by = "key"
# A tibble: 3 x 3
key x_val y_val
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y4
9.7 full_join()
A full-join keeps all observations in either argument, x
or y
.
> full_join(x, y, key="key")
Joining, by = "key"
# A tibble: 4 x 3
key x_val y_val
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y4
9.8 anti_join()
An anti-join removes all observations in the first argument, x
, that appear in the second argument, y
.
> anti_join(x, y, key="key")
Joining, by = "key"
# A tibble: 1 x 2
key x_val
<dbl> <chr>
1 3 x3
9.9 semi_join()
A semi-join keeps all observations in the first argument, x
, that have a match in the second argument, y
.
> semi_join(x, y, key="key")
Joining, by = "key"
# A tibble: 2 x 2
key x_val
<dbl> <chr>
1 1 x1
2 2 x2
9.10 Repeated Key Values
When one of the two data frames has repeated key
values, the observations are repeated in the other data frame.
> y2
# A tibble: 4 x 2
key y_val
<dbl> <chr>
1 1 y1
2 2 y2a
3 2 y2b
4 4 y4
> x %>% left_join(y2, key="key")
Joining, by = "key"
# A tibble: 4 x 3
key x_val y_val
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2a
3 2 x2 y2b
4 3 x3 <NA>
9.11 Set Operations
One can perform traditional set operations on the rows of data frames.
intersect(x, y)
: return only observations in bothx
andy
union(x, y)
: return unique observations inx
andy
setdiff(x, y)
: return observations inx
, but not iny
From R for Data Science
Example setdiff()
> df1
# A tibble: 2 x 2
x y
<dbl> <dbl>
1 1 1
2 2 1
> df2
# A tibble: 2 x 2
x y
<dbl> <dbl>
1 1 1
2 1 2
> setdiff(df1, df2)
# A tibble: 1 x 2
x y
<dbl> <dbl>
1 2 1