9 Relational Data

9.1 Multiple Data Sets

In many data analyses you will have multiple tables of related data that must be combined in order to carry out your analysis.

The dplyr package includes a number of tools to facilitate this.

9.2 Toy Example

Here are two data frames that are related through a common variable called key.

> x <- tibble(key = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
> y <- tibble(key = c(1, 2, 4), y_val = c("y1", "y2", "y4"))
> x
# A tibble: 3 x 2
    key x_val
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
> y
# A tibble: 3 x 2
    key y_val
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y4   

9.3 Verbs

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data.

  • Mutating joins add new variables to one data frame from matching observations in another.
  • Filtering joins filter observations from one data frame based on whether or not they match an observation in the other table.
  • Set operations treat observations as if they were set elements.

From R for Data Science

9.4 inner_join()

An inner-join matches pairs of observations when their keys are equal.

> inner_join(x, y, key="key")
Joining, by = "key"
# A tibble: 2 x 3
    key x_val y_val
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

9.5 left_join()

A left-join keeps all observations in the first argument, x.

> left_join(x, y, key="key")
Joining, by = "key"
# A tibble: 3 x 3
    key x_val y_val
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
> x %>% left_join(y, key="key")
Joining, by = "key"
# A tibble: 3 x 3
    key x_val y_val
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

9.6 right_join()

A right-join keeps all observations in the second argument, y.

> right_join(x, y)
Joining, by = "key"
# A tibble: 3 x 3
    key x_val y_val
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y4   

9.7 full_join()

A full-join keeps all observations in either argument, x or y.

> full_join(x, y, key="key")
Joining, by = "key"
# A tibble: 4 x 3
    key x_val y_val
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y4   

9.8 anti_join()

An anti-join removes all observations in the first argument, x, that appear in the second argument, y.

> anti_join(x, y, key="key")
Joining, by = "key"
# A tibble: 1 x 2
    key x_val
  <dbl> <chr>
1     3 x3   

9.9 semi_join()

A semi-join keeps all observations in the first argument, x, that have a match in the second argument, y.

> semi_join(x, y, key="key")
Joining, by = "key"
# A tibble: 2 x 2
    key x_val
  <dbl> <chr>
1     1 x1   
2     2 x2   

9.10 Repeated Key Values

When one of the two data frames has repeated key values, the observations are repeated in the other data frame.

> y2
# A tibble: 4 x 2
    key y_val
  <dbl> <chr>
1     1 y1   
2     2 y2a  
3     2 y2b  
4     4 y4   
> x %>% left_join(y2, key="key")
Joining, by = "key"
# A tibble: 4 x 3
    key x_val y_val
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2a  
3     2 x2    y2b  
4     3 x3    <NA> 

9.11 Set Operations

One can perform traditional set operations on the rows of data frames.

  • intersect(x, y): return only observations in both x and y
  • union(x, y): return unique observations in x and y
  • setdiff(x, y): return observations in x, but not in y

From R for Data Science

Example setdiff()

> df1
# A tibble: 2 x 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
> df2
# A tibble: 2 x 2
      x     y
  <dbl> <dbl>
1     1     1
2     1     2
> setdiff(df1, df2)
# A tibble: 1 x 2
      x     y
  <dbl> <dbl>
1     2     1