# 9 Relational Data

## 9.1 Multiple Data Sets

In many data analyses you will have multiple tables of related data that must be combined in order to carry out your analysis.

The dplyr package includes a number of tools to facilitate this.

## 9.2 Toy Example

Here are two data frames that are related through a common variable called key.

> x <- tibble(key = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
> y <- tibble(key = c(1, 2, 4), y_val = c("y1", "y2", "y4"))
> x
# A tibble: 3 x 2
key x_val
<dbl> <chr>
1     1 x1
2     2 x2
3     3 x3
> y
# A tibble: 3 x 2
key y_val
<dbl> <chr>
1     1 y1
2     2 y2
3     4 y4   

## 9.3 Verbs

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data.

• Mutating joins add new variables to one data frame from matching observations in another.
• Filtering joins filter observations from one data frame based on whether or not they match an observation in the other table.
• Set operations treat observations as if they were set elements.

From R for Data Science

## 9.4inner_join()

An inner-join matches pairs of observations when their keys are equal.

> inner_join(x, y, key="key")
Joining, by = "key"
# A tibble: 2 x 3
key x_val y_val
<dbl> <chr> <chr>
1     1 x1    y1
2     2 x2    y2   

## 9.5left_join()

A left-join keeps all observations in the first argument, x.

> left_join(x, y, key="key")
Joining, by = "key"
# A tibble: 3 x 3
key x_val y_val
<dbl> <chr> <chr>
1     1 x1    y1
2     2 x2    y2
3     3 x3    <NA> 
> x %>% left_join(y, key="key")
Joining, by = "key"
# A tibble: 3 x 3
key x_val y_val
<dbl> <chr> <chr>
1     1 x1    y1
2     2 x2    y2
3     3 x3    <NA> 

## 9.6right_join()

A right-join keeps all observations in the second argument, y.

> right_join(x, y)
Joining, by = "key"
# A tibble: 3 x 3
key x_val y_val
<dbl> <chr> <chr>
1     1 x1    y1
2     2 x2    y2
3     4 <NA>  y4   

## 9.7full_join()

A full-join keeps all observations in either argument, x or y.

> full_join(x, y, key="key")
Joining, by = "key"
# A tibble: 4 x 3
key x_val y_val
<dbl> <chr> <chr>
1     1 x1    y1
2     2 x2    y2
3     3 x3    <NA>
4     4 <NA>  y4   

## 9.8anti_join()

An anti-join removes all observations in the first argument, x, that appear in the second argument, y.

> anti_join(x, y, key="key")
Joining, by = "key"
# A tibble: 1 x 2
key x_val
<dbl> <chr>
1     3 x3   

## 9.9semi_join()

A semi-join keeps all observations in the first argument, x, that have a match in the second argument, y.

> semi_join(x, y, key="key")
Joining, by = "key"
# A tibble: 2 x 2
key x_val
<dbl> <chr>
1     1 x1
2     2 x2   

## 9.10 Repeated Key Values

When one of the two data frames has repeated key values, the observations are repeated in the other data frame.

> y2
# A tibble: 4 x 2
key y_val
<dbl> <chr>
1     1 y1
2     2 y2a
3     2 y2b
4     4 y4   
> x %>% left_join(y2, key="key")
Joining, by = "key"
# A tibble: 4 x 3
key x_val y_val
<dbl> <chr> <chr>
1     1 x1    y1
2     2 x2    y2a
3     2 x2    y2b
4     3 x3    <NA> 

## 9.11 Set Operations

One can perform traditional set operations on the rows of data frames.

• intersect(x, y): return only observations in both x and y
• union(x, y): return unique observations in x and y
• setdiff(x, y): return observations in x, but not in y

From R for Data Science

#### Example setdiff()

> df1
# A tibble: 2 x 2
x     y
<dbl> <dbl>
1     1     1
2     2     1
> df2
# A tibble: 2 x 2
x     y
<dbl> <dbl>
1     1     1
2     1     2
> setdiff(df1, df2)
# A tibble: 1 x 2
x     y
<dbl> <dbl>
1     2     1