19.9 Specifying join keys

-By default, left_join() will use all variables that appear in both data frames as the join key, the so called natural join.

  • This is a useful heuristic, but it doesn’t always work, like when joining flights and planes which each have a year column but they mean different things.
flights2 |> 
  left_join(planes)
## Joining with `by = join_by(year, tailnum)`
## # A tibble: 336,776 × 13
##     year time_hour           origin dest  tailnum carrier type  manufacturer
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      <NA>  <NA>        
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      <NA>  <NA>        
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      <NA>  <NA>        
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>  <NA>        
## # ℹ 336,766 more rows
## # ℹ 5 more variables: model <chr>, engines <int>, seats <int>, speed <int>,
## #   engine <chr>
  • We get a lot of missing matches because our join is trying to use tailnum and year as a compound key. In this case, we only want to join on tailnum so we need to provide an explicit specification with join_by(), where join_by(tailnum) is short for join_by(tailnum == tailnum).
flights2 |> 
  left_join(planes, join_by(tailnum))
## # A tibble: 336,776 × 14
##    year.x time_hour           origin dest  tailnum carrier year.y type          
##     <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>         
##  1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing mu…
##  2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing mu…
##  3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing mu…
##  4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing mu…
##  5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing mu…
##  6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing mu…
##  7   2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6        2000 Fixed wing mu…
##  8   2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV        1998 Fixed wing mu…
##  9   2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6        2004 Fixed wing mu…
## 10   2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA          NA <NA>          
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>
  • You can also specify different join keys in each table. For example, there are two ways to join the flight2 and airports table: either by dest or origin:
flights2 |> 
  left_join(airports, join_by(dest == faa))
## # A tibble: 336,776 × 13
##     year time_hour           origin dest  tailnum carrier name         lat   lon
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <dbl> <dbl>
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bu…  30.0 -95.3
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bu…  30.0 -95.3
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl  25.8 -80.3
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>        NA    NA  
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfiel…  33.6 -84.4
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago O…  42.0 -87.9
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      Fort Laud…  26.1 -80.2
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      Washingto…  38.9 -77.5
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      Orlando I…  28.4 -81.3
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      Chicago O…  42.0 -87.9
## # ℹ 336,766 more rows
## # ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
flights2 |> 
  left_join(airports, join_by(origin == faa))
## # A tibble: 336,776 × 13
##     year time_hour           origin dest  tailnum carrier name         lat   lon
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <dbl> <dbl>
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newark Li…  40.7 -74.2
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La Guardia  40.8 -73.9
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John F Ke…  40.6 -73.8
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John F Ke…  40.6 -73.8
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La Guardia  40.8 -73.9
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newark Li…  40.7 -74.2
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      Newark Li…  40.7 -74.2
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      La Guardia  40.8 -73.9
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      John F Ke…  40.6 -73.8
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      La Guardia  40.8 -73.9
## # ℹ 336,766 more rows
## # ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>