Joins in data.table

2024

## Warning: package 'data.table' was built under R version 4.3.3

Basic Concept of Joins

Joins are used to combine two data tables based on a common key or keys.
E.g., you have a table of students and a table of their test scores, and you want to combine them into a single table.

Join using the `merge` function

The merge function is used to join two data tables.

merge(x, y, by = "ID", all.x = FALSE, all.y = FALSE)`

x, y: x is the the left table, and y is the right table to be merged.

by: The column names on which to merge the tables.
by.x, by.y: Used when the columns to join on have different names in the two tables.
all, all.x, all.y: control the type of join:

Left Join

A left join returns all rows from the left table, and the matched rows from the right table. The result is NULL in the right side when there is no match.

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

left_join_result <- merge(DT1, DT2, by = "ID", all.x = TRUE)
print(left_join_result)

## Key: <ID>
##       ID    Name Score
##    <num>  <char> <num>
## 1:     1   Alice    NA
## 2:     2     Bob    85
## 3:     3 Charlie    90

Right Join

A right join returns all rows from the right table, and the matched rows from the left table. The result is NULL in the left side when there is no match.

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

right_join_result <- merge(DT1, DT2, by = "ID", all.y = TRUE)
print(right_join_result)

## Key: <ID>
##       ID    Name Score
##    <num>  <char> <num>
## 1:     2     Bob    85
## 2:     3 Charlie    90
## 3:     4    <NA>    88

Inner Join

An inner join returns rows when there is at least one match in both tables. If there is no match, the rows are omitted.

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

inner_join_result <- merge(DT1, DT2, by = "ID", all = FALSE)
print(inner_join_result)

## Key: <ID>
##       ID    Name Score
##    <num>  <char> <num>
## 1:     2     Bob    85
## 2:     3 Charlie    90

Outer Join

An outer join returns all rows from both tables, with matching rows from both sides where available. If there is no match, the missing side will contain NULL.

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

outer_join_result <- merge(DT1, DT2, by = "ID", all = TRUE)
print(outer_join_result)

## Key: <ID>
##       ID    Name Score
##    <num>  <char> <num>
## 1:     1   Alice    NA
## 2:     2     Bob    85
## 3:     3 Charlie    90
## 4:     4    <NA>    88

Question

What is the result of the following code?

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

result <- merge(DT1, DT2, by = "ID", all.x = TRUE, all.y = TRUE)

Answer

Click here for the answer

  print(result)

  ## Key: <ID>
  ##       ID    Name Score
  ##    <num>  <char> <num>
  ## 1:     1   Alice    NA
  ## 2:     2     Bob    85
  ## 3:     3 Charlie    90
  ## 4:     4    <NA>    88

This is a full outer join – all.x and all.y are both TRUE – so all rows from both tables are returned.

Question

What is the result of the following code?

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

result <- merge(DT1, DT2, by = "ID")

Answer

Click here for the answer

  print(result)

  ## Key: <ID>
  ##       ID    Name Score
  ##    <num>  <char> <num>
  ## 1:     2     Bob    85
  ## 2:     3 Charlie    90

This performs an inner join, so only rows with matching ID values in both tables are returned. Rows with ID 2 and 3 will be included, each paired with their corresponding Name and Score.

Question

What is the result of the following code?

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

result <- merge(DT1, DT2, by = "ID", all.x = TRUE)

Answer

Click here for the answer

  print(result)

  ## Key: <ID>
  ##       ID    Name Score
  ##    <num>  <char> <num>
  ## 1:     1   Alice    NA
  ## 2:     2     Bob    85
  ## 3:     3 Charlie    90

This is a left join – all.x is TRUE – so all rows from DT1 and only the matching rows from DT2 are returned. Rows with ID 1, 2, and 3 from DT1 are included, with NA for Score where there’s no match.

Question

What is the result of the following code?

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

result <- merge(DT1, DT2, by = "ID", all.y = TRUE)

Answer

Click here for the answer

  print(result)

  ## Key: <ID>
  ##       ID    Name Score
  ##    <num>  <char> <num>
  ## 1:     2     Bob    85
  ## 2:     3 Charlie    90
  ## 3:     4    <NA>    88

This is a right join – all.y is TRUE – so all rows from DT2 and only the matching rows from DT1 are returned. Rows with ID 2, 3, and 4 from DT2 are included, with NA for Name where there’s no match.

Question

What is the result of the following code?

DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))

result <- merge(DT1, DT2, by = "ID", all = TRUE)

Answer

Click here for the answer

  print(result)

  ## Key: <ID>
  ##       ID    Name Score
  ##    <num>  <char> <num>
  ## 1:     1   Alice    NA
  ## 2:     2     Bob    85
  ## 3:     3 Charlie    90
  ## 4:     4    <NA>    88

This is a full outer join – all is TRUE – so all rows from both DT1 and DT2 are returned. Rows with ID 1, 2, 3, and 4 are included, with NA for Name or Score where there’s no match.

Basic Concept of Joins

Join using the merge function

Left Join

Right Join

Inner Join

Outer Join

Question

Answer

Question

Answer

Question

Answer

Question

Answer

Question

Answer

Join using the `merge` function