Joins are used to combine two data tables based on a common key or keys.
E.g., you have a table of students and a table of their test scores, and you want to combine them into a single table.
2024
## Warning: package 'data.table' was built under R version 4.3.3
Joins are used to combine two data tables based on a common key or keys.
E.g., you have a table of students and a table of their test scores, and you want to combine them into a single table.
merge functionmerge function is used to join two data tables.merge(x, y, by = "ID", all.x = FALSE, all.y = FALSE)`
x is the the left table, and y is the right table to be merged.by: The column names on which to merge the tables.
by.x, by.y: Used when the columns to join on have different names in the two tables.
all, all.x, all.y: control the type of join:
A left join returns all rows from the left table, and the matched rows from the right table. The result is NULL in the right side when there is no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
left_join_result <- merge(DT1, DT2, by = "ID", all.x = TRUE)
print(left_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90
A right join returns all rows from the right table, and the matched rows from the left table. The result is NULL in the left side when there is no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
right_join_result <- merge(DT1, DT2, by = "ID", all.y = TRUE)
print(right_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90 ## 3: 4 <NA> 88
An inner join returns rows when there is at least one match in both tables. If there is no match, the rows are omitted.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
inner_join_result <- merge(DT1, DT2, by = "ID", all = FALSE)
print(inner_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90
An outer join returns all rows from both tables, with matching rows from both sides where available. If there is no match, the missing side will contain NULL.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
outer_join_result <- merge(DT1, DT2, by = "ID", all = TRUE)
print(outer_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90 ## 4: 4 <NA> 88
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
result <- merge(DT1, DT2, by = "ID", all.x = TRUE, all.y = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90 ## 4: 4 <NA> 88This is a full outer join –
all.x and all.y are both TRUE – so all rows from both tables are returned.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
result <- merge(DT1, DT2, by = "ID")
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90This performs an inner join, so only rows with matching
ID values in both tables are returned. Rows with ID 2 and 3 will be included, each paired with their corresponding Name and Score.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
result <- merge(DT1, DT2, by = "ID", all.x = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90This is a left join –
all.x is TRUE – so all rows from DT1 and only the matching rows from DT2 are returned. Rows with ID 1, 2, and 3 from DT1 are included, with NA for Score where there’s no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
result <- merge(DT1, DT2, by = "ID", all.y = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90 ## 3: 4 <NA> 88This is a right join –
all.y is TRUE – so all rows from DT2 and only the matching rows from DT1 are returned. Rows with ID 2, 3, and 4 from DT2 are included, with NA for Name where there’s no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88))
result <- merge(DT1, DT2, by = "ID", all = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90 ## 4: 4 <NA> 88This is a full outer join –
all is TRUE – so all rows from both DT1 and DT2 are returned. Rows with ID 1, 2, 3, and 4 are included, with NA for Name or Score where there’s no match.