Joins are used to combine two data tables based on a common key or keys.
E.g., you have a table of students and a table of their test scores, and you want to combine them into a single table.
2024
Joins are used to combine two data tables based on a common key or keys.
E.g., you have a table of students and a table of their test scores, and you want to combine them into a single table.
merge
functionmerge
function is used to join two data tables.merge(x, y, by = "ID", all.x = FALSE, all.y = FALSE)`
x
is the the left table, and y
is the right table to be merged.by: The column names on which to merge the tables.
by.x, by.y: Used when the columns to join on have different names in the two tables.
all, all.x, all.y: control the type of join:
A left join returns all rows from the left table, and the matched rows from the right table. The result is NULL in the right side when there is no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) left_join_result <- merge(DT1, DT2, by = "ID", all.x = TRUE) print(left_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90
A right join returns all rows from the right table, and the matched rows from the left table. The result is NULL in the left side when there is no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) right_join_result <- merge(DT1, DT2, by = "ID", all.y = TRUE) print(right_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90 ## 3: 4 <NA> 88
An inner join returns rows when there is at least one match in both tables. If there is no match, the rows are omitted.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) inner_join_result <- merge(DT1, DT2, by = "ID", all = FALSE) print(inner_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90
An outer join returns all rows from both tables, with matching rows from both sides where available. If there is no match, the missing side will contain NULL.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) outer_join_result <- merge(DT1, DT2, by = "ID", all = TRUE) print(outer_join_result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90 ## 4: 4 <NA> 88
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) result <- merge(DT1, DT2, by = "ID", all.x = TRUE, all.y = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90 ## 4: 4 <NA> 88This is a full outer join –
all.x
and all.y
are both TRUE – so all rows from both tables are returned.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) result <- merge(DT1, DT2, by = "ID")
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90This performs an inner join, so only rows with matching
ID
values in both tables are returned. Rows with ID
2 and 3 will be included, each paired with their corresponding Name
and Score
.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) result <- merge(DT1, DT2, by = "ID", all.x = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90This is a left join –
all.x
is TRUE – so all rows from DT1
and only the matching rows from DT2
are returned. Rows with ID
1, 2, and 3 from DT1
are included, with NA
for Score
where there’s no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) result <- merge(DT1, DT2, by = "ID", all.y = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 2 Bob 85 ## 2: 3 Charlie 90 ## 3: 4 <NA> 88This is a right join –
all.y
is TRUE – so all rows from DT2
and only the matching rows from DT1
are returned. Rows with ID
2, 3, and 4 from DT2
are included, with NA
for Name
where there’s no match.
DT1 <- data.table(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie")) DT2 <- data.table(ID = c(2, 3, 4), Score = c(85, 90, 88)) result <- merge(DT1, DT2, by = "ID", all = TRUE)
print(result)
## Key: <ID> ## ID Name Score ## <num> <char> <num> ## 1: 1 Alice NA ## 2: 2 Bob 85 ## 3: 3 Charlie 90 ## 4: 4 <NA> 88This is a full outer join –
all
is TRUE – so all rows from both DT1
and DT2
are returned. Rows with ID
1, 2, 3, and 4 are included, with NA
for Name
or Score
where there’s no match.