2024

Adding Columns in data.table

  • Use the := operator to add new columns by reference

  • This means that the original data.table is modified directly, rather than creating a new copy

  • This means we won’t need to reassign the data.table to itself after adding columns (no <- operator)

  • This makes adding columns very fast, even for large datasets

Adding Columns in data.table

  • When using the := operator, we don’t need to reassign the data.table to itself using the <- operator.
DT <- data.table(ID = 1:3, Name = c("Alice", "Bob", "Charlie"))

DT[, Age := c(25, 30, 35)]  # Adds a new column 'Age'

print(DT)
##       ID    Name   Age
##    <int>  <char> <num>
## 1:     1   Alice    25
## 2:     2     Bob    30
## 3:     3 Charlie    35

Updating Columns in data.table

  • The := operator is also used to update the values of existing columns in a data.table.

  • Still no <- operator needed!

DT[, Age := Age + 5]  # Updates the 'Age' column by adding 5 to each value
print(DT)
##       ID    Name   Age
##    <int>  <char> <num>
## 1:     1   Alice    30
## 2:     2     Bob    35
## 3:     3 Charlie    40

Removing Columns in data.table

  • The := operator can also be used to remove columns from a data.table.

  • You guessed it, no <- operator needed!

DT[, Age := NULL]  # Removes the 'Age' column
print(DT)
##       ID    Name
##    <int>  <char>
## 1:     1   Alice
## 2:     2     Bob
## 3:     3 Charlie

Conditional Updating

  • data.table allows for conditional updates. This means you can update or add columns based on conditions applied to other columns.
# Adds a 'Group' column based on a condition
DT[, Group := ifelse(Name == "Alice", "A", "B")]  
print(DT)
##       ID    Name  Group
##    <int>  <char> <char>
## 1:     1   Alice      A
## 2:     2     Bob      B
## 3:     3 Charlie      B

Adding Multiple Columns

  • You can add multiple columns at once by passing a list to the := operator.

  • The syntax .() is equivalent to list() when used in the context of data.table.

DT[, c("Height", "Weight") := .(c(160, 175, 180), c(60, 70, 75))]
print(DT)
##       ID    Name  Group Height Weight
##    <int>  <char> <char>  <num>  <num>
## 1:     1   Alice      A    160     60
## 2:     2     Bob      B    175     70
## 3:     3 Charlie      B    180     75

Question

  • Given a data.table DT with a column Score, what does the following code do?
DT[, Score := Score * 2]
  1. Doubles the value of a variable outside of DT named Score
  2. Creates a new table with all Score values doubled
  3. Doubles the values in the Score column for all rows in DT
  4. Generates an error because Score is not defined
Click here for the answer

The correct answer is C)

The := operator is used to update existing columns in data.table, in this case, doubling each value in the Score column.

Question

  • How do you remove a column named OldColumn from a data.table DT?
DT[, OldColumn := NULL] # A

remove(DT$OldColumn) # B

DT$OldColumn <- NULL # C

unset(DT, "OldColumn") # D
Click here for the answer
  1. DT[, OldColumn := NULL]
Using the := operator with NULL effectively removes the specified column from the data.table.

Question

  • What does the following code achieve in a data.table named DT?
DT[, Category := ifelse(Score > 50, "Pass", "Fail")]
  1. Filters DT to only include rows where Score is greater than 50

  2. Adds a new row named Category based on the condition

  3. Adds or updates a column named Category, assigning “Pass” or “Fail” based on the Score

  4. Divides the Score by 50 and assigns it to a new column Category

Answer

Click here for the answer
  1. Adds or updates a column named Category, assigning “Pass” or “Fail” based on the Score. This is an example of conditional updating using the := operator, where the new or updated column values depend on a condition applied to another column.

Question

  • What does the following code do?
DT[, c("mean", "err") := .(mean(Score), sd(Score) / sqrt(.N))]
  1. Calculates the mean and standard error of the Score column, assigning the results to global variables mean and err.

  2. Adds two new rows named mean and err with the calculated values.

  3. Adds or updates two columns named mean and err in DT, where mean is the average of Score and err is the standard error of Score.

  4. Generates an error because the syntax is incorrect.

Answer

Click here for the answer
  1. Adds or updates two columns named mean and err in DT, where mean is the average of Score and err is the standard error of Score. This line of code demonstrates how to perform grouped calculations and assign the results to new or existing columns in a data.table. The .N symbol represents the number of rows in each group (or in the whole table if not grouped), which is used to calculate the standard error of the Score.