3 Introduction to data.table

This chapter introduces the data.table library, a high-performance alternative to R’s standard data.frame. We will cover the basics of working with data.table objects, focusing on key operations such as creating data tables, subsetting rows, selecting and manipulating columns, and grouping operations. It explains the data.table syntax DT[i, j, by], where i filters rows, j selects columns or performs calculations, and by allows for group-wise operations.

Practical examples demonstrate how to:

  • Load the data.table library and create a data.table.

  • Subset rows using integers, logical expressions, and the %in% operator.

  • Select and operate on columns, including summarizing data with functions like mean() and sd().

  • Combine row selection (i) and column operations (j) for more complex data manipulation.

  • Group data using the by argument to apply functions within groups.

  • Add and modify columns within a data.table using the := operator.

  • Manage data table copies and understand the difference between shallow and deep copies.

  • Import data into a data.table using the fread function.

We will also see how to deal with different data formatting, using melt and dcast functions for reshaping data tables from wide format to long or vice versa.

3.1 Creating a data.table

# Load the data.table library 
library(data.table)

# Create vectors to later store in a data frame
x <- c('I', 'I', 'I', 'II', 'II', 'II', 'III', 'III', 'III', 'IV', 'IV', 'IV')
y <- c('a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e', 'f', 'f')
z <- rnorm(12)

# create a data table
dt <- data.table(x, y, z)
dt
##       x y            z
##  1:   I a  0.801076986
##  2:   I a  0.886827043
##  3:   I b -1.216112121
##  4:  II b -1.494334054
##  5:  II c  0.467194196
##  6:  II c -0.530958683
##  7: III d -0.891427670
##  8: III d  0.787302046
##  9: III e  0.006745659
## 10:  IV e -1.836147935
## 11:  IV f  0.052353254
## 12:  IV f -1.018931991

3.2 Subset rows in i

  • There are many ways to select rows.

  • Set i equal to a vector of integers.

  • Set i equal to a logical expression.

  • It will sometimes be useful to use the %in% operator when specifying i.

# select rows by passing integer indices
dt[1:4]
##     x y         z
## 1:  I a  0.801077
## 2:  I a  0.886827
## 3:  I b -1.216112
## 4: II b -1.494334
dt[4:7]
##      x y          z
## 1:  II b -1.4943341
## 2:  II c  0.4671942
## 3:  II c -0.5309587
## 4: III d -0.8914277
dt[c(2, 4)]
##     x y         z
## 1:  I a  0.886827
## 2: II b -1.494334
dt[c(2, 5, 11)]
##     x y          z
## 1:  I a 0.88682704
## 2: II c 0.46719420
## 3: IV f 0.05235325
# select rows by passing a logical expression
dt[x=='II']
##     x y          z
## 1: II b -1.4943341
## 2: II c  0.4671942
## 3: II c -0.5309587
dt[(x=='II') & (y=='c')]
##     x y          z
## 1: II c  0.4671942
## 2: II c -0.5309587
dt[(x=='II') & (y=='c') & (z > 0)]
##     x y         z
## 1: II c 0.4671942
# select rows by implementing a logical expression using
# `chaining`
dt[x=='II'][y=='c'][z > 0]
##     x y         z
## 1: II c 0.4671942
# select rows using the `%in%` operator
dt[x %in% c('I', 'II')]
##     x y          z
## 1:  I a  0.8010770
## 2:  I a  0.8868270
## 3:  I b -1.2161121
## 4: II b -1.4943341
## 5: II c  0.4671942
## 6: II c -0.5309587
dt[y %in% c('b', 'f')]
##     x y           z
## 1:  I b -1.21611212
## 2: II b -1.49433405
## 3: IV f  0.05235325
## 4: IV f -1.01893199
# Notice that using the `%in%` operator is essentially
# performing an *or* logical operation
dt[(x=='I') | (x=='II')]
##     x y          z
## 1:  I a  0.8010770
## 2:  I a  0.8868270
## 3:  I b -1.2161121
## 4: II b -1.4943341
## 5: II c  0.4671942
## 6: II c -0.5309587
dt[(y=='b') | (y=='f')]
##     x y           z
## 1:  I b -1.21611212
## 2: II b -1.49433405
## 3: IV f  0.05235325
## 4: IV f -1.01893199
  • If you want to operate on every row of some set of columns, leave the i argument blank.

  • This will result in the first character inside the square brackets being a comma, which looks a bit strange if you’re not used to it. Just remember to read it as select every row.

  • data.table also provides a keyword .N that can be useful in a number of ways. One such way is to use it in selecting rows as shown in the next code chunk.

dt # select every row
##       x y            z
##  1:   I a  0.801076986
##  2:   I a  0.886827043
##  3:   I b -1.216112121
##  4:  II b -1.494334054
##  5:  II c  0.467194196
##  6:  II c -0.530958683
##  7: III d -0.891427670
##  8: III d  0.787302046
##  9: III e  0.006745659
## 10:  IV e -1.836147935
## 11:  IV f  0.052353254
## 12:  IV f -1.018931991
dt[,] # select every row
##       x y            z
##  1:   I a  0.801076986
##  2:   I a  0.886827043
##  3:   I b -1.216112121
##  4:  II b -1.494334054
##  5:  II c  0.467194196
##  6:  II c -0.530958683
##  7: III d -0.891427670
##  8: III d  0.787302046
##  9: III e  0.006745659
## 10:  IV e -1.836147935
## 11:  IV f  0.052353254
## 12:  IV f -1.018931991
dt[1:.N] # select every row using the .N keyword
##       x y            z
##  1:   I a  0.801076986
##  2:   I a  0.886827043
##  3:   I b -1.216112121
##  4:  II b -1.494334054
##  5:  II c  0.467194196
##  6:  II c -0.530958683
##  7: III d -0.891427670
##  8: III d  0.787302046
##  9: III e  0.006745659
## 10:  IV e -1.836147935
## 11:  IV f  0.052353254
## 12:  IV f -1.018931991
dt[2:.N] # select all but the first row using the .N keyword
##       x y            z
##  1:   I a  0.886827043
##  2:   I b -1.216112121
##  3:  II b -1.494334054
##  4:  II c  0.467194196
##  5:  II c -0.530958683
##  6: III d -0.891427670
##  7: III d  0.787302046
##  8: III e  0.006745659
##  9:  IV e -1.836147935
## 10:  IV f  0.052353254
## 11:  IV f -1.018931991

3.3 Select columns with j

  • To operate on specific columns, set j equal to a list of the names of the columns you want.

  • Inside the square brackets of a data.table, list can be abbreviated with a ., which can also look strage at first. Once you get used to it, you will appreciate the brevity.

3.4 Select and operate on columns with j

  • You can do a lot more than just return the columns specified in j. In fact, you can perform any operation you want on them.
# Select every row of columns y and z the long way
dt[, list(y, z)]
##     y            z
##  1: a  0.801076986
##  2: a  0.886827043
##  3: b -1.216112121
##  4: b -1.494334054
##  5: c  0.467194196
##  6: c -0.530958683
##  7: d -0.891427670
##  8: d  0.787302046
##  9: e  0.006745659
## 10: e -1.836147935
## 11: f  0.052353254
## 12: f -1.018931991
# Select every row of columns y and z the cool way
dt[, .(y, z)]
##     y            z
##  1: a  0.801076986
##  2: a  0.886827043
##  3: b -1.216112121
##  4: b -1.494334054
##  5: c  0.467194196
##  6: c -0.530958683
##  7: d -0.891427670
##  8: d  0.787302046
##  9: e  0.006745659
## 10: e -1.836147935
## 11: f  0.052353254
## 12: f -1.018931991
# select rows 1:4 and return the z column of the result as a
# vector
dt[1:4, z]
## [1]  0.801077  0.886827 -1.216112 -1.494334
# select rows 1:4 and return the z column of the result as a
# data.table
dt[1:4, list(z)]
##            z
## 1:  0.801077
## 2:  0.886827
## 3: -1.216112
## 4: -1.494334
# select rows 1:4 and return the z column of the result as a
# data.table, but use `.` to abbreviate `list`. This is a
# thing specific to data.table.
dt[1:4, .(z)]
##            z
## 1:  0.801077
## 2:  0.886827
## 3: -1.216112
## 4: -1.494334
# select and operate on columns
# select rows 1:4 and return a summary statistic of the
# resulting z column
dt[1:4, mean(z)]
## [1] -0.2556355
dt[1:4, sd(z)]
## [1] 1.275245
# notice the default naming of the columns in the following
# result
dt[1:4, list(mean(z), sd(z))]
##            V1       V2
## 1: -0.2556355 1.275245
# control the names of the resulting columns
dt[1:4, list(z_mean = mean(z), z_sd = sd(z))]
##        z_mean     z_sd
## 1: -0.2556355 1.275245
# select all rows and compute summary statistics of the
# resulting data.table
dt[, .(mean(z), sd(z))] # default column names
##            V1        V2
## 1: -0.3322011 0.9602548
dt[, list(z_mean=mean(z), z_sd=sd(z))] # custom column names
##        z_mean      z_sd
## 1: -0.3322011 0.9602548

3.5 Combine i and j

  • It is straightforward to combine i and j.
# Select all rows for which x==II and return columns y and z
dt[x=='II', .(y, z)]
##    y          z
## 1: b -1.4943341
## 2: c  0.4671942
## 3: c -0.5309587

3.6 Group using by

  • The real magic of data.table comes in the power of the by argument.

  • by allows you apply the operation that you specify with j to separate groups of rows defined by the unique values in the columns specified in by.

  • Put another way, whatever column you pass to by will be split up into groups with each unique value getting its own group. Those groups will then be applied to the columns you specify in j and the operation also specified in j will be applied only to rows from the same group. Then, after all is done, everything is put back into a single data.table.

# inefficient way to group by x
dt[x=="I", mean(z)]
## [1] 0.157264
dt[x=="II", mean(z)]
## [1] -0.5193662
dt[x=="III", mean(z)]
## [1] -0.03245999
dt[x=="IV", mean(z)]
## [1] -0.9342422
# efficient and beautiful way to group by x
dt[, mean(z), .(x)]
##      x          V1
## 1:   I  0.15726397
## 2:  II -0.51936618
## 3: III -0.03245999
## 4:  IV -0.93424222
# the efficient and beautiful way of doing things extends to
# grouping by multiple variables with ease
dt[, mean(z), .(x, y)]
##      x y           V1
## 1:   I a  0.843952015
## 2:   I b -1.216112121
## 3:  II b -1.494334054
## 4:  II c -0.031882243
## 5: III d -0.052062812
## 6: III e  0.006745659
## 7:  IV e -1.836147935
## 8:  IV f -0.483289369
# return more than one summary statistic grouped by x
dt[, .(mean(z), sd(z)), .(x)] # default naming of resulting columns
##      x          V1        V2
## 1:   I  0.15726397 1.1901511
## 2:  II -0.51936618 0.9808155
## 3: III -0.03245999 0.8400513
## 4:  IV -0.93424222 0.9470947
dt[, .(z_mean=mean(z), z_sd=sd(z)), .(x)] # custom naming of resulting columns
##      x      z_mean      z_sd
## 1:   I  0.15726397 1.1901511
## 2:  II -0.51936618 0.9808155
## 3: III -0.03245999 0.8400513
## 4:  IV -0.93424222 0.9470947

3.7 Adding and modifying columns

  • Use the := operator inside the j argument of a data.table to add or modify a column.
# if you pass a single number, data.table will fill the
# entire column with that value
dt[, a := 9]

# otherwise, just pass a vector of the same length as the
# data.table
dt[, b := seq(2, 24, 2)]

# notice that you get an error if you try to create a column
# using a vector that isn't exactly the correct length
dt[, b := seq(2, 26, 2)]
## Error in `[.data.table`(dt, , `:=`(b, seq(2, 26, 2))): Supplied 13 items to be assigned to 12 items of column 'b'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
# redefine or modify an existing column the same way
dt[, b := seq(4, 48, 4)]

# remove a column using := and the NULL keyword
dt[, b := NULL]

# don't use the `<-` operator and the `:=`operator at the
# same time. It works but is inefficient
dt <- dt[, c := 9] # don't do this

# it all works the same with non-numerical columns
dt[, betsy := 'mouse']
dt[, betsy := c('mouse', 'rat')] # notice the error
## Error in `[.data.table`(dt, , `:=`(betsy, c("mouse", "rat"))): Supplied 2 items to be assigned to 12 items of column 'betsy'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
new_col_betsy <- rep(c('mouse', 'rat'), 6)
dt[, betsy := new_col_betsy] # add betsy

3.8 Be careful when copying data.table objects

  • dt2 = dt creates a shallow copy. This means that there is really only one data.table object in memory, but it can be referred to by both names.

  • This means that changes to dt2 will be reflected in dt and vice-versa.

  • You can show that this is the case by modifying dt2 and observing that dt also changes (see code chunks below).

  • To create a deep copy use dt2 = data.table(dt).

# print dt to see what it looks like before messing with it
dt 
##       x y            z a c betsy
##  1:   I a  0.801076986 9 9 mouse
##  2:   I a  0.886827043 9 9   rat
##  3:   I b -1.216112121 9 9 mouse
##  4:  II b -1.494334054 9 9   rat
##  5:  II c  0.467194196 9 9 mouse
##  6:  II c -0.530958683 9 9   rat
##  7: III d -0.891427670 9 9 mouse
##  8: III d  0.787302046 9 9   rat
##  9: III e  0.006745659 9 9 mouse
## 10:  IV e -1.836147935 9 9   rat
## 11:  IV f  0.052353254 9 9 mouse
## 12:  IV f -1.018931991 9 9   rat
# this is a shallow copy
dt_copy = dt
dt_copy[, a := NULL]

# notice we also deleted the `a` column from dt, not just from
# dt_copy
dt
##       x y            z c betsy
##  1:   I a  0.801076986 9 mouse
##  2:   I a  0.886827043 9   rat
##  3:   I b -1.216112121 9 mouse
##  4:  II b -1.494334054 9   rat
##  5:  II c  0.467194196 9 mouse
##  6:  II c -0.530958683 9   rat
##  7: III d -0.891427670 9 mouse
##  8: III d  0.787302046 9   rat
##  9: III e  0.006745659 9 mouse
## 10:  IV e -1.836147935 9   rat
## 11:  IV f  0.052353254 9 mouse
## 12:  IV f -1.018931991 9   rat
# this is a deep copy
dt_copy = data.table(dt)
dt_copy[, x := NULL]

# now our changes to dt_copy didn't also change dt
dt
##       x y            z c betsy
##  1:   I a  0.801076986 9 mouse
##  2:   I a  0.886827043 9   rat
##  3:   I b -1.216112121 9 mouse
##  4:  II b -1.494334054 9   rat
##  5:  II c  0.467194196 9 mouse
##  6:  II c -0.530958683 9   rat
##  7: III d -0.891427670 9 mouse
##  8: III d  0.787302046 9   rat
##  9: III e  0.006745659 9 mouse
## 10:  IV e -1.836147935 9   rat
## 11:  IV f  0.052353254 9 mouse
## 12:  IV f -1.018931991 9   rat

3.9 Getting data into a data.table

  • The first and primary method that will use to get data into a data.table is the fread function, which is part of the data.table library.

  • The following code chunk reads a .csv file into data.table.

# Location of the csv file
# can be on the internet or on your local machine
f <- 'https://crossley.github.io/book_stats/data/maze/maze.csv'

# use the fread function
d <-fread(f)
d
##       rat maze run     time
##    1:   1    1   1 182.5532
##    2:   1    1   2 180.5880
##    3:   1    1   3 161.5351
##    4:   1    1   4 198.2432
##    5:   1    1   5 155.6673
##   ---                      
## 6012:  47    8  12 323.0179
## 6013:  47    8  13 205.0180
## 6014:  47    8  14 228.6531
## 6015:  47    8  15 234.1131
## 6016:  47    8  16 305.8473
  • A second method that we will sometimes use to get data into a data.table is as follows:
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
               1         30 1998-11-26 2000-01-29         NA
               2         27 1996-06-22         NA         NA
               3         26 2002-07-11 2004-04-05 2007-09-02
               4         32 2004-10-10 2009-08-27 2012-07-21
               5         29 2000-12-05 2005-02-28         NA"

DT <- fread(s1)
DT
##    family_id age_mother dob_child1 dob_child2 dob_child3
## 1:         1         30 1998-11-26 2000-01-29       <NA>
## 2:         2         27 1996-06-22       <NA>       <NA>
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21
## 5:         5         29 2000-12-05 2005-02-28       <NA>
  • This can be useful in a few odd circumstances. For example, you may wish to copy some text in another program and paste it over into your R session to turn it into a data.table.

3.10 Wide vs long format

  • The following data.table is in long format:
##     run rat  maze_time
##  1:   1   1  88.417468
##  2:   1   2  18.503129
##  3:   1   3 124.226348
##  4:   1   4 -42.509839
##  5:   1   5 136.594112
##  6:   2   1 124.841265
##  7:   2   2 106.528818
##  8:   2   3 101.915639
##  9:   2   4 125.733838
## 10:   2   5  35.098992
## 11:   3   1  88.083124
## 12:   3   2 166.413570
## 13:   3   3 210.096910
## 14:   3   4 114.377148
## 15:   3   5  88.224640
## 16:   4   1   8.793163
## 17:   4   2 -43.758624
## 18:   4   3  20.291047
## 19:   4   4 225.408311
## 20:   4   5 177.214219
## 21:   5   1  78.048437
## 22:   5   2  57.518972
## 23:   5   3  58.101990
## 24:   5   4 199.698686
## 25:   5   5  72.422197
##     run rat  maze_time
  • The following data.table is in wide format:
##    maze_time_rat_1 maze_time_rat_2 maze_time_rat_3 maze_time_rat_4
## 1:       223.83041       16.795670       183.20471        20.46609
## 2:        72.06537      -16.657055        77.26713        94.51225
## 3:       275.79031       -6.559058       126.61374       125.01413
## 4:       156.07461      -56.378205        62.32973       161.82433
## 5:        54.72160      215.653700       344.13646        82.73765
##    maze_time_rat_5 run
## 1:     -122.390027   1
## 2:      -26.361438   2
## 3:      135.872890   3
## 4:       98.895452   4
## 5:        5.935084   5
  • Each format has its pros and cons, and you may find yourself needing to switch between them even within the same dataset.

  • To go from wide to long format you should use the melt function from the data.table library.

  • To go from long to wide format you should use the dcast function from the data.table library.

  • Using these functions is just a bit beyond the scope of this lecture, but we will return later in the unit to learn about them. If you are feeling confident in your use of data.table then you may enjoy getting a little ahead of the game by reading the following vignette:

    reshaping data.table vignette