3 Introduction to data.table
This chapter introduces the data.table
library, a
high-performance alternative to R’s standard data.frame
.
We will cover the basics of working with data.table
objects, focusing on key operations such as creating data
tables, subsetting rows, selecting and manipulating columns,
and grouping operations. It explains the data.table
syntax
DT[i, j, by]
, where i
filters rows, j
selects columns
or performs calculations, and by
allows for group-wise
operations.
Practical examples demonstrate how to:
Load the
data.table
library and create adata.table
.Subset rows using integers, logical expressions, and the
%in%
operator.Select and operate on columns, including summarizing data with functions like
mean()
andsd()
.Combine row selection (
i
) and column operations (j
) for more complex data manipulation.Group data using the
by
argument to apply functions within groups.Add and modify columns within a
data.table
using the:=
operator.Manage data table copies and understand the difference between shallow and deep copies.
Import data into a
data.table
using thefread
function.
We will also see how to deal with different data formatting,
using melt
and dcast
functions for reshaping data tables
from wide format to long or vice versa.
3.1 Creating a data.table
# Load the data.table library
library(data.table)
# Create vectors to later store in a data frame
x <- c('I', 'I', 'I', 'II', 'II', 'II', 'III', 'III', 'III', 'IV', 'IV', 'IV')
y <- c('a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e', 'f', 'f')
z <- rnorm(12)
# create a data table
dt <- data.table(x, y, z)
dt
## x y z
## 1: I a 0.801076986
## 2: I a 0.886827043
## 3: I b -1.216112121
## 4: II b -1.494334054
## 5: II c 0.467194196
## 6: II c -0.530958683
## 7: III d -0.891427670
## 8: III d 0.787302046
## 9: III e 0.006745659
## 10: IV e -1.836147935
## 11: IV f 0.052353254
## 12: IV f -1.018931991
3.2 Subset rows in i
There are many ways to select rows.
Set
i
equal to a vector of integers.Set
i
equal to a logical expression.It will sometimes be useful to use the
%in%
operator when specifyingi
.
## x y z
## 1: I a 0.801077
## 2: I a 0.886827
## 3: I b -1.216112
## 4: II b -1.494334
## x y z
## 1: II b -1.4943341
## 2: II c 0.4671942
## 3: II c -0.5309587
## 4: III d -0.8914277
## x y z
## 1: I a 0.886827
## 2: II b -1.494334
## x y z
## 1: I a 0.88682704
## 2: II c 0.46719420
## 3: IV f 0.05235325
## x y z
## 1: II b -1.4943341
## 2: II c 0.4671942
## 3: II c -0.5309587
## x y z
## 1: II c 0.4671942
## 2: II c -0.5309587
## x y z
## 1: II c 0.4671942
## x y z
## 1: II c 0.4671942
## x y z
## 1: I a 0.8010770
## 2: I a 0.8868270
## 3: I b -1.2161121
## 4: II b -1.4943341
## 5: II c 0.4671942
## 6: II c -0.5309587
## x y z
## 1: I b -1.21611212
## 2: II b -1.49433405
## 3: IV f 0.05235325
## 4: IV f -1.01893199
# Notice that using the `%in%` operator is essentially
# performing an *or* logical operation
dt[(x=='I') | (x=='II')]
## x y z
## 1: I a 0.8010770
## 2: I a 0.8868270
## 3: I b -1.2161121
## 4: II b -1.4943341
## 5: II c 0.4671942
## 6: II c -0.5309587
## x y z
## 1: I b -1.21611212
## 2: II b -1.49433405
## 3: IV f 0.05235325
## 4: IV f -1.01893199
If you want to operate on every row of some set of columns, leave the
i
argument blank.This will result in the first character inside the square brackets being a comma, which looks a bit strange if you’re not used to it. Just remember to read it as select every row.
data.table
also provides a keyword.N
that can be useful in a number of ways. One such way is to use it in selecting rows as shown in the next code chunk.
## x y z
## 1: I a 0.801076986
## 2: I a 0.886827043
## 3: I b -1.216112121
## 4: II b -1.494334054
## 5: II c 0.467194196
## 6: II c -0.530958683
## 7: III d -0.891427670
## 8: III d 0.787302046
## 9: III e 0.006745659
## 10: IV e -1.836147935
## 11: IV f 0.052353254
## 12: IV f -1.018931991
## x y z
## 1: I a 0.801076986
## 2: I a 0.886827043
## 3: I b -1.216112121
## 4: II b -1.494334054
## 5: II c 0.467194196
## 6: II c -0.530958683
## 7: III d -0.891427670
## 8: III d 0.787302046
## 9: III e 0.006745659
## 10: IV e -1.836147935
## 11: IV f 0.052353254
## 12: IV f -1.018931991
## x y z
## 1: I a 0.801076986
## 2: I a 0.886827043
## 3: I b -1.216112121
## 4: II b -1.494334054
## 5: II c 0.467194196
## 6: II c -0.530958683
## 7: III d -0.891427670
## 8: III d 0.787302046
## 9: III e 0.006745659
## 10: IV e -1.836147935
## 11: IV f 0.052353254
## 12: IV f -1.018931991
## x y z
## 1: I a 0.886827043
## 2: I b -1.216112121
## 3: II b -1.494334054
## 4: II c 0.467194196
## 5: II c -0.530958683
## 6: III d -0.891427670
## 7: III d 0.787302046
## 8: III e 0.006745659
## 9: IV e -1.836147935
## 10: IV f 0.052353254
## 11: IV f -1.018931991
3.3 Select columns with j
To operate on specific columns, set
j
equal to a list of the names of the columns you want.Inside the square brackets of a
data.table
,list
can be abbreviated with a.
, which can also look strage at first. Once you get used to it, you will appreciate the brevity.
3.4 Select and operate on columns with j
- You can do a lot more than just return the columns
specified in
j
. In fact, you can perform any operation you want on them.
## y z
## 1: a 0.801076986
## 2: a 0.886827043
## 3: b -1.216112121
## 4: b -1.494334054
## 5: c 0.467194196
## 6: c -0.530958683
## 7: d -0.891427670
## 8: d 0.787302046
## 9: e 0.006745659
## 10: e -1.836147935
## 11: f 0.052353254
## 12: f -1.018931991
## y z
## 1: a 0.801076986
## 2: a 0.886827043
## 3: b -1.216112121
## 4: b -1.494334054
## 5: c 0.467194196
## 6: c -0.530958683
## 7: d -0.891427670
## 8: d 0.787302046
## 9: e 0.006745659
## 10: e -1.836147935
## 11: f 0.052353254
## 12: f -1.018931991
## [1] 0.801077 0.886827 -1.216112 -1.494334
## z
## 1: 0.801077
## 2: 0.886827
## 3: -1.216112
## 4: -1.494334
# select rows 1:4 and return the z column of the result as a
# data.table, but use `.` to abbreviate `list`. This is a
# thing specific to data.table.
dt[1:4, .(z)]
## z
## 1: 0.801077
## 2: 0.886827
## 3: -1.216112
## 4: -1.494334
# select and operate on columns
# select rows 1:4 and return a summary statistic of the
# resulting z column
dt[1:4, mean(z)]
## [1] -0.2556355
## [1] 1.275245
## V1 V2
## 1: -0.2556355 1.275245
## z_mean z_sd
## 1: -0.2556355 1.275245
# select all rows and compute summary statistics of the
# resulting data.table
dt[, .(mean(z), sd(z))] # default column names
## V1 V2
## 1: -0.3322011 0.9602548
## z_mean z_sd
## 1: -0.3322011 0.9602548
3.5 Combine i
and j
- It is straightforward to combine
i
andj
.
## y z
## 1: b -1.4943341
## 2: c 0.4671942
## 3: c -0.5309587
3.6 Group using by
The real magic of
data.table
comes in the power of theby
argument.by
allows you apply the operation that you specify withj
to separate groups of rows defined by the unique values in the columns specified inby
.Put another way, whatever column you pass to
by
will be split up into groups with each unique value getting its own group. Those groups will then be applied to the columns you specify inj
and the operation also specified inj
will be applied only to rows from the same group. Then, after all is done, everything is put back into a singledata.table
.
## [1] 0.157264
## [1] -0.5193662
## [1] -0.03245999
## [1] -0.9342422
## x V1
## 1: I 0.15726397
## 2: II -0.51936618
## 3: III -0.03245999
## 4: IV -0.93424222
# the efficient and beautiful way of doing things extends to
# grouping by multiple variables with ease
dt[, mean(z), .(x, y)]
## x y V1
## 1: I a 0.843952015
## 2: I b -1.216112121
## 3: II b -1.494334054
## 4: II c -0.031882243
## 5: III d -0.052062812
## 6: III e 0.006745659
## 7: IV e -1.836147935
## 8: IV f -0.483289369
# return more than one summary statistic grouped by x
dt[, .(mean(z), sd(z)), .(x)] # default naming of resulting columns
## x V1 V2
## 1: I 0.15726397 1.1901511
## 2: II -0.51936618 0.9808155
## 3: III -0.03245999 0.8400513
## 4: IV -0.93424222 0.9470947
## x z_mean z_sd
## 1: I 0.15726397 1.1901511
## 2: II -0.51936618 0.9808155
## 3: III -0.03245999 0.8400513
## 4: IV -0.93424222 0.9470947
3.7 Adding and modifying columns
- Use the
:=
operator inside thej
argument of adata.table
to add or modify a column.
# if you pass a single number, data.table will fill the
# entire column with that value
dt[, a := 9]
# otherwise, just pass a vector of the same length as the
# data.table
dt[, b := seq(2, 24, 2)]
# notice that you get an error if you try to create a column
# using a vector that isn't exactly the correct length
dt[, b := seq(2, 26, 2)]
## Error in `[.data.table`(dt, , `:=`(b, seq(2, 26, 2))): Supplied 13 items to be assigned to 12 items of column 'b'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
# redefine or modify an existing column the same way
dt[, b := seq(4, 48, 4)]
# remove a column using := and the NULL keyword
dt[, b := NULL]
# don't use the `<-` operator and the `:=`operator at the
# same time. It works but is inefficient
dt <- dt[, c := 9] # don't do this
# it all works the same with non-numerical columns
dt[, betsy := 'mouse']
dt[, betsy := c('mouse', 'rat')] # notice the error
## Error in `[.data.table`(dt, , `:=`(betsy, c("mouse", "rat"))): Supplied 2 items to be assigned to 12 items of column 'betsy'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
3.8 Be careful when copying data.table
objects
dt2 = dt
creates a shallow copy. This means that there is really only onedata.table
object in memory, but it can be referred to by both names.This means that changes to
dt2
will be reflected indt
and vice-versa.You can show that this is the case by modifying dt2 and observing that dt also changes (see code chunks below).
To create a deep copy use
dt2 = data.table(dt)
.
## x y z a c betsy
## 1: I a 0.801076986 9 9 mouse
## 2: I a 0.886827043 9 9 rat
## 3: I b -1.216112121 9 9 mouse
## 4: II b -1.494334054 9 9 rat
## 5: II c 0.467194196 9 9 mouse
## 6: II c -0.530958683 9 9 rat
## 7: III d -0.891427670 9 9 mouse
## 8: III d 0.787302046 9 9 rat
## 9: III e 0.006745659 9 9 mouse
## 10: IV e -1.836147935 9 9 rat
## 11: IV f 0.052353254 9 9 mouse
## 12: IV f -1.018931991 9 9 rat
# this is a shallow copy
dt_copy = dt
dt_copy[, a := NULL]
# notice we also deleted the `a` column from dt, not just from
# dt_copy
dt
## x y z c betsy
## 1: I a 0.801076986 9 mouse
## 2: I a 0.886827043 9 rat
## 3: I b -1.216112121 9 mouse
## 4: II b -1.494334054 9 rat
## 5: II c 0.467194196 9 mouse
## 6: II c -0.530958683 9 rat
## 7: III d -0.891427670 9 mouse
## 8: III d 0.787302046 9 rat
## 9: III e 0.006745659 9 mouse
## 10: IV e -1.836147935 9 rat
## 11: IV f 0.052353254 9 mouse
## 12: IV f -1.018931991 9 rat
# this is a deep copy
dt_copy = data.table(dt)
dt_copy[, x := NULL]
# now our changes to dt_copy didn't also change dt
dt
## x y z c betsy
## 1: I a 0.801076986 9 mouse
## 2: I a 0.886827043 9 rat
## 3: I b -1.216112121 9 mouse
## 4: II b -1.494334054 9 rat
## 5: II c 0.467194196 9 mouse
## 6: II c -0.530958683 9 rat
## 7: III d -0.891427670 9 mouse
## 8: III d 0.787302046 9 rat
## 9: III e 0.006745659 9 mouse
## 10: IV e -1.836147935 9 rat
## 11: IV f 0.052353254 9 mouse
## 12: IV f -1.018931991 9 rat
3.9 Getting data into a data.table
The first and primary method that will use to get data into a
data.table
is thefread
function, which is part of thedata.table
library.The following code chunk reads a .csv file into
data.table
.
# Location of the csv file
# can be on the internet or on your local machine
f <- 'https://crossley.github.io/book_stats/data/maze/maze.csv'
# use the fread function
d <-fread(f)
d
## rat maze run time
## 1: 1 1 1 182.5532
## 2: 1 1 2 180.5880
## 3: 1 1 3 161.5351
## 4: 1 1 4 198.2432
## 5: 1 1 5 155.6673
## ---
## 6012: 47 8 12 323.0179
## 6013: 47 8 13 205.0180
## 6014: 47 8 14 228.6531
## 6015: 47 8 15 234.1131
## 6016: 47 8 16 305.8473
- A second method that we will sometimes use to get data into
a
data.table
is as follows:
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA"
DT <- fread(s1)
DT
## family_id age_mother dob_child1 dob_child2 dob_child3
## 1: 1 30 1998-11-26 2000-01-29 <NA>
## 2: 2 27 1996-06-22 <NA> <NA>
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21
## 5: 5 29 2000-12-05 2005-02-28 <NA>
- This can be useful in a few odd circumstances. For
example, you may wish to copy some text in another program
and paste it over into your
R
session to turn it into adata.table
.
3.10 Wide vs long format
- The following
data.table
is in long format:
## run rat maze_time
## 1: 1 1 88.417468
## 2: 1 2 18.503129
## 3: 1 3 124.226348
## 4: 1 4 -42.509839
## 5: 1 5 136.594112
## 6: 2 1 124.841265
## 7: 2 2 106.528818
## 8: 2 3 101.915639
## 9: 2 4 125.733838
## 10: 2 5 35.098992
## 11: 3 1 88.083124
## 12: 3 2 166.413570
## 13: 3 3 210.096910
## 14: 3 4 114.377148
## 15: 3 5 88.224640
## 16: 4 1 8.793163
## 17: 4 2 -43.758624
## 18: 4 3 20.291047
## 19: 4 4 225.408311
## 20: 4 5 177.214219
## 21: 5 1 78.048437
## 22: 5 2 57.518972
## 23: 5 3 58.101990
## 24: 5 4 199.698686
## 25: 5 5 72.422197
## run rat maze_time
- The following
data.table
is in wide format:
## maze_time_rat_1 maze_time_rat_2 maze_time_rat_3 maze_time_rat_4
## 1: 223.83041 16.795670 183.20471 20.46609
## 2: 72.06537 -16.657055 77.26713 94.51225
## 3: 275.79031 -6.559058 126.61374 125.01413
## 4: 156.07461 -56.378205 62.32973 161.82433
## 5: 54.72160 215.653700 344.13646 82.73765
## maze_time_rat_5 run
## 1: -122.390027 1
## 2: -26.361438 2
## 3: 135.872890 3
## 4: 98.895452 4
## 5: 5.935084 5
Each format has its pros and cons, and you may find yourself needing to switch between them even within the same dataset.
To go from wide to long format you should use the
melt
function from thedata.table
library.To go from long to wide format you should use the
dcast
function from thedata.table
library.Using these functions is just a bit beyond the scope of this lecture, but we will return later in the unit to learn about them. If you are feeling confident in your use of
data.table
then you may enjoy getting a little ahead of the game by reading the following vignette: