Get lots of hands-on practice with
data.table
.
You should pay very careful attention to learn about:
data.table
chaining:=
i
, operating on columns with
j
, and grouping with by
using
data.table
syntax [i, j, by]
It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here
Use the following code chunk to initialise your code.
# load data.table library
library(data.table)
# clean session
rm(list=ls())
# Please include the following line of code as well
set.seed(0)
# create vectors from which to build d
V1 <- rep(month.abb[1:8], each=2)
V2 <- rep(letters[1:8], 2)
V3 <- rnorm(length(V1))
# create a data.table named d
d <- data.table(month=V1, event=V2, obs=V3)
Recall from lecture and tutorial that when working with
data.frame
objects, columns can be accessed using the
$
operator. You may also have pushed forward on your own
and found various tutorials outlining how to use dplyr
and
other tools in the tidyverse
to select rows, perform
operations on columns, and group operations by other variables. This
will have lead you to lines of code using the %>%
operator among others. dplyr
and the tidyverse
is awesome, but it isn’t what we are using in this unit. Unless
explicitly instructed otherwise, please use data.table
syntax (i.e., DT[i, j, by]
) to answer all questions
below.
a. Use the data.table
keyword
.N
to report how many rows are in d
and store
the result in a variable named ans_1a
.
b. Return a data.table
containing only
the rows of d
for which month=="Jan"
and store
the result in a variable named ans_1b
.
c. Return a data.table
containing only
the rows of d
for which month=="Feb"
or
event==c
and store the result in a variable named
ans_1c
.
d. Return a data.table
containing only
the rows of d
for which month=="Feb"
and
event=="c"
and store the result in a variable named
ans_1d
.
e. Using the %in%
operator, return a
data.table
containing only the rows of d
for
which month
is in the vector
c("Jan", "Mar", "May", "Jul")
, and store the result in a
variable named ans_1e
.
a. Return a vector
containing the
month
column of d
and store the result in a
variable named ans_2a
.
b. Return a data.table
containing only
the month
column of d
and store the result in
a variable named ans_2b
.
c. Return a data.table
containing only
the month
and event
columns of d
and store the result in a variable named ans_2c
.
d. Using only one line of code, return a
data.table
containing only the month
and
event
columns of d
, but renamed to
MONTH
and EVENT
, and store the result in a
variable named ans_2d
.
e. Return the mean obs
across all rows
in d
and store your result in a variable named
ans_2e
.
f. Return the mean and standard deviation of the
obs
column across all rows in d
and store the
result in a data.table
named ans_2f
. The
resulting data.table
should have 1
observation
with columns named obs_mean
and obs_sd
.
a. Return the mean obs
grouped by
event
and return the result in a data.table
named ans_3a
. Name the column of ans_3a
that
contains the mean of the obs
column
obs_mean
.
b. Return the mean obs
grouped by
event
and grouped by month
and return the
result in a data.table
named ans_3b
. Name the
column of ans_3b
that contains the mean of the
obs
column obs_mean
.
a. Select the first 8 rows of d
and
compute the median value of the obs
column grouped by
month
and event
and return the result in a
data.table
named ans_4a
. Name the column of
ans_4a
that contains the median of the obs
column obs_median
.
a. Create a variable named d5a
using
d5a = data.table(d)
. Create a deep copy of d5a
named d5a_copy
and use the :=
operator to add
a column of zeros to d5a_copy
. Name the column of zeros
new_zeros
. Store the resulting data.table
in a
variable named ans_5a
. Be very careful that the code you
write and execute here does not modify d5a
or you will be
marked incorrect.
b. Create a variable named d5b
using
d5b = data.table(d)
. Create a deep copy of d5b
named d5b_copy
and use the :=
operator to
remove the obs
column from d5b_copy
. Store the
resulting data.table
in a variable named
ans_5b
. Be very careful that the code you write and execute
here does not modify d5b
or you will be marked
incorrect.
c. Consider the following methods of computing the
mean value of the obs
column in d5c
– see
below – grouped by month
.
d5c <- data.table(d)
# option 1
d5c_2 <- d5c[, .(obs_mean = mean(obs)), .(month)]
# option 2
d5c[, obs_mean := mean(obs), .(month)]
# option 3
d5c <- d5c[, obs_mean := mean(obs), .(month)]
Which option is the least efficient method of using
data.table
? Please store your answer as one of the
following:
ans_5c <- 'option 1'
ans_5c <- 'option 2'
ans_5c <- 'option 3'
d. Consider the following code chunk.
V1 <- c(1, 2, 3)
V2 <- c('a', 'b', 'c')
dd <- data.table(V1, V2)
# option 1
dd2 <- dd
dd2[, V3 := c(T, T, F)]
# option 2
dd2 <- data.table(dd)
dd2[, V3 := c(T, T, F)]
# option 3
dd2 = copy(dd)
dd2[, V3 := c(T, T, F)]
Which option modifies dd
? Please store your answer as
one of the following:
ans_5d <- 'option 1'
ans_5d <- 'option 2'
ans_5d <- 'option 3'
data.table
Use the fread
function to read the csv file located
at the following address into a data.table
:
https://crossley.github.io/book_stats/data/maze_2/maze_2.csv
Consider the following character
variable:
s2 <- "ind2 obs
Apr -0.15579551
Apr -1.47075238
May -0.47815006
May 0.41794156
Jun 1.35867955
Jun -0.10278773
Jul 0.38767161
Jul -0.05380504
Aug -1.37705956"
s2
into a data.table
named
DT
.data.table
in wide or long
format?## run rat maze_time
## <int> <int> <num>
## 1: 1 1 88.417468
## 2: 1 2 18.503129
## 3: 1 3 124.226348
## 4: 1 4 -42.509839
## 5: 1 5 136.594112
## 6: 2 1 124.841265
## 7: 2 2 106.528818
## 8: 2 3 101.915639
## 9: 2 4 125.733838
## 10: 2 5 35.098992
## 11: 3 1 88.083124
## 12: 3 2 166.413570
## 13: 3 3 210.096910
## 14: 3 4 114.377148
## 15: 3 5 88.224640
## 16: 4 1 8.793163
## 17: 4 2 -43.758624
## 18: 4 3 20.291047
## 19: 4 4 225.408311
## 20: 4 5 177.214219
## 21: 5 1 78.048437
## 22: 5 2 57.518972
## 23: 5 3 58.101990
## 24: 5 4 199.698686
## 25: 5 5 72.422197
## run rat maze_time
data.table
in wide or long
format?## maze_time_rat_1 maze_time_rat_2 maze_time_rat_3 maze_time_rat_4 maze_time_rat_5 run
## <num> <num> <num> <num> <num> <int>
## 1: 223.83041 16.795670 183.20471 20.46609 -122.390027 1
## 2: 72.06537 -16.657055 77.26713 94.51225 -26.361438 2
## 3: 275.79031 -6.559058 126.61374 125.01413 135.872890 3
## 4: 156.07461 -56.378205 62.32973 161.82433 98.895452 4
## 5: 54.72160 215.653700 344.13646 82.73765 5.935084 5
What function from the data.table
library would you
use to convert from wide to long?
What function from the data.table
library would you
use to convert from long to wide?
Consider an experiment designed to find out whether or not feedback-delay impairs criterion learning? What is criterion learning? Here is an example trial:
In this case, thin bars belong to category A, and thick bars belong to category B. But where exactly is the category boundary (i.e., criterion) that separates thick from thin? Criterion learning is the process that allows our brains to figure this out. Next, you will perform a cursory analysis on the data from this experiment.
a. Use the fread
function in the
data.table
library to read
https://crossley.github.io/book_stats/data/criterion_learning/crit_learn.csv
into a data.table
and store the result in a variable named
ans_6a
.
The resulting data.table
contains the following
columns:
t
: Trial number across the entire
experiment.
t_prob
: Trial number per problem.
sub
: Subject identifier (i.e., different numbers
correspond to different subjects).
cnd
: Condition identifier.
prob_num
: Problem number (gets increased by 1 every
time a participant solves a problem).
t2c
: Trials to criterion (i.e., the number of trials
it took a participant to solve a particular problem.)
nps
: Number of problems solved. This is the same as
max(prob_num)
b. Select only rows for which
cnd=="Delay"
or for which cnd=="Long ITI"
and
store the result in a new data.table
named
ans_8b
.
c. From ans_8b
select only the columns
cnd
, sub
and t2c
columns and
store the result in a new data.table
named
ans_8c
.
d. Do not work on this problem until you have finished earlier parts of the homework. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.
Create a new data.table
that contains the
mean
values of the t2c
column grouped by
cnd
and sub
. Name the column of mean values
t2c_mean
. Make sure that your resulting
data.table
has only cnd
, sub
and
t2c_mean
columns and nothing more (no columns named
V1
etc). Also make sure that your resulting
data.table
only has one observation per sub
.
Failure to do so will lead to this question being marked incorrect.
Store the result in a variable named ans_8d
.
Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.
e. Do not work on this problem until you have finished earlier parts of the worksheet. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.
Download the .csv
file at
https://crossley.github.io/cogs2020/data/criterion_learning/crit_learn.csv
and open it in Excel or any of your favourite spreadsheet programs.
Using this spreadsheet program, perform all of the operations that lead
to ans_8d
.
Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.
I’m sure you see where I’m going with this. For many of us,
R
and data.table
will be the clear winners.
Even if it took longer, it will be clear that – once your skill develops
– it will be quicker and certainly more flexible and powerful in other
ways.
On the other hand, you might be a super spreadsheet master like these
people:
https://stackoverflow.com/questions/37314049/how-to-calculate-the-means-of-100s-of-subgroups-in-excel
If so, more power to you, but you will nevertheless need to learn
R
in this unit. :)