Get lots of hands-on practice with
data.table.
You should pay very careful attention to learn about:
data.table chaining:=i, operating on columns with
j, and grouping with by using
data.table syntax [i, j, by]It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here
Use the following code chunk to initialise your code.
# load data.table library
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.3
# clean session
rm(list=ls())
# Please include the following line of code as well
set.seed(0)
# create vectors from which to build d
V1 <- rep(month.abb[1:8], each=2)
V2 <- rep(letters[1:8], 2)
V3 <- rnorm(length(V1))
# create a data.table named d
d <- data.table(month=V1, event=V2, obs=V3)
Recall from lecture and tutorial that when working with
data.frame objects, columns can be accessed using the
$ operator. You may also have pushed forward on your own
and found various tutorials outlining how to use dplyr and
other tools in the tidyverse to select rows, perform
operations on columns, and group operations by other variables. This
will have lead you to lines of code using the %>%
operator among others. dplyr and the tidyverse
is awesome, but it isn’t what we are using in this unit. Unless
explicitly instructed otherwise, please use data.table
syntax (i.e., DT[i, j, by]) to answer all questions
below.
a. Use the data.table keyword
.N to report how many rows are in d and store
the result in a variable named ans_1a.
b. Return a data.table containing only
the rows of d for which month=="Jan" and store
the result in a variable named ans_1b.
c. Return a data.table containing only
the rows of d for which month=="Feb" or
event==c and store the result in a variable named
ans_1c.
d. Return a data.table containing only
the rows of d for which month=="Feb" and
event=="c" and store the result in a variable named
ans_1d.
e. Using the %in% operator, return a
data.table containing only the rows of d for
which month is in the vector
c("Jan", "Mar", "May", "Jul"), and store the result in a
variable named ans_1e.
a. Return a vector containing the
month column of d and store the result in a
variable named ans_2a.
b. Return a data.table containing only
the month column of d and store the result in
a variable named ans_2b.
c. Return a data.table containing only
the month and event columns of d
and store the result in a variable named ans_2c.
d. Using only one line of code, return a
data.table containing only the month and
event columns of d, but renamed to
MONTH and EVENT, and store the result in a
variable named ans_2d.
e. Return the mean obs across all rows
in d and store your result in a variable named
ans_2e.
f. Return the mean and standard deviation of the
obs column across all rows in d and store the
result in a data.table named ans_2f. The
resulting data.table should have 1 observation
with columns named obs_mean and obs_sd.
a. Return the mean obs grouped by
event and return the result in a data.table
named ans_3a. Name the column of ans_3a that
contains the mean of the obs column
obs_mean.
b. Return the mean obs grouped by
event and grouped by month and return the
result in a data.table named ans_3b. Name the
column of ans_3b that contains the mean of the
obs column obs_mean.
a. Select the first 8 rows of d and
compute the median value of the obs column grouped by
month and event and return the result in a
data.table named ans_4a. Name the column of
ans_4a that contains the median of the obs
column obs_median.
a. Create a variable named d5a using
d5a = data.table(d). Create a deep copy of d5a
named d5a_copy and use the := operator to add
a column of zeros to d5a_copy. Name the column of zeros
new_zeros. Store the resulting data.table in a
variable named ans_5a. Be very careful that the code you
write and execute here does not modify d5a or you will be
marked incorrect.
b. Create a variable named d5b using
d5b = data.table(d). Create a deep copy of d5b
named d5b_copy and use the := operator to
remove the obs column from d5b_copy. Store the
resulting data.table in a variable named
ans_5b. Be very careful that the code you write and execute
here does not modify d5b or you will be marked
incorrect.
c. Consider the following methods of computing the
mean value of the obs column in d5c – see
below – grouped by month.
d5c <- data.table(d)
# option 1
d5c_2 <- d5c[, .(obs_mean = mean(obs)), .(month)]
# option 2
d5c[, obs_mean := mean(obs), .(month)]
# option 3
d5c <- d5c[, obs_mean := mean(obs), .(month)]
Which option is the least efficient method of using
data.table ? Please store your answer as one of the
following:
ans_5c <- 'option 1'ans_5c <- 'option 2'ans_5c <- 'option 3'd. Consider the following code chunk.
V1 <- c(1, 2, 3)
V2 <- c('a', 'b', 'c')
dd <- data.table(V1, V2)
# option 1
dd2 <- dd
dd2[, V3 := c(T, T, F)]
# option 2
dd2 <- data.table(dd)
dd2[, V3 := c(T, T, F)]
# option 3
dd2 = copy(dd)
dd2[, V3 := c(T, T, F)]
Which option modifies dd? Please store your answer as
one of the following:
ans_5d <- 'option 1'ans_5d <- 'option 2'ans_5d <- 'option 3'data.tableUse the fread function to read the csv file located
at the following address into a data.table:
https://crossley.github.io/book_stats/data/maze_2/maze_2.csv
Consider the following character variable:
s2 <- "ind2 obs
Apr -0.15579551
Apr -1.47075238
May -0.47815006
May 0.41794156
Jun 1.35867955
Jun -0.10278773
Jul 0.38767161
Jul -0.05380504
Aug -1.37705956"
s2 into a data.table named
DT.data.table in wide or long
format?## run rat maze_time
## <int> <int> <num>
## 1: 1 1 88.417468
## 2: 1 2 18.503129
## 3: 1 3 124.226348
## 4: 1 4 -42.509839
## 5: 1 5 136.594112
## 6: 2 1 124.841265
## 7: 2 2 106.528818
## 8: 2 3 101.915639
## 9: 2 4 125.733838
## 10: 2 5 35.098992
## 11: 3 1 88.083124
## 12: 3 2 166.413570
## 13: 3 3 210.096910
## 14: 3 4 114.377148
## 15: 3 5 88.224640
## 16: 4 1 8.793163
## 17: 4 2 -43.758624
## 18: 4 3 20.291047
## 19: 4 4 225.408311
## 20: 4 5 177.214219
## 21: 5 1 78.048437
## 22: 5 2 57.518972
## 23: 5 3 58.101990
## 24: 5 4 199.698686
## 25: 5 5 72.422197
## run rat maze_time
data.table in wide or long
format?## maze_time_rat_1 maze_time_rat_2 maze_time_rat_3 maze_time_rat_4 maze_time_rat_5 run
## <num> <num> <num> <num> <num> <int>
## 1: 223.83041 16.795670 183.20471 20.46609 -122.390027 1
## 2: 72.06537 -16.657055 77.26713 94.51225 -26.361438 2
## 3: 275.79031 -6.559058 126.61374 125.01413 135.872890 3
## 4: 156.07461 -56.378205 62.32973 161.82433 98.895452 4
## 5: 54.72160 215.653700 344.13646 82.73765 5.935084 5
What function from the data.table library would you
use to convert from wide to long?
What function from the data.table library would you
use to convert from long to wide?
Consider an experiment designed to find out whether or not feedback-delay impairs criterion learning? What is criterion learning? Here is an example trial:
In this case, thin bars belong to category A, and thick bars belong to category B. But where exactly is the category boundary (i.e., criterion) that separates thick from thin? Criterion learning is the process that allows our brains to figure this out. Next, you will perform a cursory analysis on the data from this experiment.
a. Use the fread function in the
data.table library to read
https://crossley.github.io/book_stats/data/criterion_learning/crit_learn.csv
into a data.table and store the result in a variable named
ans_6a.
The resulting data.table contains the following
columns:
t: Trial number across the entire
experiment.
t_prob: Trial number per problem.
sub: Subject identifier (i.e., different numbers
correspond to different subjects).
cnd: Condition identifier.
prob_num: Problem number (gets increased by 1 every
time a participant solves a problem).
t2c: Trials to criterion (i.e., the number of trials
it took a participant to solve a particular problem.)
nps: Number of problems solved. This is the same as
max(prob_num)
b. Select only rows for which
cnd=="Delay" or for which cnd=="Long ITI" and
store the result in a new data.table named
ans_8b.
c. From ans_8b select only the columns
cnd, sub and t2c columns and
store the result in a new data.table named
ans_8c.
d. Do not work on this problem until you have finished earlier parts of the homework. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.
Create a new data.table that contains the
mean values of the t2c column grouped by
cnd and sub. Name the column of mean values
t2c_mean. Make sure that your resulting
data.table has only cnd, sub and
t2c_mean columns and nothing more (no columns named
V1 etc). Also make sure that your resulting
data.table only has one observation per sub.
Failure to do so will lead to this question being marked incorrect.
Store the result in a variable named ans_8d.
Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.
e. Do not work on this problem until you have finished earlier parts of the worksheet. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.
Download the .csv file at
https://crossley.github.io/cogs2020/data/criterion_learning/crit_learn.csv
and open it in Excel or any of your favourite spreadsheet programs.
Using this spreadsheet program, perform all of the operations that lead
to ans_8d.
Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.
I’m sure you see where I’m going with this. For many of us,
R and data.table will be the clear winners.
Even if it took longer, it will be clear that – once your skill develops
– it will be quicker and certainly more flexible and powerful in other
ways.
On the other hand, you might be a super spreadsheet master like these
people:
https://stackoverflow.com/questions/37314049/how-to-calculate-the-means-of-100s-of-subgroups-in-excel
If so, more power to you, but you will nevertheless need to learn
R in this unit. :)