data.table
practice exercises# load libraries
library(data.table)
library(ggplot2)
# clean work space
rm(list = ls())
# init colorscheme
COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
theme_minimal(base_size = 13) +
theme(
panel.grid.minor = element_blank(),
strip.text = element_text(face = "bold"),
legend.position = "bottom"
)
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))
Get lots of hands-on practice with
data.table.
You should pay very careful attention to learn about:
data.table chaining:=i, operating on columns with
j, and grouping with by using
data.table syntax [i, j, by]# Please include the following line of code as well
set.seed(0)
# create vectors from which to build d
V1 <- rep(month.abb[1:8], each=2)
V2 <- rep(letters[1:8], 2)
V3 <- rnorm(length(V1))
# create a data.table named d
d <- data.table(month=V1, event=V2, obs=V3)
Recall from lecture and tutorial that when working with
data.frame objects, columns can be accessed using the
$ operator. You may also have pushed forward on your own
and found various tutorials outlining how to use dplyr and
other tools in the tidyverse to select rows, perform
operations on columns, and group operations by other variables. This
will have lead you to lines of code using the %>%
operator among others. dplyr and the tidyverse
is awesome, but it isn’t what we are using in this unit. Unless
explicitly instructed otherwise, please use data.table
syntax (i.e., DT[i, j, by]) to answer all questions
below.
a. Use the data.table keyword
.N to report how many rows are in d and store
the result in a variable named ans_1a.
b. Return a data.table containing only
the rows of d for which month=="Jan" and store
the result in a variable named ans_1b.
c. Return a data.table containing only
the rows of d for which month=="Feb" or
event=="c" and store the result in a variable named
ans_1c.
d. Return a data.table containing only
the rows of d for which month=="Feb" and
event=="c" and store the result in a variable named
ans_1d.
e. Using the %in% operator, return a
data.table containing only the rows of d for
which month is in the vector
c("Jan", "Mar", "May", "Jul"), and store the result in a
variable named ans_1e.
a. Return a vector containing the
month column of d and store the result in a
variable named ans_2a.
b. Return a data.table containing only
the month column of d and store the result in
a variable named ans_2b.
c. Return a data.table containing only
the month and event columns of d
and store the result in a variable named ans_2c.
d. Using only one line of code, return a
data.table containing only the month and
event columns of d, but renamed to
MONTH and EVENT, and store the result in a
variable named ans_2d.
e. Return the mean obs across all rows
in d and store your result in a variable named
ans_2e.
f. Return the mean and standard deviation of the
obs column across all rows in d and store the
result in a data.table named ans_2f. The
resulting data.table should have 1 observation
with columns named obs_mean and obs_sd.
a. Return the mean obs grouped by
event and return the result in a data.table
named ans_3a. Name the column of ans_3a that
contains the mean of the obs column
obs_mean.
b. Return the mean obs grouped by
event and grouped by month and return the
result in a data.table named ans_3b. Name the
column of ans_3b that contains the mean of the
obs column obs_mean.
a. Select the first 8 rows of d and
compute the median value of the obs column grouped by
month and event and return the result in a
data.table named ans_4a. Name the column of
ans_4a that contains the median of the obs
column obs_median.
a. Create a variable named d5a using
d5a = data.table(d). Create a deep copy of d5a
named d5a_copy and use the := operator to add
a column of zeros to d5a_copy. Name the column of zeros
new_zeros. Store the resulting data.table in a
variable named ans_5a. Be very careful that the code you
write and execute here does not modify d5a.
b. Create a variable named d5b using
d5b = data.table(d). Create a deep copy of d5b
named d5b_copy and use the := operator to
remove the obs column from d5b_copy. Store the
resulting data.table in a variable named
ans_5b. Be very careful that the code you write and execute
here does not modify d5b.
c. Consider the following methods of computing the
mean value of the obs column in d5c – see
below – grouped by month.
d5c <- data.table(d)
# option 1
d5c_2 <- d5c[, .(obs_mean = mean(obs)), .(month)]
# option 2
d5c[, obs_mean := mean(obs), .(month)]
# option 3
d5c <- d5c[, obs_mean := mean(obs), .(month)]
Which option is the least efficient method of using
data.table ? Please store your answer as one of the
following:
ans_5c <- 'option 1'ans_5c <- 'option 2'ans_5c <- 'option 3'd. Consider the following code chunk.
V1 <- c(1, 2, 3)
V2 <- c('a', 'b', 'c')
dd <- data.table(V1, V2)
# option 1
dd2 <- dd
dd2[, V3 := c(T, T, F)]
# option 2
dd2 <- data.table(dd)
dd2[, V3 := c(T, T, F)]
# option 3
dd2 = copy(dd)
dd2[, V3 := c(T, T, F)]
Which option modifies dd? Please store your answer as
one of the following:
ans_5d <- 'option 1'ans_5d <- 'option 2'ans_5d <- 'option 3'data.tableUse the fread function to read the csv file located
at data/maze_2.csv into a data.table.
Consider the following character variable:
s2 <- "ind2 obs
Apr -0.15579551
Apr -1.47075238
May -0.47815006
May 0.41794156
Jun 1.35867955
Jun -0.10278773
Jul 0.38767161
Jul -0.05380504
Aug -1.37705956"
s2 into a data.table named
DT.data.table in wide or long
format?## participant condition maze_time
## <int> <char> <num>
## 1: 1 mesh 3.4
## 2: 1 wood 3.8
## 3: 2 mesh 3.1
## 4: 2 wood 3.6
## 5: 3 mesh 3.5
## 6: 3 wood 3.9
data.table in wide or long
format?## participant mesh wood
## <int> <num> <num>
## 1: 1 3.4 3.8
## 2: 2 3.1 3.6
## 3: 3 3.5 3.9
What function from the data.table library would you
use to convert from wide to long?
What function from the data.table library would you
use to convert from long to wide?
Consider an experiment designed to find out whether or not reward contingencies influence learning in a simple spatial decision task. In this experiment, rats repeatedly run a T-maze and must learn which arm is more likely to produce a reward.
On each trial the rat starts at the base of the maze and chooses to go left or right at the T-junction. One arm is more likely to produce a reward than the other. Over time, rats learn which choice is better. Learning in this context means gradually adjusting behaviour so that rewarded choices are made more often.
Next, you will perform a cursory analysis on data from this experiment.
For this exercise, use the local file in the tutorial folder. If your working directory is set correctly, you do not need anything beyond the downloaded tutorial folder.
a. Use the fread function in the
data.table library to read the file at
data/experiment_1_summary.csv
into a data.table and store the result in a variable
named ans_8a.
The resulting data.table contains the following
columns:
experiment: Experiment identifier.
rat_id: Subject identifier (i.e., different numbers
correspond to different rats).
trial: Trial number within the experiment.
cage_context: Context in which the maze was run
(e.g., mesh or wood).
scent: Scent present in the testing
environment.
choice: The rat’s choice at the T-junction
(L or R).
reward: Outcome of the trial (1 =
reward, 0 = no reward).
reaction_time: Time taken to initiate the
choice.
maze_run_time: Time taken to run from the start of
the maze to the goal location.
b. Select only rows for which
cage_context=="mesh" or for which
cage_context=="wood" and store the result in a new
data.table named ans_8b.
c. From ans_8b select only the columns
rat_id, cage_context, and
maze_run_time, and store the result in a new
data.table named ans_8c.
d.
Do not work on this problem until you have finished earlier parts of the worksheet. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.
Create a new data.table that contains the
mean values of the maze_run_time column
grouped by rat_id and cage_context. Name the
column of mean values maze_run_time_mean. Make sure that
your resulting data.table has only rat_id,
cage_context, and maze_run_time_mean columns
and nothing more (no columns named V1 etc). Also make sure
that your resulting data.table only has one observation per
rat_id and cage_context. Store the result in a
variable named ans_8d.
Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.
e. Do not work on this problem until you have finished earlier parts of the worksheet. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.
Open the file experiment_1_summary.csv from the course
data page in Excel or any of your favourite spreadsheet programs. Using
this spreadsheet program, perform all of the operations that lead to
ans_8d.
Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.
This comparison should make one point clear: for many people,
R and data.table quickly become the more
efficient option. Even if the first attempt takes longer, the workflow
usually becomes faster, more flexible, and more powerful as skill
develops.
On the other hand, you might be a super spreadsheet master like these people: https://stackoverflow.com/questions/37314049/how-to-calculate-the-means-of-100s-of-subgroups-in-excel
Even then, this unit uses R, so the goal here is to
build confidence with that workflow.