# load libraries

library(data.table)
library(ggplot2)

# clean work space

rm(list = ls())

# init colorscheme

COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))

Learning objectives

  • Get lots of hands-on practice with data.table.

  • You should pay very careful attention to learn about:

    • data.table chaining
    • Adding columns using :=
    • Selecting rows with i, operating on columns with j, and grouping with by using data.table syntax [i, j, by]

Work through these practice exercises

  • It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here
# Please include the following line of code as well
set.seed(0)

# create vectors from which to build d
V1 <- rep(month.abb[1:8], each=2)
V2 <- rep(letters[1:8], 2)
V3 <- rnorm(length(V1))

# create a data.table named d
d <- data.table(month=V1, event=V2, obs=V3)

Recall from lecture and tutorial that when working with data.frame objects, columns can be accessed using the $ operator. You may also have pushed forward on your own and found various tutorials outlining how to use dplyr and other tools in the tidyverse to select rows, perform operations on columns, and group operations by other variables. This will have lead you to lines of code using the %>% operator among others. dplyr and the tidyverse is awesome, but it isn’t what we are using in this unit. Unless explicitly instructed otherwise, please use data.table syntax (i.e., DT[i, j, by]) to answer all questions below.

1. Select rows

a. Use the data.table keyword .N to report how many rows are in d and store the result in a variable named ans_1a.

b. Return a data.table containing only the rows of d for which month=="Jan" and store the result in a variable named ans_1b.

c. Return a data.table containing only the rows of d for which month=="Feb" or event=="c" and store the result in a variable named ans_1c.

d. Return a data.table containing only the rows of d for which month=="Feb" and event=="c" and store the result in a variable named ans_1d.

e. Using the %in% operator, return a data.table containing only the rows of d for which month is in the vector c("Jan", "Mar", "May", "Jul"), and store the result in a variable named ans_1e.

2. Select and operate on columns

a. Return a vector containing the month column of d and store the result in a variable named ans_2a.

b. Return a data.table containing only the month column of d and store the result in a variable named ans_2b.

c. Return a data.table containing only the month and event columns of d and store the result in a variable named ans_2c.

d. Using only one line of code, return a data.table containing only the month and event columns of d, but renamed to MONTH and EVENT, and store the result in a variable named ans_2d.

e. Return the mean obs across all rows in d and store your result in a variable named ans_2e.

f. Return the mean and standard deviation of the obs column across all rows in d and store the result in a data.table named ans_2f. The resulting data.table should have 1 observation with columns named obs_mean and obs_sd.

3. Group by

a. Return the mean obs grouped by event and return the result in a data.table named ans_3a. Name the column of ans_3a that contains the mean of the obs column obs_mean.

b. Return the mean obs grouped by event and grouped by month and return the result in a data.table named ans_3b. Name the column of ans_3b that contains the mean of the obs column obs_mean.

4. [i, j, by]

a. Select the first 8 rows of d and compute the median value of the obs column grouped by month and event and return the result in a data.table named ans_4a. Name the column of ans_4a that contains the median of the obs column obs_median.

5. Add / remove columns, copy, etc.

a. Create a variable named d5a using d5a = data.table(d). Create a deep copy of d5a named d5a_copy and use the := operator to add a column of zeros to d5a_copy. Name the column of zeros new_zeros. Store the resulting data.table in a variable named ans_5a. Be very careful that the code you write and execute here does not modify d5a.

b. Create a variable named d5b using d5b = data.table(d). Create a deep copy of d5b named d5b_copy and use the := operator to remove the obs column from d5b_copy. Store the resulting data.table in a variable named ans_5b. Be very careful that the code you write and execute here does not modify d5b.

c. Consider the following methods of computing the mean value of the obs column in d5c – see below – grouped by month.

d5c <- data.table(d)

# option 1
d5c_2 <- d5c[, .(obs_mean = mean(obs)), .(month)]

# option 2
d5c[, obs_mean := mean(obs), .(month)]

# option 3
d5c <- d5c[, obs_mean := mean(obs), .(month)]

Which option is the least efficient method of using data.table ? Please store your answer as one of the following:

  • ans_5c <- 'option 1'
  • ans_5c <- 'option 2'
  • ans_5c <- 'option 3'

d. Consider the following code chunk.

V1 <- c(1, 2, 3)
V2 <- c('a', 'b', 'c')
dd <- data.table(V1, V2)

# option 1
dd2 <- dd
dd2[, V3 := c(T, T, F)]

# option 2
dd2 <- data.table(dd)
dd2[, V3 := c(T, T, F)]

# option 3
dd2  = copy(dd)
dd2[, V3 := c(T, T, F)]

Which option modifies dd? Please store your answer as one of the following:

  • ans_5d <- 'option 1'
  • ans_5d <- 'option 2'
  • ans_5d <- 'option 3'

6. Getting data into a data.table

  • Use the fread function to read the csv file located at data/maze_2.csv into a data.table.

  • Consider the following character variable:

s2 <- "ind2         obs
        Apr -0.15579551
        Apr -1.47075238
        May -0.47815006
        May  0.41794156
        Jun  1.35867955
        Jun -0.10278773
        Jul  0.38767161
        Jul -0.05380504
        Aug -1.37705956"
  • Convert s2 into a data.table named DT.

7. Wide vs long format

  • Is the following data.table in wide or long format?
##    participant condition maze_time
##          <int>    <char>     <num>
## 1:           1      mesh       3.4
## 2:           1      wood       3.8
## 3:           2      mesh       3.1
## 4:           2      wood       3.6
## 5:           3      mesh       3.5
## 6:           3      wood       3.9
  • Is the following data.table in wide or long format?
##    participant  mesh  wood
##          <int> <num> <num>
## 1:           1   3.4   3.8
## 2:           2   3.1   3.6
## 3:           3   3.5   3.9
  • What function from the data.table library would you use to convert from wide to long?

  • What function from the data.table library would you use to convert from long to wide?

8. Real world example

Consider an experiment designed to find out whether or not reward contingencies influence learning in a simple spatial decision task. In this experiment, rats repeatedly run a T-maze and must learn which arm is more likely to produce a reward.

On each trial the rat starts at the base of the maze and chooses to go left or right at the T-junction. One arm is more likely to produce a reward than the other. Over time, rats learn which choice is better. Learning in this context means gradually adjusting behaviour so that rewarded choices are made more often.

Next, you will perform a cursory analysis on data from this experiment.

For this exercise, use the local file in the tutorial folder. If your working directory is set correctly, you do not need anything beyond the downloaded tutorial folder.

a. Use the fread function in the data.table library to read the file at

data/experiment_1_summary.csv

into a data.table and store the result in a variable named ans_8a.

The resulting data.table contains the following columns:

  • experiment: Experiment identifier.

  • rat_id: Subject identifier (i.e., different numbers correspond to different rats).

  • trial: Trial number within the experiment.

  • cage_context: Context in which the maze was run (e.g., mesh or wood).

  • scent: Scent present in the testing environment.

  • choice: The rat’s choice at the T-junction (L or R).

  • reward: Outcome of the trial (1 = reward, 0 = no reward).

  • reaction_time: Time taken to initiate the choice.

  • maze_run_time: Time taken to run from the start of the maze to the goal location.


b. Select only rows for which cage_context=="mesh" or for which cage_context=="wood" and store the result in a new data.table named ans_8b.


c. From ans_8b select only the columns rat_id, cage_context, and maze_run_time, and store the result in a new data.table named ans_8c.


d.

Do not work on this problem until you have finished earlier parts of the worksheet. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.

Create a new data.table that contains the mean values of the maze_run_time column grouped by rat_id and cage_context. Name the column of mean values maze_run_time_mean. Make sure that your resulting data.table has only rat_id, cage_context, and maze_run_time_mean columns and nothing more (no columns named V1 etc). Also make sure that your resulting data.table only has one observation per rat_id and cage_context. Store the result in a variable named ans_8d.

Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.


e. Do not work on this problem until you have finished earlier parts of the worksheet. When you have, start a timer on a stop watch and then immediately work on the rest of this problem.

Open the file experiment_1_summary.csv from the course data page in Excel or any of your favourite spreadsheet programs. Using this spreadsheet program, perform all of the operations that lead to ans_8d.

Stop the timer you started at the beginning of this problem, and write down how long it took you to complete.

This comparison should make one point clear: for many people, R and data.table quickly become the more efficient option. Even if the first attempt takes longer, the workflow usually becomes faster, more flexible, and more powerful as skill develops.

On the other hand, you might be a super spreadsheet master like these people: https://stackoverflow.com/questions/37314049/how-to-calculate-the-means-of-100s-of-subgroups-in-excel

Even then, this unit uses R, so the goal here is to build confidence with that workflow.