Learn a repeatable workflow for getting data into a single
data.table when the data are stored as:

- a single clean csv file
- one file per subject in a directory
- files nested in sub-directories (e.g., one per experiment)
- messy platform-style files that need cleaning first
Learn how to locate dependent variables (things we measure) and independent variables (things we control) in an experimental dataset.
Learn how to recognise common experimental designs in the structure of the data, including within-subject and between-subject designs.
By the end of this tutorial, you should be able to wrangle the
data that you were assigned for your final project into a single
data.table that you can analyse in R.
The data used in this tutorial can be downloaded from:
https://github.com/crossley/cogs2020/tree/main/tutorials
Download the repository (or just the tutorials folder)
and make sure the data folder is located in your current
working directory before continuing.
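A quick way to confirm the set-up before reading anything, assuming the folder layout described above:

```r
# Check that R can see the data folder from the current working directory.
getwd()              # where R is currently looking
dir.exists("data")   # should be TRUE if the data folder is in place
list.files("data")   # should include experiment_1, experiment_2, ...
```

If `dir.exists("data")` returns FALSE, use `setwd()` (or your editor's project settings) to move to the folder that contains `data/` before continuing.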
This is the simplest case: one clean csv.
In our simulated T-maze dataset, each row is a trial and each column
is a variable. Some columns are dependent variables
(e.g., reaction_time, maze_run_time) and some
columns are independent variables (e.g.,
cage_context, scent,
experiment).
d_clean <- fread("data/experiment_1_summary.csv")
str(d_clean)
## Classes 'data.table' and 'data.frame': 4800 obs. of 9 variables:
## $ experiment : chr "exp1" "exp1" "exp1" "exp1" ...
## $ rat_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ trial : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cage_context : chr "mesh" "wood" "mesh" "mesh" ...
## $ scent : chr "none" "none" "none" "peppermint" ...
## $ choice : chr "L" "R" "L" "L" ...
## $ reward : int 1 0 1 1 1 1 0 1 1 1 ...
## $ reaction_time: num 0.584 0.673 0.542 0.536 0.377 ...
## $ maze_run_time: num 3.79 3.21 2.14 2.65 2.73 ...
## - attr(*, ".internal.selfref")=<externalptr>
d_clean[1:10]
## experiment rat_id trial cage_context scent choice reward reaction_time maze_run_time
## <char> <int> <int> <char> <char> <char> <int> <num> <num>
## 1: exp1 1 1 mesh none L 1 0.5842096 3.789606
## 2: exp1 1 2 wood none R 0 0.6728155 3.209561
## 3: exp1 1 3 mesh none L 1 0.5416682 2.137541
## 4: exp1 1 4 mesh peppermint L 1 0.5361437 2.648328
## 5: exp1 1 5 mesh lemon L 1 0.3766689 2.727809
## 6: exp1 1 6 mesh peppermint L 1 0.6000501 2.699873
## 7: exp1 1 7 mesh lemon R 0 0.6818149 3.493251
## 8: exp1 1 8 mesh peppermint R 1 0.4834151 2.367440
## 9: exp1 1 9 wood lemon L 1 0.6967780 3.369233
## 10: exp1 1 10 wood lemon L 1 0.6031714 2.495457
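One quick way to see the difference between the two kinds of variables is to inspect their values; a small sketch using the d_clean table loaded above:

```r
# Independent variables take a small, fixed set of experimenter-chosen values
d_clean[, unique(cage_context)]   # e.g., "mesh", "wood"
d_clean[, unique(scent)]          # e.g., "none", "peppermint", "lemon"

# Dependent variables are measured, and vary from trial to trial
d_clean[, summary(reaction_time)]
```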
At this point we are done: the data are in one
data.table.
In real experiments, you often get one file per subject. This is not an “analysis choice”. It is just how data collection often works (each subject completes the experiment at a different time, and the software writes one file at the end of the session).
In our case, each rat has one file in:
data/experiment_1/
This is still the same experiment design as Part 1. The data are just distributed across multiple files.
a. Point R at the directory.
data_dir <- "data/experiment_1"
b. List the files.
files <- list.files(
data_dir,
pattern = "\\.csv$",
full.names = TRUE
)
files
c. Read each file and store it in a list of
data.tables.
storage_list <- list()
for(i in seq_along(files)) {
storage_list[[i]] <- fread(files[i])
}
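The same loop can be written more compactly with lapply, which applies fread to every path in files (from step b) and returns the list directly:

```r
# Equivalent to the for loop above: one data.table per file, in a list
storage_list <- lapply(files, fread)
```

Either version works; the explicit loop makes the bookkeeping visible, which can be easier to debug.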
d. Combine the list into one
data.table.
d_all <- rbindlist(storage_list, fill = TRUE)
str(d_all)
d_all[1:10]
At this point we are done: all files are in one
data.table.
Sometimes files are organised by condition or by experiment.
This is where experimental design starts to show up clearly in the directory structure.
In our simulated dataset we have:
experiment_1/
experiment_2/
These two experiments differ in the reward structure (i.e., which arm is more likely to produce reward). That is an independent variable, and it is manipulated between-subjects (a given rat belongs to only one experiment folder).
Within each experiment, rats experience multiple contexts
(cage_context) across trials. That is also an
independent variable, and it is manipulated
within-subjects (the same rat experiences both contexts
across trials).
The goal is still the same: get everything into a single
data.table.

a. Define where the folder lives.
root_dir <- "data"
b. Define sub-directories and their labels.
sub_dirs <- c("experiment_1", "experiment_2")
c. Loop over directories, then loop over files. Add a label based on the directory name.
storage_list <- list()
k <- 1
for(j in seq_along(sub_dirs)) {
dir_path <- file.path(root_dir, sub_dirs[j])
files <- list.files(
dir_path,
pattern = "\\.csv$",
full.names = TRUE
)
for(i in seq_along(files)) {
tmp <- fread(files[i])
# label rows by the folder they came from
tmp[, source_dir := sub_dirs[j]]
storage_list[[k]] <- tmp
k <- k + 1
}
}
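An alternative to the nested loop is to list the files from both sub-directories in one call and recover the label from each file's parent folder. A sketch, assuming the same data/ layout (`basename(dirname(f))` extracts the folder name from a path):

```r
# list.files accepts a vector of directories, so both experiments
# can be listed in one call
files <- list.files(
    file.path(root_dir, sub_dirs),
    pattern = "\\.csv$",
    full.names = TRUE
)

# read each file and label its rows by the parent folder name
storage_list <- lapply(files, function(f) {
    tmp <- fread(f)
    tmp[, source_dir := basename(dirname(f))]
    tmp
})
```

Step d below then combines the list exactly as before.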
d. Combine into one data.table.
d_all <- rbindlist(storage_list, fill = TRUE)
str(d_all)
d_all[1:10]
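With everything combined, you can verify the design described earlier: each rat should appear in only one experiment folder (between-subjects) but in more than one cage context (within-subjects). A sketch using data.table's uniqueN; note that if each folder numbers its rats from 1, a rat is identified by the pair (source_dir, rat_id), not rat_id alone:

```r
# Within-subjects: the same rat sees multiple cage contexts across trials
d_all[, uniqueN(cage_context), by = .(source_dir, rat_id)]  # expect > 1

# Between-subjects: each rat belongs to exactly one experiment folder
d_all[, uniqueN(source_dir), by = rat_id]  # expect 1 if rat_ids are unique
```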
At this point we are done: all sub-directories are in one
data.table.
In real experiments, data are often stored in a way that is useful for the experiment software, but not immediately useful for analysis.
For example, platforms like Pavlovia tend to write out:

- one row per routine, so instruction and break rows are mixed in with trial rows
- extra metadata columns (e.g., browser, os, frameRate)
- missing values written as "" or "None"

Our simulated dataset includes a “Pavlovia-like” directory that captures the gist of this without being too chaotic:
data/messy_pavlovia/
The experiment design is still the same. The problem is just that the file contains lots of stuff that is not part of the trial-level dataset we want to analyse.
The goal is still to end up with a single tidy data.table.

a. Load the data.table library (and ggplot2 for plotting), and call
rm to be sure you are starting with a clean workspace.
library(data.table)
library(ggplot2)
rm(list = ls())
b. Load the data into a data.table
using fread.
d <- fread("data/messy_pavlovia/rat_01.csv")
c. Inspect the data using str.
str(d)
## Classes 'data.table' and 'data.frame': 205 obs. of 19 variables:
## $ participant : chr "rat_01" "rat_01" "rat_01" "rat_01" ...
## $ session : int 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "2026-02-12_1820" "2026-02-12_1820" "2026-02-12_1820" "2026-02-12_1820" ...
## $ expName : chr "tmaze_learning" "tmaze_learning" "tmaze_learning" "tmaze_learning" ...
## $ routine : chr "break" "break" "break" "break" ...
## $ rat_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ trial : int 50 100 150 200 NA 1 2 3 4 5 ...
## $ cage_context : chr "" "" "" "" ...
## $ scent : chr "" "" "" "" ...
## $ choice.keys : chr "" "" "" "" ...
## $ choice : chr "" "" "" "" ...
## $ reward : int NA NA NA NA NA 1 0 1 0 1 ...
## $ choice.rt : chr "" "" "" "" ...
## $ reaction_time: num NA NA NA NA NA ...
## $ maze_run_time: num NA NA NA NA NA ...
## $ browser : chr "Firefox" "Firefox" "Firefox" "Firefox" ...
## $ os : chr "Linux" "Linux" "Linux" "Linux" ...
## $ frameRate : num 58.9 58.9 58.9 58.9 58.9 ...
## $ experiment : chr "exp1" "exp1" "exp1" "exp1" ...
## - attr(*, ".internal.selfref")=<externalptr>
d. The first thing to do with messy files is to
identify which rows correspond to the observations you care about. Here,
the routine column tells you whether a row is a
"trial", a "break", or
"instructions".
Select only the trial rows, and keep only the columns we need for analysis.
d_trials <- d[
routine=="trial",
.(participant, rat_id, trial,
cage_context, scent,
choice, reward,
reaction_time, maze_run_time,
choice.rt)
]
e. Clean up missing values in choice.rt
and make sure it is numeric.
d_trials[choice.rt=="" | choice.rt=="None", choice.rt := NA]
d_trials[, choice.rt := as.numeric(choice.rt)]
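A quick sanity check that the cleaning worked as intended:

```r
# choice.rt should now be numeric, with NAs only where the raw file
# had "" or "None"
stopifnot(is.numeric(d_trials$choice.rt))
d_trials[, sum(is.na(choice.rt))]   # number of trials with no recorded RT
```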
f. Make a quick plot of a dependent variable.
ggplot(d_trials, aes(x = reaction_time)) +
geom_histogram(bins = 30)
Now assume you have one messy file per subject in a directory. The goal is to do the same cleaning as Part 4, repeated across all subjects, until everything is in one clean data.table.

a. Point R at the directory.
data_dir <- "data/messy_pavlovia"
b. List the files.
files <- list.files(
data_dir,
pattern = "\\.csv$",
full.names = TRUE
)
files
c. Loop over files. For each file: keep only the rows where
routine == "trial", select the analysis columns, clean up
choice.rt, and store the result.

storage_list <- list()
for(i in seq_along(files)) {
tmp <- fread(files[i])
tmp <- tmp[
routine=="trial",
.(participant, rat_id, trial,
cage_context, scent,
choice, reward,
reaction_time, maze_run_time,
choice.rt)
]
tmp[choice.rt=="" | choice.rt=="None", choice.rt := NA]
tmp[, choice.rt := as.numeric(choice.rt)]
storage_list[[i]] <- tmp
}
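When the same cleaning steps repeat across files, a tidy alternative is to wrap them in a function and lapply over the paths. A sketch equivalent to the loop above (read_one_rat is a name chosen here, not part of the dataset):

```r
# all of the per-file cleaning from the loop, in one reusable function
read_one_rat <- function(path) {
    tmp <- fread(path)
    tmp <- tmp[
        routine == "trial",
        .(participant, rat_id, trial,
          cage_context, scent,
          choice, reward,
          reaction_time, maze_run_time,
          choice.rt)
    ]
    tmp[choice.rt == "" | choice.rt == "None", choice.rt := NA]
    tmp[, choice.rt := as.numeric(choice.rt)]
    tmp
}

storage_list <- lapply(files, read_one_rat)
```

Step d then combines the list exactly as before.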
d. Combine everything into one
data.table.
d_all <- rbindlist(storage_list, fill = TRUE)
str(d_all)
d_all[1:10]
At this point we are done: the messy directory has been pulled into
one clean-ish data.table that you can actually analyse.
You now have the tools to wrangle your assigned dataset into a single
data.table that you can analyse in R. Good luck!