Learning objectives

  • Learn a repeatable workflow for getting data into a single data.table when the data are stored as:

    • one clean file
    • many clean files in a directory
    • many clean files spread across sub-directories
    • messy files in messy formats
  • Learn how to locate dependent variables (things we measure) and independent variables (things we control) in an experimental dataset.

  • Learn how to recognise common experimental designs in the structure of the data, including within-subject and between-subject designs.

  • By the end of this tutorial, you should be able to wrangle the data that you were assigned for your final project into a single data.table that you can analyse in R.

Tutorial data

The data used in this tutorial can be downloaded from:

https://github.com/crossley/cogs2020/tree/main/tutorials

Download the repository (or just the tutorials folder) and make sure the data folder is located in your current working directory before continuing.
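
Before continuing, it can help to confirm that R can actually see the data folder. The sketch below assumes the folder is called data, as in the rest of this tutorial. check_data_dir is a hypothetical helper name used only for illustration; it is not part of any package.

```r
# Hypothetical helper: report whether a folder is visible from the
# current working directory.
check_data_dir <- function(path = "data") {
  found <- dir.exists(path)
  if (!found) {
    message("Could not find '", path, "' from ", getwd(),
            ". Use setwd() or move the folder.")
  }
  found
}

getwd()            # the directory R is currently working in
check_data_dir()   # TRUE if the data folder is visible from here
```

If this returns FALSE, change the working directory (or move the data folder) before running the rest of the tutorial.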


Part 1 - Clean data in a single file

This is the simplest case: one clean csv.

In our simulated T-maze dataset, each row is a trial and each column is a variable. Some columns are dependent variables (e.g., reaction_time, maze_run_time) and some columns are independent variables (e.g., cage_context, scent, experiment).

library(data.table)

d_clean <- fread("data/experiment_1_summary.csv")

str(d_clean)
## Classes 'data.table' and 'data.frame':   4800 obs. of  9 variables:
##  $ experiment   : chr  "exp1" "exp1" "exp1" "exp1" ...
##  $ rat_id       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trial        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cage_context : chr  "mesh" "wood" "mesh" "mesh" ...
##  $ scent        : chr  "none" "none" "none" "peppermint" ...
##  $ choice       : chr  "L" "R" "L" "L" ...
##  $ reward       : int  1 0 1 1 1 1 0 1 1 1 ...
##  $ reaction_time: num  0.584 0.673 0.542 0.536 0.377 ...
##  $ maze_run_time: num  3.79 3.21 2.14 2.65 2.73 ...
##  - attr(*, ".internal.selfref")=<externalptr>
d_clean[1:10]
##     experiment rat_id trial cage_context      scent choice reward reaction_time maze_run_time
##         <char>  <int> <int>       <char>     <char> <char>  <int>         <num>         <num>
##  1:       exp1      1     1         mesh       none      L      1     0.5842096      3.789606
##  2:       exp1      1     2         wood       none      R      0     0.6728155      3.209561
##  3:       exp1      1     3         mesh       none      L      1     0.5416682      2.137541
##  4:       exp1      1     4         mesh peppermint      L      1     0.5361437      2.648328
##  5:       exp1      1     5         mesh      lemon      L      1     0.3766689      2.727809
##  6:       exp1      1     6         mesh peppermint      L      1     0.6000501      2.699873
##  7:       exp1      1     7         mesh      lemon      R      0     0.6818149      3.493251
##  8:       exp1      1     8         mesh peppermint      R      1     0.4834151      2.367440
##  9:       exp1      1     9         wood      lemon      L      1     0.6967780      3.369233
## 10:       exp1      1    10         wood      lemon      L      1     0.6031714      2.495457

At this point we are done: the data are in one data.table.
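
Once the data are loaded, one rough way to tell independent from dependent variables is to count how many distinct values each column takes. The sketch below uses a small made-up table (d_toy) with the same column names as the T-maze data. The values are invented for illustration and are not taken from the real file.

```r
library(data.table)

# Toy stand-in for d_clean (made-up values, same column names)
d_toy <- data.table(
  rat_id        = rep(1:2, each = 4),
  trial         = rep(1:4, times = 2),
  cage_context  = rep(c("mesh", "wood"), times = 4),
  scent         = c("none", "lemon", "peppermint", "none",
                    "lemon", "none", "peppermint", "lemon"),
  reaction_time = c(0.58, 0.67, 0.54, 0.53, 0.37, 0.60, 0.68, 0.48)
)

# Independent variables usually take a small set of repeated levels
d_toy[, lapply(.SD, uniqueN), .SDcols = c("cage_context", "scent")]

# Dependent variables usually vary from trial to trial
d_toy[, .(mean_rt = mean(reaction_time), sd_rt = sd(reaction_time))]

# Number of trials recorded for each rat
d_toy[, .N, by = rat_id]
```

This is only a heuristic: a column with few levels is a candidate independent variable, but the experiment description is what settles it.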


Part 2 - Clean data distributed across many files (one directory)

In real experiments, you often get one file per subject. This is not an “analysis choice”. It is just how data collection often works (each subject completes the experiment at a different time, and the software writes one file at the end of the session).

In our case, each rat has one file in:

data/experiment_1/

This is still the same experimental design as Part 1. The data are just distributed across multiple files.


a. Point R at the directory.

data_dir <- "data/experiment_1"

b. List the files.

files <- list.files(
  data_dir,
  pattern = "\\.csv$",
  full.names = TRUE
)

files

c. Read each file and store it in a list of data.tables.

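storage_list <- list()

# seq_along() is safe even if no files were found
# (1:length(files) would iterate over 1 and 0 in that case)
for(i in seq_along(files)) {
  storage_list[[i]] <- fread(files[i])
}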

d. Combine the list into one data.table.

d_all <- rbindlist(storage_list, fill = TRUE)

str(d_all)
d_all[1:10]

At this point we are done: all files are in one data.table.
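
The loop above is the core pattern. As an alternative sketch, the same steps can be written with lapply, and rbindlist can record which file each row came from through its idcol argument. The directory and file contents below are made up and written to a temporary folder, so the block runs without the tutorial data. The names demo_dir, files_demo, storage_demo, and d_demo are illustrative only.

```r
library(data.table)

# Build a throwaway directory with two small made-up files
demo_dir <- file.path(tempdir(), "experiment_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(rat_id = 1L, trial = 1:3, reward = c(1L, 0L, 1L)),
       file.path(demo_dir, "rat_01.csv"))
fwrite(data.table(rat_id = 2L, trial = 1:3, reward = c(0L, 0L, 1L)),
       file.path(demo_dir, "rat_02.csv"))

files_demo <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, naming each element after its file
storage_demo <- lapply(files_demo, fread)
names(storage_demo) <- basename(files_demo)

# idcol adds a column holding the list names, i.e. the source file
d_demo <- rbindlist(storage_demo, fill = TRUE, idcol = "source_file")
d_demo
```

Keeping the source file name is optional here because each file already has a rat_id column, but it makes it easier to trace a strange row back to the file it came from.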


Part 3 - Clean data across multiple sub-directories

Sometimes files are organised by condition or by experiment.

This is where experimental design starts to show up clearly in the directory structure.

In our simulated dataset we have:

data/experiment_1/
data/experiment_2/

These two experiments differ in the reward structure (i.e., which arm is more likely to produce reward). That is an independent variable, and it is manipulated between-subjects (a given rat belongs to only one experiment folder).

Within each experiment, rats experience multiple contexts (cage_context) across trials. That is also an independent variable, and it is manipulated within-subjects (the same rat experiences both contexts across trials).
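
One way to check this kind of claim directly in the data is to count how many levels of each variable a single subject experiences. The sketch below uses made-up toy data (d_toy) with the same column names. It is an illustration of the idea, not output from the real dataset.

```r
library(data.table)

# Toy data (made-up): rats 1-2 are in exp1, rats 3-4 are in exp2,
# and every rat sees both cage contexts
d_toy <- data.table(
  experiment   = rep(c("exp1", "exp2"), each = 8),
  rat_id       = rep(1:4, each = 4),
  cage_context = rep(c("mesh", "wood"), times = 8)
)

# Count how many levels of each variable a single rat experiences
design <- d_toy[, .(n_experiment = uniqueN(experiment),
                    n_context    = uniqueN(cage_context)),
                by = rat_id]
design

# One level per subject suggests a between-subjects variable;
# more than one level per subject suggests a within-subjects variable
all(design$n_experiment == 1)  # TRUE for this toy data: between-subjects
all(design$n_context > 1)      # TRUE for this toy data: within-subjects
```

The same two lines can be run on your own data once it is in a single data.table, substituting your subject identifier and candidate independent variables.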

The goal is still the same:

  • read everything
  • add a label
  • combine into one data.table

a. Define the root directory that contains the experiment folders.

root_dir <- "data"

b. Define the sub-directories. Their names will double as the labels.

sub_dirs <- c("experiment_1", "experiment_2")

c. Loop over directories, then loop over files. Add a label based on the directory name.

storage_list <- list()
k <- 1

for(j in seq_along(sub_dirs)) {

  dir_path <- file.path(root_dir, sub_dirs[j])

  files <- list.files(
    dir_path,
    pattern = "\\.csv$",
    full.names = TRUE
  )

  # seq_along() skips the loop cleanly if a folder contains no csv files
  for(i in seq_along(files)) {

    tmp <- fread(files[i])

    # label rows by the folder they came from
    tmp[, source_dir := sub_dirs[j]]

    storage_list[[k]] <- tmp
    k <- k + 1
  }
}

d. Combine into one data.table.

d_all <- rbindlist(storage_list, fill = TRUE)

str(d_all)
d_all[1:10]

At this point we are done: all sub-directories are in one data.table.
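
As an alternative sketch, list.files can walk into every sub-directory in one call with recursive = TRUE, and the label can then be recovered from each file's path. The folder tree below is made up and written to a temporary location. The names root_demo, files_demo, storage_demo, and d_demo are illustrative only.

```r
library(data.table)

# Throwaway folder tree with made-up files, mimicking data/experiment_*/
root_demo <- file.path(tempdir(), "data_demo")
for (sub_name in c("experiment_1", "experiment_2")) {
  dir.create(file.path(root_demo, sub_name),
             recursive = TRUE, showWarnings = FALSE)
  fwrite(data.table(rat_id = 1L, trial = 1:2),
         file.path(root_demo, sub_name, "rat_01.csv"))
}

# recursive = TRUE searches every sub-directory in one call
files_demo <- list.files(root_demo, pattern = "\\.csv$",
                         full.names = TRUE, recursive = TRUE)

storage_demo <- lapply(files_demo, function(f) {
  tmp <- fread(f)
  # the label is the name of the folder that directly contains the file
  tmp[, source_dir := basename(dirname(f))]
  tmp
})

d_demo <- rbindlist(storage_demo, fill = TRUE)
d_demo[, .N, by = source_dir]
```

The nested loop is more explicit about which folders are included. The recursive version is shorter but will pick up every csv file under the root, so it assumes the folder tree contains only files you want.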


Part 4 - Messy real-world data

In real experiments, data are often stored in a way that is useful for the experiment software, but not immediately useful for analysis.

For example, platforms like Pavlovia tend to write out:

  • extra columns you did not explicitly design
  • rows that are not trials (instructions, breaks, etc.)
  • missing values encoded as "" or "None"
  • numeric columns stored as characters

Our simulated dataset includes a “Pavlovia-like” directory that captures the gist of this without being too chaotic:

data/messy_pavlovia/

The experimental design is still the same. The problem is just that the file contains a lot of material that is not part of the trial-level dataset we want to analyse.


Analyse one messy file using data.table

a. Load the data.table and ggplot2 libraries, then use rm to clear the workspace so that you start from a clean slate.

library(data.table)
library(ggplot2)

rm(list = ls())

b. Load the data into a data.table using fread.

d <- fread("data/messy_pavlovia/rat_01.csv")

c. Inspect the data using str.

str(d)
## Classes 'data.table' and 'data.frame':   205 obs. of  19 variables:
##  $ participant  : chr  "rat_01" "rat_01" "rat_01" "rat_01" ...
##  $ session      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ date         : chr  "2026-02-12_1820" "2026-02-12_1820" "2026-02-12_1820" "2026-02-12_1820" ...
##  $ expName      : chr  "tmaze_learning" "tmaze_learning" "tmaze_learning" "tmaze_learning" ...
##  $ routine      : chr  "break" "break" "break" "break" ...
##  $ rat_id       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trial        : int  50 100 150 200 NA 1 2 3 4 5 ...
##  $ cage_context : chr  "" "" "" "" ...
##  $ scent        : chr  "" "" "" "" ...
##  $ choice.keys  : chr  "" "" "" "" ...
##  $ choice       : chr  "" "" "" "" ...
##  $ reward       : int  NA NA NA NA NA 1 0 1 0 1 ...
##  $ choice.rt    : chr  "" "" "" "" ...
##  $ reaction_time: num  NA NA NA NA NA ...
##  $ maze_run_time: num  NA NA NA NA NA ...
##  $ browser      : chr  "Firefox" "Firefox" "Firefox" "Firefox" ...
##  $ os           : chr  "Linux" "Linux" "Linux" "Linux" ...
##  $ frameRate    : num  58.9 58.9 58.9 58.9 58.9 ...
##  $ experiment   : chr  "exp1" "exp1" "exp1" "exp1" ...
##  - attr(*, ".internal.selfref")=<externalptr>

d. The first thing to do with messy files is to identify which rows correspond to the observations you care about. Here, the routine column tells you whether a row is a "trial", a "break", or "instructions".

Select only the trial rows, and keep only the columns we need for analysis.

d_trials <- d[
  routine=="trial",
  .(participant, rat_id, trial,
    cage_context, scent,
    choice, reward,
    reaction_time, maze_run_time,
    choice.rt)
]
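
Before trusting a filter like this, it can help to tabulate the routine column and confirm that the rows you keep are the ones you expect. The sketch below uses a made-up toy table (d_toy) with a few of the same column names, rather than the real file.

```r
library(data.table)

# Toy stand-in for a messy file (made-up rows, same column names)
d_toy <- data.table(
  routine       = c("instructions", "trial", "trial", "break", "trial"),
  trial         = c(NA, 1L, 2L, 2L, 3L),
  reaction_time = c(NA, 0.58, 0.67, NA, 0.54)
)

# How many rows of each kind does the file contain?
d_toy[, .N, by = routine]

# After filtering, check the dependent variable for missing values
d_trials_toy <- d_toy[routine == "trial"]
d_trials_toy[, sum(is.na(reaction_time))]
```

If the count of "trial" rows does not match the number of trials in the experiment, that is a sign the filter (or the file) needs a closer look.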

e. Clean up missing values in choice.rt and make sure it is numeric.

d_trials[choice.rt=="" | choice.rt=="None", choice.rt := NA]
d_trials[, choice.rt := as.numeric(choice.rt)]
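
A possible alternative is to handle the missing-value codes while reading, using fread's na.strings argument. The sketch below writes a tiny made-up file to a temporary location and reads it back. The file contents and the names demo_file and d_demo are illustrative, and it is worth confirming with str that the columns came out with the types you expect.

```r
library(data.table)

# Write a tiny made-up file that mimics the messy encoding
demo_file <- tempfile(fileext = ".csv")
writeLines(c("routine,trial,choice.rt",
             "break,,",
             "trial,1,0.52",
             "trial,2,None",
             "trial,3,0.61"),
           demo_file)

# Treat empty fields and the string "None" as missing while reading
d_demo <- fread(demo_file, na.strings = c("", "None"))

str(d_demo)                        # inspect the column types fread chose
d_demo[, sum(is.na(choice.rt))]    # count the missing values
```

This only works when you know the missing-value codes in advance. The two-step clean-up above has the advantage that you see the raw values first.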

f. Make a quick plot of a dependent variable.

ggplot(d_trials, aes(x = reaction_time)) +
  geom_histogram(bins = 30)


Part 5 - Messy real-world data (entire directory)

Now assume you have one messy file per subject in a directory. The goal is to do the same cleaning as Part 4, but repeated across all subjects:

  • read them all
  • keep only the trial rows
  • keep only the columns you need
  • clean obvious missing-value issues
  • combine everything into one data.table

a. Point R at the directory.

data_dir <- "data/messy_pavlovia"

b. List the files.

files <- list.files(
  data_dir,
  pattern = "\\.csv$",
  full.names = TRUE
)

files

c. Loop over files. For each file:

  • read it
  • keep only routine=="trial"
  • keep only the columns you need
  • clean choice.rt
  • store it in a list

storage_list <- list()

# seq_along() is safe even if the directory turns out to be empty
for(i in seq_along(files)) {

  tmp <- fread(files[i])

  tmp <- tmp[
    routine=="trial",
    .(participant, rat_id, trial,
      cage_context, scent,
      choice, reward,
      reaction_time, maze_run_time,
      choice.rt)
  ]

  tmp[choice.rt=="" | choice.rt=="None", choice.rt := NA]
  tmp[, choice.rt := as.numeric(choice.rt)]

  storage_list[[i]] <- tmp
}

d. Combine everything into one data.table.

d_all <- rbindlist(storage_list, fill = TRUE)

str(d_all)
d_all[1:10]

At this point we are done: the messy directory has been pulled into one clean-ish data.table that you can actually analyse.
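
When the same cleaning steps are repeated for every file, one option is to wrap them in a function and apply it to each path. The sketch below does this on made-up files written to a temporary folder. read_clean_file is a hypothetical helper name, and only a subset of the columns is kept to keep the example short.

```r
library(data.table)

# Hypothetical helper: the cleaning steps from the loop, for one file
read_clean_file <- function(path) {
  tmp <- fread(path)
  tmp <- tmp[routine == "trial",
             .(participant, trial, reaction_time, choice.rt)]
  # as.character() guards against fread having already typed the column
  tmp[, choice.rt := as.character(choice.rt)]
  tmp[choice.rt == "" | choice.rt == "None", choice.rt := NA]
  tmp[, choice.rt := as.numeric(choice.rt)]
  tmp
}

# Two made-up messy files in a throwaway directory
demo_dir <- file.path(tempdir(), "messy_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (id in c("rat_01", "rat_02")) {
  fwrite(data.table(participant   = id,
                    routine       = c("break", "trial", "trial"),
                    trial         = c(NA, 1L, 2L),
                    reaction_time = c(NA, 0.58, 0.67),
                    choice.rt     = c("", "0.52", "None")),
         file.path(demo_dir, paste0(id, ".csv")))
}

files_demo <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
d_demo <- rbindlist(lapply(files_demo, read_clean_file), fill = TRUE)

# Sanity check: each subject should contribute the expected number of trials
d_demo[, .N, by = participant]
```

Counting rows per participant after combining is a cheap way to spot a file that was truncated, duplicated, or filtered incorrectly.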

Part 6 - Wrangle data for your final project

You now have the tools to wrangle your assigned dataset into a single data.table that you can analyse in R. Good luck!