Tutorial 4 - Data wrangling and descriptive statistics


Learning objectives

This is our final week of nuts and bolts with R, data.table, and ggplot2. Our goal at this stage is to feel pretty comfortable using these tools to wrangle data, make basic plots, and calculate basic summary statistics.
You have your first homework due at the beginning of next week, so the focus of this tutorial is to give you a chance to get it finished while your tutor is around to answer your questions and resolve any issues you’re facing.
We will also aim to consolidate the lecture content from this week.

Group activity (~20 minutes)

We will start with a group activity to consolidate the lecture content from this week.
Your tutor will split you into groups of about 3-4 people. Please adhere to these groupings.
As a group, please come up with 3 to 4 quiz questions that you think would be a good assessment of the lecture content this week.
I will attempt to solve a handful of these questions in front of the class or on video.
I get that you might want to make them sneaky. Please do. But more importantly, please try to get some spread in topics so that much or all of the lecture is covered.
Please post your questions (without answers) in the iLearn discussion forum. If you are the first batch and there is no thread, then please create it. It is also not a problem to create multiple threads if that’s the way it works out.

Work towards finishing your homework

The following may be useful to you for wrangling the data for your homework. Please also see the video I recorded on this topic and the iLearn announcement where you can download a clean version of the data (without potentially harmful php files).

Get some experiment data

Complete the experiment located at the following address: https://run.pavlovia.org/demos/simplertt/
Be sure to enter your name or some other ID that you will remember and can be easily searched for.
Download your data in a .csv file by taking the following steps:
Go here: https://gitlab.pavlovia.org/demos/simplertt
Read about the experiment you just participated in by scrolling to the bottom and reading the README.md file.
Click the data folder to land at the following address: https://gitlab.pavlovia.org/demos/simplertt/tree/master/data
Near the top right of the page, click the Find file button, and search for a file containing the unique ID that you entered at the beginning of the experiment.
Click on the .csv file that pops up.
Finally, click the download button to download your .csv file to your local machine.

Analyse your data using `data.table`

Load the data.table and ggplot2 libraries, and use rm to be sure you are starting with a clean workspace.

library(data.table)
library(ggplot2)

rm(list = ls())

Load the data into a data.table using the fread function from the data.table library.

# You need to replace the path I use here with a path that
# points to wherever you have your data stored.
d <- fread('https://crossley.github.io/book_stats/data/real_data/simpleRTT/data_mjc_simpleRTT_2022-02-26_05h13.22.csv')
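If you downloaded your own file, a minimal sketch of reading it from disk might look like the following. The file name and its contents are made up so that the example runs anywhere; replace `csv_path` with the location of your real .csv file.

```r
library(data.table)

# Hypothetical stand-in for a downloaded file: write a tiny .csv
# to a temporary location so this sketch runs on any machine.
csv_path <- file.path(tempdir(), "my_simpleRTT_data.csv")
fwrite(data.table(response.rt = c(0.31, 0.27, 0.28),
                  isi = c(1.2, 2.6, 1.7)),
       csv_path)

# Check that the path points at a real file before reading it;
# a wrong path is a common reason for fread to fail.
stopifnot(file.exists(csv_path))

d_local <- fread(csv_path)
d_local
```

With your real data you would skip the fwrite step and only set `csv_path` and call fread.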

We can see that d has quite a few columns by printing it to the console (which I won’t do here because the output is very ugly and I want to keep things clean). You, however, should do it, so you can see what I mean.
When data.table objects have lots of columns, str can be a good summary function to use for basic inspection.

str(d)
## Classes 'data.table' and 'data.frame':   28 obs. of  26 variables:
##  $ response.keys            : chr  "space" "space" "space" "space" ...
##  $ response.corr            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ response.rt              : num  0.306 0.273 0.283 0.317 0.3 0.286 0.272 0.3 0.308 0.284 ...
##  $ mouseResp.x              : num  0.228 0.228 0.228 0.228 0.228 ...
##  $ mouseResp.y              : num  -0.334 -0.334 -0.334 -0.334 -0.334 ...
##  $ mouseResp.leftButton     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mouseResp.midButton      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mouseResp.rightButton    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ practiceTrials.thisRepN  : int  0 0 0 0 0 0 0 0 NA NA ...
##  $ practiceTrials.thisTrialN: int  0 1 2 3 4 5 6 7 NA NA ...
##  $ practiceTrials.thisN     : int  0 1 2 3 4 5 6 7 NA NA ...
##  $ practiceTrials.thisIndex : int  1 7 3 2 5 6 4 0 NA NA ...
##  $ practiceTrials.ran       : int  1 1 1 1 1 1 1 1 NA NA ...
##  $ isi                      : num  1.22 2.56 1.67 1.44 2.11 ...
##  $ participant              : chr  "mjc" "mjc" "mjc" "mjc" ...
##  $ session                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ date                     : chr  "2022-02-26_05h13.22" "2022-02-26_05h13.22" "2022-02-26_05h13.22" "2022-02-26_05h13.22" ...
##  $ expName                  : chr  "simpleRTT" "simpleRTT" "simpleRTT" "simpleRTT" ...
##  $ psychopyVersion          : chr  "2021.2.3" "2021.2.3" "2021.2.3" "2021.2.3" ...
##  $ OS                       : chr  "MacIntel" "MacIntel" "MacIntel" "MacIntel" ...
##  $ frameRate                : num  58.8 58.8 58.8 58.8 58.8 ...
##  $ mainTrials.thisRepN      : int  NA NA NA NA NA NA NA NA 0 0 ...
##  $ mainTrials.thisTrialN    : int  NA NA NA NA NA NA NA NA 0 1 ...
##  $ mainTrials.thisN         : int  NA NA NA NA NA NA NA NA 0 1 ...
##  $ mainTrials.thisIndex     : int  NA NA NA NA NA NA NA NA 5 8 ...
##  $ mainTrials.ran           : int  NA NA NA NA NA NA NA NA 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

It’s certainly difficult to know what all of these columns encode. This is something you will get used to as you build and run your own experiments (e.g., as you will in later COGS units). For now, I’ll just tell you. The data contains a row for every trial completed, including practice trials.
You can tell which rows correspond to practice trials and which correspond to experiment trials by examining the columns named practiceTrials.thisN and mainTrials.thisN.

d[, .(practiceTrials.thisN, mainTrials.thisN)]
##     practiceTrials.thisN mainTrials.thisN
##                    <int>            <int>
##  1:                    0               NA
##  2:                    1               NA
##  3:                    2               NA
##  4:                    3               NA
##  5:                    4               NA
##  6:                    5               NA
##  7:                    6               NA
##  8:                    7               NA
##  9:                   NA                0
## 10:                   NA                1
## 11:                   NA                2
## 12:                   NA                3
## 13:                   NA                4
## 14:                   NA                5
## 15:                   NA                6
## 16:                   NA                7
## 17:                   NA                8
## 18:                   NA                9
## 19:                   NA               10
## 20:                   NA               11
## 21:                   NA               12
## 22:                   NA               13
## 23:                   NA               14
## 24:                   NA               15
## 25:                   NA               16
## 26:                   NA               17
## 27:                   NA               18
## 28:                   NA               19
##     practiceTrials.thisN mainTrials.thisN

We can see that mainTrials.thisN is NA during practice and practiceTrials.thisN is NA during the main experiment.
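Because each column is NA in exactly one phase, `!is.na()` can be used to pick out the rows of a phase. Here is a small sketch on a toy table (the values are illustrative, not the real data) that counts the rows in each phase.

```r
library(data.table)

# Toy table with the same two columns as d: 3 practice rows
# followed by 4 main rows (values are illustrative only).
toy <- data.table(
  practiceTrials.thisN = c(0L, 1L, 2L, NA, NA, NA, NA),
  mainTrials.thisN     = c(NA, NA, NA, 0L, 1L, 2L, 3L)
)

# is.na() returns TRUE where a value is missing, so !is.na()
# keeps only the rows where the column has a value.
n_practice <- toy[!is.na(practiceTrials.thisN), .N]
n_main     <- toy[!is.na(mainTrials.thisN), .N]

c(practice = n_practice, main = n_main)
```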
We can also see that our reaction time per trial is stored in a column named response.rt and the inter-stimulus interval (isi) is stored in a column named isi.
response.rt is our main dependent variable of interest, and isi is an independent variable.
A dependent variable is a variable or outcome not controlled by the experimenter but rather observed in the experiment.
An independent variable is a variable or experiment factor controlled by the experimenter that may influence observations from a dependent variable.
You can also see that both response.rt and isi are continuous (as opposed to discrete or categorical) numbers bounded between zero and positive infinity.
We will talk about these key terms in more depth coming up in lecture.
Okay, we are now equipped to pull out just the rows and columns that we need for a simple exploration of our performance.

# We begin by just looking at the main trials
d_main <- d[!is.na(mainTrials.thisN), .(response.rt, isi)]

# We don't have enough data points for a histogram to look
# very good, but it is generally a good place to start so
# you get a feel for the shape of your data.
ggplot(d_main, aes(x=response.rt)) + 
  geom_histogram(bins=30)


# The histogram shows that between 0.2 and 0.3 s our
# observations seem a bit bell-shaped. However, we can
# clearly see that a response time is never negative and is
# sometimes quite far away from the peak of the bell shape
# (where most of the observations are located).

# From the histogram, you can eye-ball a reasonable guess
# about the central tendency and spread of this sample, but
# it would also be very reasonable to dig a bit deeper with
# some additional plots.
ggplot(d_main, aes(x=0, y=response.rt)) + 
  geom_boxplot() +
  geom_point() +
  xlim(-1, 1)


# Finally, we report basic descriptive statistics as actual
# numbers for the main experiment
d_main[, .(rt_mean=mean(response.rt), 
           rt_median=median(response.rt), 
           rt_sd=sd(response.rt))]
##    rt_mean rt_median     rt_sd
##      <num>     <num>     <num>
## 1: 0.31745     0.269 0.1530361

# Report basic descriptive statistics for the main
# experiment using `chaining` to avoid creating a new
# data.table
d[!is.na(mainTrials.thisN), .(response.rt, isi)][, .(rt_mean=mean(response.rt), 
                                                     rt_median=median(response.rt), 
                                                     rt_sd=sd(response.rt))]
##    rt_mean rt_median     rt_sd
##      <num>     <num>     <num>
## 1: 0.31745     0.269 0.1530361

# You should check these numbers against the plots you've
# made to ensure they are all consistent with each other.
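One way to see why the mean sits above the median here is to recompute the statistics with the single slowest trial set aside. The sketch below uses the 20 main-trial response times from the example data in this tutorial; the names `rt` and `rt_trimmed` are just for illustration.

```r
# The 20 main-trial response times (in seconds) from the example data.
rt <- c(0.308, 0.284, 0.322, 0.267, 0.284, 0.266, 0.271, 0.251,
        0.241, 0.267, 0.302, 0.260, 0.250, 0.300, 0.334, 0.266,
        0.236, 0.440, 0.939, 0.261)

# Summary statistics using every trial.
all_trials <- c(mean = mean(rt), median = median(rt), sd = sd(rt))

# The same statistics after setting aside the single slowest trial.
# This is only to illustrate sensitivity, not a recommendation to
# delete data.
rt_trimmed <- rt[rt < max(rt)]
without_slowest <- c(mean = mean(rt_trimmed),
                     median = median(rt_trimmed),
                     sd = sd(rt_trimmed))

round(rbind(all_trials, without_slowest), 3)
```

The mean and standard deviation change much more than the median does, which is consistent with the long right tail visible in the histogram and the isolated point in the boxplot.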

Next, we report basic descriptive statistics for practice and for the main experiment at the same time. To do this, it will be very helpful to have a single new column that indicates whether a particular row in d corresponds to a practice trial or to a main trial.

d[, phase := "Unknown"]
d[!is.na(practiceTrials.thisN), phase := "practice"]
d[!is.na(mainTrials.thisN), phase := "main"]

d[, .(phase, response.rt)]
##        phase response.rt
##       <char>       <num>
##  1: practice       0.306
##  2: practice       0.273
##  3: practice       0.283
##  4: practice       0.317
##  5: practice       0.300
##  6: practice       0.286
##  7: practice       0.272
##  8: practice       0.300
##  9:     main       0.308
## 10:     main       0.284
## 11:     main       0.322
## 12:     main       0.267
## 13:     main       0.284
## 14:     main       0.266
## 15:     main       0.271
## 16:     main       0.251
## 17:     main       0.241
## 18:     main       0.267
## 19:     main       0.302
## 20:     main       0.260
## 21:     main       0.250
## 22:     main       0.300
## 23:     main       0.334
## 24:     main       0.266
## 25:     main       0.236
## 26:     main       0.440
## 27:     main       0.939
## 28:     main       0.261
##        phase response.rt
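As an alternative sketch, the same kind of labelling can be written in one step with data.table's `fcase` function (this assumes a reasonably recent data.table version that includes `fcase`). The toy table below is illustrative and not the real data.

```r
library(data.table)

# Toy table (illustrative values): 2 practice rows, 2 main rows,
# and 1 row that belongs to neither phase.
toy <- data.table(
  practiceTrials.thisN = c(0L, 1L, NA, NA, NA),
  mainTrials.thisN     = c(NA, NA, 0L, 1L, NA)
)

# fcase() checks each condition in order and uses the value paired
# with the first one that is TRUE; rows matching none get `default`.
toy[, phase := fcase(
  !is.na(practiceTrials.thisN), "practice",
  !is.na(mainTrials.thisN),     "main",
  default = "Unknown"
)]

toy
```

Either approach gives the same kind of column; the three-line version makes each step explicit, which can be easier to read while you are learning.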

The reason having this new column helps so much is because we can now use it to group descriptive statistics as follows:

d[, .(rt_mean=mean(response.rt), 
      rt_median=median(response.rt), 
      rt_sd=sd(response.rt)), 
  .(phase)]
##       phase  rt_mean rt_median      rt_sd
##      <char>    <num>     <num>      <num>
## 1: practice 0.292125     0.293 0.01615494
## 2:     main 0.317450     0.269 0.15303611
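When comparing groups it is often useful to report how many observations each summary is based on. A minimal sketch, using a toy table with made-up values, adds a count with `.N` and spells out the grouping argument as `by = phase`.

```r
library(data.table)

# Toy data (illustrative values, in seconds).
toy <- data.table(
  phase       = c("practice", "practice", "main", "main", "main"),
  response.rt = c(0.30, 0.28, 0.31, 0.27, 0.44)
)

# .N is the number of rows in each group; `by = phase` is the
# named form of the grouping argument.
toy_summary <- toy[, .(n_trials = .N,
                       rt_mean  = mean(response.rt)),
                   by = phase]
toy_summary
```

Applied to d, the same pattern would show that the practice summary rests on 8 trials and the main summary on 20.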

It also facilitates plotting comparisons as follows:

ggplot(d, aes(x=phase, y=response.rt)) + 
  geom_boxplot()

From here, we can ask if there is any relationship between our dependent and our independent variables. When looking for relationships between two variables, one of the first things we usually do is plot one of them on the x-axis and the other of them on the y-axis.

ggplot(d, aes(x=isi, y=response.rt)) +
  geom_point()

No obvious relationship jumps out to my eye. You?
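If you would like a number to go with the scatter plot, one common option is the Pearson correlation computed with base R's `cor` function. The sketch below uses made-up values rather than the real data, so the result says nothing about the actual experiment.

```r
library(data.table)

# Illustrative values only (seconds); not taken from the real data.
toy <- data.table(
  isi         = c(1.2, 2.6, 1.7, 1.4, 2.1, 0.9, 3.0, 1.9),
  response.rt = c(0.31, 0.27, 0.28, 0.32, 0.30, 0.29, 0.27, 0.30)
)

# Pearson correlation: -1 and +1 are perfect linear relationships,
# and values near 0 indicate little linear relationship.
r <- toy[, cor(isi, response.rt)]
r
```

With only a handful of trials, a correlation like this is a rough description at best, so it is worth reading it alongside the plot rather than instead of it.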

Tutorial 4 - Data wrangling and descriptive statistics

Author: Matthew J. Crossley

Last update: 20 May, 2025
