- Your ability to use R at a basic level.
- Your ability to use built-in R help documents (a brief example follows this list).
- Your ability to wrangle data at a basic level in R using the data.table package.
- Your ability to interact with data.table objects to report simple descriptive statistics.
- Your ability to use data.table objects and ggplot to illustrate basic trends in data.
- Your ability to perform basic file I/O and the corresponding data wrangling that ensues.
- Your ability to tie descriptive statistics as numbers to data visualisations.
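For example, the built-in help system can be queried directly from the console (a minimal illustration; the exact topics you look up will depend on the problems below):

# open the help page for a specific function
?fread
# search all installed help files for a keyword
help.search("histogram")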
Write a .Rmd script that replicates each ggplot figure and data.table output shown below exactly. When you are finished, knit your .Rmd to a .pdf file, being careful to set eval=T and echo=T for each code chunk so that both your code syntax and your plots / output are visible. Please submit the resulting .pdf file through ilearn.
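As a reminder, chunk options go in the chunk header of your .Rmd; a minimal example (the chunk label here is arbitrary) looks like this:

```{r my-chunk, eval=T, echo=T}
# your code here
```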
This homework is structured as a walk-through of an analysis of a simple experiment. Throughout, I write comments and ask prompting questions. Usually, these sorts of things are bulleted. In any case, they are there simply to guide you along my thought process, and in so doing, to give you hints that direct you towards methods that will help you reproduce my output and plots.
Again, you are marked on your ability to reproduce my plots and output, and also perhaps on your ability to explain and defend your code should it be flagged as seeming too heavily influenced by ChatGPT. It is fine, by the way, to have even all of your code written by ChatGPT, but only if you come away with a deep understanding of your own code. You may lose marks if you write code that you do not understand.
20% of your overall mark for this assessment will be determined by whether or not your code runs without error. If it does, then the full 20% will be given. If it does not, then you will lose this 20%. No partial credit will be given here.
40% of your overall mark for this assessment will be determined by whether or not you replicate the ggplot figures exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All ggplot figures are equally weighted with respect to marks.
40% of your overall mark for this assessment will be determined by whether or not you replicate the data.table output exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All data.table output is weighted equally with respect to marks.
Load the data.table, ggplot2, and ggpubr libraries and remove any lingering variables from your R session memory.

library(data.table)
library(ggplot2)
library(ggpubr)
rm(list=ls())
This problem set uses data from an online experiment located here: https://run.pavlovia.org/demos/stroop/
You should participate in the experiment so that you get a good feel for what it consists of.
You can get the data for this experiment here: https://gitlab.pavlovia.org/demos/stroop
Download the entire data directory, wrangle the data into a single data.table, and add a column named subject that indicates from which subject each observation was taken.
Your data.table should look very similar to the following. Please note that at this stage it is okay if your numbers are a bit different from what is displayed here. We will have more to say about this in the next section.
# Load the `data.table` and `ggplot2` libraries and `rm` to
# be sure you are starting with a clean work space.
library(data.table)
library(ggplot2)
rm(list = ls())

# Use the `list.files` function to save a list of file paths
# corresponding to your just-downloaded data files. Here, I
# have written `'data/stroop'` because that is where I put
# the data that I downloaded. You will need to change this to
# something appropriate to where you put these files.
file_list <- list.files('Your_file_path_here',
                        full.names=T,
                        pattern='.csv$' # select only .csv files
                        )
# Here, we have used the `pattern` argument of the
# `list.files` function to select only the files in our data
# directory that have a `.csv` in their file name. The
# trailing `$` is saying that the `.csv` bit needs to come
# at the end of the file name. However, we will still have
# to deal with the fact that some of these csv files are
# empty.
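# (Aside, purely illustrative and not needed for the assignment: the `$`
#  anchors the match to the end of the string, so
#  grepl('.csv$', c('a.csv', 'a.csv.bak')) returns TRUE FALSE;
#  only names ending in .csv are matched.)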
# create an empty list with as many elements as we have data
# files
d_list <- vector("list", length(file_list))

# iterate through our list of data files and read each into
# a data.table, which we then store in d_list
for(i in 1:length(file_list)) {
  # Check the file size, and if it's too small to be actual
  # data, then skip it
  if(file.size(file_list[[i]]) > 5) {
    z <- fread(file_list[[i]])
    z[, subject := i]
    d_list[[i]] <- z
  }
}
# bind our list of data.tables into one big data.table
d <- rbindlist(d_list, fill=T)
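# (fill=TRUE fills columns that are missing from a particular file with
#  NA so that files with slightly different column sets can still be
#  stacked; NULL entries left by the skipped empty files are dropped.)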
# ask how many unique subjects we have (which for us in this
# example is equivalent to the number of unique files you've
# loaded)
# d[, length(unique(subject))]
str(d)
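If you would like a quick sanity check on what was loaded (optional, and not part of the required output), a one-line data.table call such as the following counts the rows contributed by each subject:

# number of rows (trials) contributed by each subject
d[, .N, by = subject]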
The main reason why your data.table may contain different values from mine is that the data comes from an experiment that is open to the public and run online. This means that the data is not static. It is constantly being added to as new people participate in the experiment.
To demonstrate this, let's visualise the dates of the data that we have collected. To do this, we will first need to convert the date column – which is currently of type "character" – to a POSIXct object (i.e., a date-time object). The following code does this (please include this in your submitted files).
# convert the date column to a POSIXct object
d[, date_time := as.POSIXct(date, format = "%Y-%m-%d_%Hh%M.%S.%OS")]

# visualise the distribution of dates currently in our data
# (may still contain dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
  geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
  labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
  theme_minimal()
# filter out any data that was collected after 2024-01-01
d <- d[date_time < as.POSIXct("2024-01-01")]

# visualise the distribution of dates remaining in our data
# (should not contain any dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
  geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
  labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
  theme_minimal()
We should now all have the exact same data in our data.table object and can move on to the next step.
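If you want to convince yourself that the filter worked (again optional, not part of the required output), you could check the latest remaining timestamp:

# the latest remaining timestamp should fall before 2024-01-01
d[, max(date_time, na.rm = TRUE)]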
Let's get a feel for the shapes and patterns of our data. We will start by looking at the distribution of response times in our data.
ggplot(d, aes(resp.rt)) +
  geom_histogram(binwidth=0.01) +
  labs(title = "Distribution of Response Time",
       x = "Response Time",
       y = "Count") +
  theme_minimal()
This plot shows us that we have some very extreme observations that almost certainly do not reflect the performance of engaged participants.
Let's have a look at a more constrained range to shed some light on the issue.
ggplot(d[resp.rt < 10], aes(resp.rt)) +
  geom_histogram(binwidth=0.01) +
  labs(title = "Distribution of Response Time",
       x = "Response Time",
       y = "Count") +
  theme_minimal()
Here we see that almost no responses take longer than 2.5 seconds, and essentially none take longer than 5 seconds. We also see that there is a rise in response times near zero (unrealistically fast). The fast times are likely the result of people blindly pushing any key without really attending to the task.
The slow times are easy to deal with (see next bullet), but the fast times are not so trivial because we can't know which of them come from engaged people who simply responded quickly versus which come from people who are not at all engaged and are just pushing buttons.
Fortunately, it seems that these extreme data points are very rare, so they will likely be excluded simply by excluding the most extreme 1% of response times.
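If quantile-based trimming is unfamiliar, here is a toy illustration on made-up numbers (purely for intuition; the actual exclusion is in the next chunk):

# 99 ordinary values plus one extreme outlier
x <- c(rnorm(99), 50)
# keep only values below the 99th percentile, dropping roughly the top 1%
x_trimmed <- x[x < quantile(x, 0.99)]
length(x_trimmed)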
After making this exclusion, the distribution of response times looks like this:
d <- d[resp.rt < quantile(resp.rt, 0.99, na.rm=T)]

ggplot(d, aes(resp.rt)) +
  geom_histogram(binwidth=0.01) +
  labs(title = "Distribution of Response Time",
       x = "Response Time",
       y = "Count") +
  theme_minimal()
The very fast response times probably correspond to participants ignoring the task and instead simply pressing the button as quickly as they can over and over so that the experiment will end quickly. This hypothesis predicts that some people just don’t try, and these people should have very poor accuracy. This might give us a sound way to catch and exclude this source of noise.
Create a new column in your data.table that encodes response accuracy. Compute the average response accuracy per subject as well as the average response time per subject.
The above step is very important. It is the step where we go from having multiple observations from each subject to having a single observation per subject. That is, we collapse across trials.
Visualise the result of these operations with histograms of response times and response accuracy.
# first create a column to indicate response accuracy
d[, acc := corrAns == resp.keys]
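# (the logical TRUE/FALSE values in `acc` are treated as 1/0 by `mean`,
#  so the per-subject mean computed below is the proportion correct)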
dd <- d[, .(mean_acc = mean(acc), mean_rt = mean(resp.rt)), .(subject)]
g_rt <- ggplot(dd, aes(x=mean_rt)) +
  geom_histogram(binwidth=0.025) +
  labs(title = "Distribution of Response Time",
       x = "Mean Response Time",
       y = "Count") +
  theme_minimal()

g_acc <- ggplot(dd, aes(x=mean_acc)) +
  geom_histogram(binwidth=0.05) +
  labs(title = "Distribution of Response Accuracy",
       x = "Mean Response Accuracy",
       y = "Count") +
  theme_minimal()

ggarrange(g_rt, g_acc, ncol=2)
We see that the suspicious fast response times are also present in this new data.table – although they do appear less prominent – which means that they may indeed be mostly coming from a subset of participants.
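One informal way to check this (optional, not part of the required output) is to look at the fastest few subjects and their accuracy side by side:

# the five subjects with the fastest mean response times, with their accuracy
dd[order(mean_rt)][1:5]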
We can also see that some people failed to get any correct at all (0% accuracy). That may be remarkable in its own right, but for now, let's see if it helps us understand the source of the very fast response times. Exclude subjects that on average got less than 50% correct and remake the response time histogram.
exc_subs <- dd[mean_acc < 0.5, subject]
d <- d[!(subject %in% exc_subs)]
dd <- dd[!(subject %in% exc_subs)]
g_rt <- ggplot(dd, aes(x=mean_rt)) +
  geom_histogram(binwidth=0.025) +
  labs(title = "Distribution of Response Time",
       x = "Mean Response Time",
       y = "Count") +
  theme_minimal()

g_acc <- ggplot(dd, aes(x=mean_acc)) +
  geom_histogram(binwidth=0.05) +
  labs(title = "Distribution of Response Accuracy",
       x = "Mean Response Accuracy",
       y = "Count") +
  theme_minimal()

ggarrange(g_rt, g_acc, ncol=2)
Finally, recompute the per-subject means separately for congruent and incongruent trials, and visualise mean response time and mean accuracy by congruency using histograms and boxplots.

dd <- d[, .(mean_acc = mean(acc), mean_rt = mean(resp.rt)), .(subject, congruent)]
dd <- dd[!(subject %in% exc_subs)]
g1 <- ggplot(dd, aes(x=mean_rt, fill=factor(congruent))) +
  geom_histogram(bins=50) +
  labs(title = "Distribution of Mean Response Time",
       x = "Mean Response Time",
       y = "Frequency",
       fill = "Congruency") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.box.spacing = unit(0.2, "cm"),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        plot.title = element_text(hjust = 0.5))

g2 <- ggplot(dd, aes(mean_acc, fill=factor(congruent))) +
  geom_histogram(bins=20) +
  labs(title = "Distribution of Mean Response Accuracy",
       x = "Mean Accuracy",
       y = "Frequency",
       fill = "Congruency") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.box.spacing = unit(0.2, "cm"),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        plot.title = element_text(hjust = 0.5))

g3 <- ggplot(dd, aes(x=as.factor(congruent), y=mean_rt)) +
  geom_boxplot() +
  labs(x = "Congruency",
       y = "Mean Response Time",
       fill = "Congruency") +
  theme_minimal()

g4 <- ggplot(dd, aes(x=as.factor(congruent), y=mean_acc)) +
  geom_boxplot() +
  labs(x = "Congruency",
       y = "Mean Response Accuracy",
       fill = "Congruency") +
  theme_minimal()

ggarrange(g1, g2, g3, g4, ncol=2, nrow=2)
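If you want the numbers behind these boxplots (an optional sketch using the dd table defined above, not required output), a grouped summary does the job:

# median of the per-subject means by congruency, matching the boxplots above
dd[, .(median_rt = median(mean_rt), median_acc = median(mean_acc)), by = congruent]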