- Your ability to use R at a basic level.
- Your ability to use built-in R help documents.
- Your ability to wrangle data at a basic level in R using the data.table package.
- Your ability to interact with data.table objects to report simple descriptive statistics.
- Your ability to use data.table objects and ggplot to illustrate basic trends in data.
- Your ability to perform basic file I/O and the corresponding data wrangling that ensues.
- Your ability to tie descriptive statistics as numbers to data visualisations.
Write a .Rmd script that replicates each ggplot figure and data.table output shown below exactly. When you are finished, knit your .Rmd to a .pdf file, being careful to set eval=T and echo=T for each code chunk so that both your code syntax and your plots / output are visible. Please submit the resulting .pdf file through ilearn.
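For example, a chunk of the following form makes both the code and its output appear in the knitted document:

```{r, eval=T, echo=T}
# your analysis code here
```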
This homework is structured as a walk-through of an analysis of a simple experiment. Throughout, I write comments and ask prompting questions. Usually, these sorts of things are bulleted. In any case, they are there simply to guide you along my thought process, and in so doing, to give you hints that direct you towards methods that will help you reproduce my output and plots.
Again, you are marked on your ability to reproduce my plots and output, and also perhaps on your ability to explain and defend your code should it be flagged as seeming too heavily influenced by ChatGPT. It is fine, by the way, to have even all of your code written by ChatGPT, but only if you come away with a deep understanding of your own code. You may lose marks if you write code that you do not understand.
20% of your overall mark for this assessment will be determined from whether or not your code runs without error. If it does, then the full 20% will be given. If it does not, then you will lose this 20%. No partial credit will be given here.
40% of your overall mark for this assessment will be determined from whether or not you replicate the ggplot figures exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All ggplot figures are equally weighted with respect to marks.
40% of your overall mark for this assessment will be determined from whether or not you replicate the data.table output exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All data.table output is weighted equally with respect to marks.
Load the data.table, ggplot2, and ggpubr libraries and remove any lingering variables from your R session memory.

library(data.table)
library(ggplot2)
library(ggpubr)
rm(list=ls())
This problem set uses data from an online experiment located here: https://run.pavlovia.org/demos/stroop/
You should participate in the experiment so that you get a good feel for what it consists of.
You can get the data for this experiment here: https://gitlab.pavlovia.org/demos/stroop
Download the entire data directory, wrangle the data into a single data.table, and add a column named subject that indicates from which subject each observation was taken.
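One way to approach this (a minimal sketch, assuming the downloaded CSV files sit in a local folder called data; the folder name and file pattern are assumptions you should adjust to match your download):

# list every csv file in the downloaded data directory
f <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# read and stack all files; fill=TRUE tolerates files with differing
# columns, and idcol adds an integer column ("subject") recording
# which file (i.e., which subject) each row came from
d <- rbindlist(lapply(f, fread), fill = TRUE, idcol = "subject")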
Your data.table should look very similar to the following. Please note that at this stage it is okay if your numbers are a bit different from what is displayed here. We will have more to say about this in the next section.
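The structure listing below can be produced with str(), assuming your combined table is named d:

str(d)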
## Classes 'data.table' and 'data.frame': 12024 obs. of 22 variables:
## $ participant : chr "0" "0" "0" "0" ...
## $ session : int 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "2023-12-20_17h02.42.844" "2023-12-20_17h02.42.844" "2023-12-20_17h02.42.844" "2023-12-20_17h02.42.844" ...
## $ expName : chr "stroop" "stroop" "stroop" "stroop" ...
## $ psychopyVersion : chr "2023.1.3" "2023.1.3" "2023.1.3" "2023.1.3" ...
## $ OS : chr "Win32" "Win32" "Win32" "Win32" ...
## $ frameRate : num 60.2 60.2 60.2 60.2 40 ...
## $ resp.keys : chr "" "left" "right" "right" ...
## $ resp.corr : int NA 0 1 0 NA 1 1 1 1 1 ...
## $ resp.rt : num NA 14.741 0.477 0.311 NA ...
## $ resp.duration : num NA NA NA NA NA NA NA NA NA NA ...
## $ trials.thisRepN : int NA 0 0 0 NA 0 0 0 0 0 ...
## $ trials.thisTrialN: int NA 0 1 2 NA 0 1 2 3 4 ...
## $ trials.thisN : int NA 0 1 2 NA 0 1 2 3 4 ...
## $ trials.thisIndex : int NA 4 3 2 NA 3 0 1 2 4 ...
## $ trials.ran : int NA 1 1 1 NA 1 1 1 1 1 ...
## $ text : chr "" "blue" "green" "green" ...
## $ letterColor : chr "" "blue" "blue" "green" ...
## $ corrAns : chr "" "right" "right" "down" ...
## $ congruent : int NA 1 0 1 NA 0 1 0 1 1 ...
## $ subject : int 1 1 1 1 2 2 2 2 2 2 ...
## $ code : chr NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
The main reason why your data.table may contain different values from mine is that the data comes from an experiment that is open to the public and run online. This means that the data is not static. It is constantly being added to as new people participate in the experiment.
To demonstrate this, let's visualise the dates of the data that we have collected. To do this, we will first need to convert the date column (which is currently of type "character") to a POSIXct object (i.e., a date-time object). The following code does this (please include this in your submitted files).
# convert the date column to a POSIXct object
d[, date_time := as.POSIXct(date, format = "%Y-%m-%d_%Hh%M.%S.%OS")]
# visualise the distribution of dates remaining in our data
# (should not contain any dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
theme_minimal()
## Warning: Removed 223 rows containing non-finite outside the scale range
## (`stat_bin()`).
# filter out any data that was collected after 2024-01-01
d <- d[date_time < as.POSIXct("2024-01-01")]
# visualise the distribution of dates remaining in our data
# (should not contain any dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
theme_minimal()
We should now all have the exact same data in our data.table object and can move on to the next step.
Let's get a feel for the shapes and patterns of our data. We will start by looking at the distribution of response times in our data.
## Warning: Removed 422 rows containing non-finite outside the scale range
## (`stat_bin()`).
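A plot along these lines can be produced as follows (a minimal sketch, assuming the response-time column is resp.rt as in the structure listing earlier; the binning and styling are assumptions):

# sketch: distribution of raw response times
ggplot(d, aes(x = resp.rt)) +
  geom_histogram(bins = 100, color = "black", fill = "blue") +
  labs(x = "Response time (s)", y = "Frequency")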
This plot shows us that we have some very extreme observations that almost certainly do not reflect the performance of engaged participants.
Let's have a look at a more constrained range to shed some light on the issue.
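One way to restrict the range (a sketch; the 0 to 5 second window is an assumption based on the description below) is coord_cartesian(), which zooms the axes without excluding any observations from the binning:

# sketch: same histogram, zoomed to the 0-5 second range
ggplot(d, aes(x = resp.rt)) +
  geom_histogram(bins = 100, color = "black", fill = "blue") +
  coord_cartesian(xlim = c(0, 5)) +
  labs(x = "Response time (s)", y = "Frequency")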
Here we see that almost no responses take longer than 2.5 seconds, and essentially none by 5 seconds. We also see a rise in response times near zero, which is unrealistically fast. These fast times are likely the result of people blindly pushing any key without really thinking about the task.
The slow times are easy to deal with (see the next bullet), but the fast times are not so trivial because we can't know which of them come from engaged people who simply responded quickly versus which come from people who are not at all engaged and are just pushing buttons.
Fortunately, it seems that these extreme data points are very rare, so they will likely be removed simply by excluding the most extreme 1% of response times.
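A sketch of that exclusion using quantile():

# keep only responses faster than the 99th percentile
# (note: rows with NA response times are also dropped by this filter)
d <- d[resp.rt < quantile(resp.rt, probs = 0.99, na.rm = TRUE)]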
After making this exclusion, the distribution of response times looks like this:
The very fast response times probably correspond to participants ignoring the task and instead simply pressing the button as quickly as they can over and over so that the experiment will end quickly. This hypothesis predicts that some people just don’t try, and these people should have very poor accuracy. This might give us a sound way to catch and exclude this source of noise.
Create a new column in your data.table that encodes response accuracy. Compute the average response accuracy per subject as well as the average response time per subject.
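A sketch of these steps, assuming accuracy is derived from resp.corr (the column names acc and rt, and the per-subject table name dd, are hypothetical):

# per-trial accuracy: 1 = correct, 0 = incorrect
d[, acc := resp.corr]
# collapse across trials: one row per subject
dd <- d[, .(acc = mean(acc, na.rm = TRUE),
            rt = mean(resp.rt, na.rm = TRUE)),
        .(subject)]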
The above step is very important. It is the step where we go from having multiple observations from each subject to having a single observation per subject. That is, we collapse across trials.
Visualise the result of these operations with histograms of response times and response accuracy.
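For example, assuming the per-subject table dd sketched above, the two histograms can be placed side by side with ggarrange() from the ggpubr library loaded earlier:

p_rt <- ggplot(dd, aes(x = rt)) +
  geom_histogram(bins = 30, color = "black", fill = "blue") +
  labs(x = "Mean response time (s)", y = "Frequency")
p_acc <- ggplot(dd, aes(x = acc)) +
  geom_histogram(bins = 30, color = "black", fill = "blue") +
  labs(x = "Mean accuracy", y = "Frequency")
ggarrange(p_rt, p_acc, ncol = 2)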
We see that the suspicious fast response times are also present in this new data.table (although they do appear less prominent), which means that they may indeed be mostly coming from a subset of participants.
We can also see that some people failed to get any correct at all (0% accuracy). That may be remarkable in its own right, but for now, let's see if it helps us understand the source of the very fast response times. Exclude subjects that on average got less than 50% correct and remake the response time histogram.
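A sketch of that exclusion, again assuming the hypothetical per-subject table dd:

# keep only subjects with at least 50% accuracy
dd <- dd[acc >= 0.5]
# remake the per-subject response time histogram
ggplot(dd, aes(x = rt)) +
  geom_histogram(bins = 30, color = "black", fill = "blue") +
  labs(x = "Mean response time (s)", y = "Frequency")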