- Your ability to use R at a basic level.
- Your ability to use built-in R help documents.
- Your ability to wrangle data at a basic level in R using the data.table package.
- Your ability to interact with data.table objects to report simple descriptive statistics.
- Your ability to use data.table objects and ggplot to illustrate basic trends in data.
- Your ability to perform basic file I/O and the corresponding data wrangling that ensues.
- Your ability to tie descriptive statistics as numbers to data visualisations.
Write a .Rmd script that replicates each ggplot figure and data.table output shown below exactly. When you are finished, knit your .Rmd to a .pdf file, being careful to set eval=T and echo=T for each code chunk so that both your code syntax and your plots / output are visible. Please submit the resulting .pdf file through ilearn.
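For example, a chunk of the following form makes both the code and its output appear in the knitted document:

```{r, eval=T, echo=T}
# your analysis code here
```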
This homework is structured as a walk-through of an analysis of a simple experiment. Throughout, I write comments and ask prompting questions. Usually, these sorts of things are bulleted. In any case, they are there simply to guide you along my thought process, and in so doing, to give you hints that direct you towards methods that will help you reproduce my output and plots.
Again, you are marked on your ability to reproduce my plots and output, and also perhaps on your ability to explain and defend your code should it be flagged as seeming too heavily influenced by ChatGPT. It is fine, by the way, to have even all of your code written by ChatGPT, but only if you come away with a deep understanding of your own code. You may lose marks if you write code that you do not understand.
20% of your overall mark for this assessment will be determined from whether or not your code runs without error. If it does, then the full 20% will be given. If it does not, then you will lose this 20%. No partial credit will be given here.
40% of your overall mark for this assessment will be determined from whether or not you replicate the ggplot figures exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All ggplot figures are equally weighted with respect to marks.
40% of your overall mark for this assessment will be determined from whether or not you replicate the data.table output exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All data.table output is weighted equally with respect to marks.
Load the data.table, ggplot2, and ggpubr libraries and remove any lingering variables from your R session memory.

library(data.table)
library(ggplot2)
library(ggpubr)
rm(list=ls())
This problem set uses data from an online experiment located here: https://run.pavlovia.org/demos/stroop/
You should participate in the experiment so that you get a good feel for what it consists of.
You can get the data for this experiment here: https://gitlab.pavlovia.org/demos/stroop
Download the entire data directory, wrangle the data into a single data.table, and add a column named subject that indicates from which subject each observation was taken.
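One way to approach this (a minimal sketch, assuming the downloaded CSV files sit in a local folder called data; the folder name and file pattern are assumptions you should adjust to match your download):

# list every csv file in the downloaded data directory
f <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# read and stack all files; fill=TRUE tolerates files with differing
# columns, and idcol adds an integer column ("subject") recording
# which file (i.e., which subject) each row came from
d <- rbindlist(lapply(f, fread), fill = TRUE, idcol = "subject")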
Your data.table should look very similar to the following. Please note that at this stage it is okay if your numbers are a bit different from what is displayed here. We will have more to say about this in the next section.
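The structure listing below can be produced with str(), assuming your combined table is named d:

str(d)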
## Classes 'data.table' and 'data.frame': 12024 obs. of 22 variables:
## $ participant : chr "0" "0" "0" "0" ...
## $ session : int 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "2023-12-20_17h02.42.844" "2023-12-20_17h02.42.844" "2023-12-20_17h02.42.844" "2023-12-20_17h02.42.844" ...
## $ expName : chr "stroop" "stroop" "stroop" "stroop" ...
## $ psychopyVersion : chr "2023.1.3" "2023.1.3" "2023.1.3" "2023.1.3" ...
## $ OS : chr "Win32" "Win32" "Win32" "Win32" ...
## $ frameRate : num 60.2 60.2 60.2 60.2 40 ...
## $ resp.keys : chr "" "left" "right" "right" ...
## $ resp.corr : int NA 0 1 0 NA 1 1 1 1 1 ...
## $ resp.rt : num NA 14.741 0.477 0.311 NA ...
## $ resp.duration : num NA NA NA NA NA NA NA NA NA NA ...
## $ trials.thisRepN : int NA 0 0 0 NA 0 0 0 0 0 ...
## $ trials.thisTrialN: int NA 0 1 2 NA 0 1 2 3 4 ...
## $ trials.thisN : int NA 0 1 2 NA 0 1 2 3 4 ...
## $ trials.thisIndex : int NA 4 3 2 NA 3 0 1 2 4 ...
## $ trials.ran : int NA 1 1 1 NA 1 1 1 1 1 ...
## $ text : chr "" "blue" "green" "green" ...
## $ letterColor : chr "" "blue" "blue" "green" ...
## $ corrAns : chr "" "right" "right" "down" ...
## $ congruent : int NA 1 0 1 NA 0 1 0 1 1 ...
## $ subject : int 1 1 1 1 2 2 2 2 2 2 ...
## $ code : chr NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
The main reason why your data.table may contain different values from mine is that the data comes from an experiment that is open to the public and run online. This means that the data is not static. It is constantly being added to as new people participate in the experiment.
To demonstrate this, let's visualise the dates of the data that we have collected. To do this, we will first need to convert the date column (which is currently of type "character") to a POSIXct object (i.e., a date-time object). The following code does this (please include this in your submitted files).
# convert the date column to a POSIXct object
d[, date_time := as.POSIXct(date, format = "%Y-%m-%d_%Hh%M.%S.%OS")]
# visualise the distribution of dates remaining in our data
# (should not contain any dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
theme_minimal()
## Warning: Removed 223 rows containing non-finite outside the scale range
## (`stat_bin()`).
# filter out any data that was collected after 2024-01-01
d <- d[date_time < as.POSIXct("2024-01-01")]
# visualise the distribution of dates remaining in our data
# (should not contain any dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
theme_minimal()
We should now all have the exact same data in our data.table object and can move on to the next step.
Let's get a feel for the shapes and patterns of our data. We will start by looking at the distribution of response times in our data.
## Warning: Removed 422 rows containing non-finite outside the scale range
## (`stat_bin()`).
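A plot along these lines can be produced as follows (a minimal sketch, assuming the response-time column is resp.rt as in the structure listing earlier; the binning and styling are assumptions):

# sketch: distribution of raw response times
ggplot(d, aes(x = resp.rt)) +
  geom_histogram(bins = 100, color = "black", fill = "blue") +
  labs(x = "Response time (s)", y = "Frequency")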
This plot shows us that we have some very extreme observations that almost certainly do not reflect the performance of engaged participants.
Let's have a look at a more constrained range to shed some light on the issue.
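One way to restrict the range (a sketch; the 0 to 5 second window is an assumption based on the description below) is coord_cartesian(), which zooms the axes without excluding any observations from the binning:

# sketch: same histogram, zoomed to the 0-5 second range
ggplot(d, aes(x = resp.rt)) +
  geom_histogram(bins = 100, color = "black", fill = "blue") +
  coord_cartesian(xlim = c(0, 5)) +
  labs(x = "Response time (s)", y = "Frequency")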
Here we see that almost no responses take longer than 2.5 seconds, and essentially none by 5 seconds. We also see a rise in response times near zero, which is unrealistically fast. These fast times are likely the result of people blindly pushing any key without really thinking about the task.
The slow times are easy to deal with (see the next bullet), but the fast times are not so trivial because we can't know which of them come from engaged people who simply responded quickly versus which come from people who are not at all engaged and are just pushing buttons.
Fortunately, it seems that these extreme data points are very rare, so they will likely be removed simply by excluding the most extreme 1% of response times.
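A sketch of that exclusion using quantile():

# keep only responses faster than the 99th percentile
# (note: rows with NA response times are also dropped by this filter)
d <- d[resp.rt < quantile(resp.rt, probs = 0.99, na.rm = TRUE)]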
After making this exclusion, the distribution of response times looks like this:
The very fast response times probably correspond to participants ignoring the task and instead simply pressing the button as quickly as they can over and over so that the experiment will end quickly. This hypothesis predicts that some people just don’t try, and these people should have very poor accuracy. This might give us a sound way to catch and exclude this source of noise.
Create a new column in your data.table that encodes response accuracy. Compute the average response accuracy per subject as well as the average response time per subject.
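A sketch of these steps, assuming accuracy is derived from resp.corr (the column names acc and rt, and the per-subject table name dd, are hypothetical):

# per-trial accuracy: 1 = correct, 0 = incorrect
d[, acc := resp.corr]
# collapse across trials: one row per subject
dd <- d[, .(acc = mean(acc, na.rm = TRUE),
            rt = mean(resp.rt, na.rm = TRUE)),
        .(subject)]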
The above step is very important. It is the step where we go from having multiple observations from each subject to having a single observation per subject. That is, we collapse across trials.
Visualise the result of these operations with histograms of response times and response accuracy.
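For example, assuming the per-subject table dd sketched above, the two histograms can be placed side by side with ggarrange() from the ggpubr library loaded earlier:

p_rt <- ggplot(dd, aes(x = rt)) +
  geom_histogram(bins = 30, color = "black", fill = "blue") +
  labs(x = "Mean response time (s)", y = "Frequency")
p_acc <- ggplot(dd, aes(x = acc)) +
  geom_histogram(bins = 30, color = "black", fill = "blue") +
  labs(x = "Mean accuracy", y = "Frequency")
ggarrange(p_rt, p_acc, ncol = 2)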
We see that the suspicious fast response times are also present in this new data.table (although they do appear less prominent), which means that they may indeed be mostly coming from a subset of participants.
We can also see that some people failed to get any correct at all (0% accuracy). That may be remarkable in its own right, but for now, let's see if it helps us understand the source of the very fast response times. Exclude subjects that on average got less than 50% correct and remake the response time histogram.
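A sketch of that exclusion, again assuming the hypothetical per-subject table dd:

# keep only subjects with at least 50% accuracy
dd <- dd[acc >= 0.5]
# remake the per-subject response time histogram
ggplot(dd, aes(x = rt)) +
  geom_histogram(bins = 30, color = "black", fill = "blue") +
  labs(x = "Mean response time (s)", y = "Frequency")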