- Your ability to use R at a basic level.
- Your ability to use built-in R help documents (a brief example follows this list).
- Your ability to wrangle data at a basic level in R using the data.table package.
- Your ability to interact with data.table objects to report simple descriptive statistics.
- Your ability to use data.table objects and ggplot to illustrate basic trends in data.
- Your ability to perform basic file I/O and the corresponding data wrangling that ensues.
- Your ability to tie descriptive statistics as numbers to data visualisations.
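For example, the built-in help system can be queried directly from the console (a minimal illustration; the exact topics you look up will depend on the problems below):

# open the help page for a specific function
?fread
# search all installed help files for a keyword
help.search("histogram")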
Write a .Rmd script that replicates each ggplot figure and data.table output shown below exactly. When you are finished, knit your .Rmd to a .pdf file, being careful to set eval=T and echo=T for each code chunk so that both your code syntax and your plots / output are visible. Please submit the resulting .pdf file through ilearn.
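As a reminder, chunk options go in the chunk header of your .Rmd; a minimal example (the chunk label here is arbitrary) looks like this:

```{r my-chunk, eval=T, echo=T}
# your code here
```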
This homework is structured as a walk-through of an analysis of a simple experiment. Throughout, I write comments and ask prompting questions. Usually, these sorts of things are bulleted. In any case, they are there simply to guide you along my thought process, and in so doing, to give you hints that direct you towards methods that will help you reproduce my output and plots.
Again, you are marked on your ability to reproduce my plots and output, and also perhaps on your ability to explain and defend your code should it be flagged as seeming too heavily influenced by ChatGPT. It is fine, by the way, to have even all of your code written by ChatGPT, but only if you come away with a deep understanding of your own code. You may lose marks if you write code that you do not understand.
20% of your overall mark for this assessment will be determined by whether or not your code runs without error. If it does, then the full 20% will be given. If it does not, then you will lose this 20%. No partial credit will be given here.
40% of your overall mark for this assessment will be determined by whether or not you replicate the ggplot figures exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All ggplot figures are equally weighted with respect to marks.
40% of your overall mark for this assessment will be determined by whether or not you replicate the data.table output exactly as shown below (or as exactly as possible given the note below). If you get close but are not exact (some core concept is missing), then you will be awarded half of the total marks for that problem. All data.table output is weighted equally with respect to marks.
Load the data.table, ggplot2, and ggpubr libraries and remove any lingering variables from your R session memory.

library(data.table)
library(ggplot2)
library(ggpubr)
rm(list=ls())
This problem set uses data from an online experiment located here: https://run.pavlovia.org/demos/stroop/
You should participate in the experiment so that you get a good feel for what it consists of.
You can get the data for this experiment here: https://gitlab.pavlovia.org/demos/stroop
Download the entire data directory, wrangle the data into a single data.table, and add a column named subject that indicates from which subject each observation was taken.
Your data.table should look very similar to the following. Please note that at this stage it is okay if your numbers are a bit different from what is displayed here. We will have more to say about this in the next section.
# Load the `data.table` and `ggplot2` libraries and `rm` to
# be sure you are starting with a clean work space.
library(data.table)
library(ggplot2)
rm(list = ls())

# Use the `list.files` function to save a list of file paths
# corresponding to your just-downloaded data files. Here, I
# have written `'data/stroop'` because that is where I put
# the data that I downloaded. You will need to change this to
# something appropriate to where you put these files.
file_list <- list.files('Your_file_path_here',
                        full.names=T,
                        pattern='.csv$' # select only .csv files
                        )
# Here, we have used the `pattern` argument of the
# `list.files` function to select only the files in our data
# directory that have a `.csv` in their file name. The
# trailing `$` is saying that the `.csv` bit needs to come
# at the end of the file name. However, we will still have
# to deal with the fact that some of these csv files are
# empty.
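# (Aside, purely illustrative and not needed for the assignment: the `$`
#  anchors the match to the end of the string, so
#  grepl('.csv$', c('a.csv', 'a.csv.bak')) returns TRUE FALSE;
#  only names ending in .csv are matched.)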
# create an empty list with as many elements as we have data
# files
d_list <- vector("list", length(file_list))

# iterate through our list of data files and read each into
# a data.table, which we then store in d_list
for(i in 1:length(file_list)) {
  # Check the file size, and if it's too small to be actual
  # data, then skip it
  if(file.size(file_list[[i]]) > 5) {
    z <- fread(file_list[[i]])
    z[, subject := i]
    d_list[[i]] <- z
  }
}
# bind our list of data.tables into one big data.table
d <- rbindlist(d_list, fill=T)
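# (fill=TRUE fills columns that are missing from a particular file with
#  NA so that files with slightly different column sets can still be
#  stacked; NULL entries left by the skipped empty files are dropped.)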
# ask how many unique subjects we have (which for us in this
# example is equivalent to the number of unique files you've
# loaded)
# d[, length(unique(subject))]
str(d)
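If you would like a quick sanity check on what was loaded (optional, and not part of the required output), a one-line data.table call such as the following counts the rows contributed by each subject:

# number of rows (trials) contributed by each subject
d[, .N, by = subject]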
The main reason why your data.table may contain different values from mine is that the data comes from an experiment that is open to the public and run online. This means that the data is not static. It is constantly being added to as new people participate in the experiment.
To demonstrate this, let's visualise the dates of the data that we have collected. To do this, we will first need to convert the date column – which is currently of type "character" – to a POSIXct object (i.e., a date-time object). The following code does this (please include this in your submitted files).
# convert the date column to a POSIXct object
d[, date_time := as.POSIXct(date, format = "%Y-%m-%d_%Hh%M.%S.%OS")]

# visualise the distribution of dates currently in our data
# (may still contain dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
  geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
  labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
  theme_minimal()
# filter out any data that was collected after 2024-01-01
d <- d[date_time < as.POSIXct("2024-01-01")]

# visualise the distribution of dates remaining in our data
# (should not contain any dates after 2024-01-01)
ggplot(d, aes(x = date_time)) +
  geom_histogram(binwidth = 86400, color = "black", fill = "blue") +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
  labs(title = "Histogram of Dates", x = "Date", y = "Frequency") +
  theme_minimal()
We should now all have the exact same data in our data.table object and can move on to the next step.
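If you want to convince yourself that the filter worked (again optional, not part of the required output), you could check the latest remaining timestamp:

# the latest remaining timestamp should fall before 2024-01-01
d[, max(date_time, na.rm = TRUE)]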
Let's get a feel for the shapes and patterns of our data. We will start by looking at the distribution of response times in our data.
ggplot(d, aes(resp.rt)) +
  geom_histogram(binwidth=0.01) +
  labs(title = "Distribution of Response Time",
       x = "Response Time",
       y = "Count") +
  theme_minimal()
This plot shows us that we have some very extreme observations that almost certainly do not reflect the performance of engaged participants.
Let's have a look at a more constrained range to shed some light on the issue.
ggplot(d[resp.rt < 10], aes(resp.rt)) +
  geom_histogram(binwidth=0.01) +
  labs(title = "Distribution of Response Time",
       x = "Response Time",
       y = "Count") +
  theme_minimal()
Here we see that almost no responses take longer than 2.5 seconds, and essentially none take longer than 5 seconds. We also see that there is a rise in response times near zero (unrealistically fast). The fast times are likely the result of people blindly pushing any key without really attending to the task.
The slow times are easy to deal with (see next bullet), but the fast times are not so trivial because we can't know which of them come from engaged people who simply responded quickly versus which come from people who are not at all engaged and are just pushing buttons.
Fortunately, it seems that these extreme data points are very rare, so they will likely be excluded simply by excluding the most extreme 1% of response times.
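If quantile-based trimming is unfamiliar, here is a toy illustration on made-up numbers (purely for intuition; the actual exclusion is in the next chunk):

# 99 ordinary values plus one extreme outlier
x <- c(rnorm(99), 50)
# keep only values below the 99th percentile, dropping roughly the top 1%
x_trimmed <- x[x < quantile(x, 0.99)]
length(x_trimmed)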
After making this exclusion, the distribution of response times looks like this:
d <- d[resp.rt < quantile(resp.rt, 0.99, na.rm=T)]

ggplot(d, aes(resp.rt)) +
  geom_histogram(binwidth=0.01) +
  labs(title = "Distribution of Response Time",
       x = "Response Time",
       y = "Count") +
  theme_minimal()
The very fast response times probably correspond to participants ignoring the task and instead simply pressing the button as quickly as they can over and over so that the experiment will end quickly. This hypothesis predicts that some people just don’t try, and these people should have very poor accuracy. This might give us a sound way to catch and exclude this source of noise.
Create a new column in your data.table that encodes response accuracy. Compute the average response accuracy per subject as well as the average response time per subject.
The above step is very important. It is the step where we go from having multiple observations from each subject to having a single observation per subject. That is, we collapse across trials.
Visualise the result of these operations with histograms of response times and response accuracy.
# first create a column to indicate response accuracy
d[, acc := corrAns == resp.keys]
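# (the logical TRUE/FALSE values in `acc` are treated as 1/0 by `mean`,
#  so the per-subject mean computed below is the proportion correct)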
dd <- d[, .(mean_acc = mean(acc), mean_rt = mean(resp.rt)), .(subject)]
g_rt <- ggplot(dd, aes(x=mean_rt)) +
  geom_histogram(binwidth=0.025) +
  labs(title = "Distribution of Response Time",
       x = "Mean Response Time",
       y = "Count") +
  theme_minimal()

g_acc <- ggplot(dd, aes(x=mean_acc)) +
  geom_histogram(binwidth=0.05) +
  labs(title = "Distribution of Response Accuracy",
       x = "Mean Response Accuracy",
       y = "Count") +
  theme_minimal()

ggarrange(g_rt, g_acc, ncol=2)
We see that the suspicious fast response times are also present in this new data.table – although they do appear less prominent – which means that they may indeed be mostly coming from a subset of participants.
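One informal way to check this (optional, not part of the required output) is to look at the fastest few subjects and their accuracy side by side:

# the five subjects with the fastest mean response times, with their accuracy
dd[order(mean_rt)][1:5]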
We can also see that some people failed to get any correct at all (0% accuracy). That may be remarkable in its own right, but for now, let's see if it helps us understand the source of the very fast response times. Exclude subjects that on average got less than 50% correct and remake the response time histogram.
exc_subs <- dd[mean_acc < 0.5, subject]
d <- d[!(subject %in% exc_subs)]
dd <- dd[!(subject %in% exc_subs)]
g_rt <- ggplot(dd, aes(x=mean_rt)) +
  geom_histogram(binwidth=0.025) +
  labs(title = "Distribution of Response Time",
       x = "Mean Response Time",
       y = "Count") +
  theme_minimal()

g_acc <- ggplot(dd, aes(x=mean_acc)) +
  geom_histogram(binwidth=0.05) +
  labs(title = "Distribution of Response Accuracy",
       x = "Mean Response Accuracy",
       y = "Count") +
  theme_minimal()

ggarrange(g_rt, g_acc, ncol=2)
Finally, recompute the per-subject means separately for congruent and incongruent trials, and visualise mean response time and mean accuracy by congruency using histograms and boxplots.

dd <- d[, .(mean_acc = mean(acc), mean_rt = mean(resp.rt)), .(subject, congruent)]
dd <- dd[!(subject %in% exc_subs)]
g1 <- ggplot(dd, aes(x=mean_rt, fill=factor(congruent))) +
  geom_histogram(bins=50) +
  labs(title = "Distribution of Mean Response Time",
       x = "Mean Response Time",
       y = "Frequency",
       fill = "Congruency") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.box.spacing = unit(0.2, "cm"),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        plot.title = element_text(hjust = 0.5))

g2 <- ggplot(dd, aes(mean_acc, fill=factor(congruent))) +
  geom_histogram(bins=20) +
  labs(title = "Distribution of Mean Response Accuracy",
       x = "Mean Accuracy",
       y = "Frequency",
       fill = "Congruency") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.box.spacing = unit(0.2, "cm"),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        plot.title = element_text(hjust = 0.5))

g3 <- ggplot(dd, aes(x=as.factor(congruent), y=mean_rt)) +
  geom_boxplot() +
  labs(x = "Congruency",
       y = "Mean Response Time",
       fill = "Congruency") +
  theme_minimal()

g4 <- ggplot(dd, aes(x=as.factor(congruent), y=mean_acc)) +
  geom_boxplot() +
  labs(x = "Congruency",
       y = "Mean Response Accuracy",
       fill = "Congruency") +
  theme_minimal()

ggarrange(g1, g2, g3, g4, ncol=2, nrow=2)
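If you want the numbers behind these boxplots (an optional sketch using the dd table defined above, not required output), a grouped summary does the job:

# median of the per-subject means by congruency, matching the boxplots above
dd[, .(median_rt = median(mean_rt), median_acc = median(mean_acc)), by = congruent]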