ggplot2
Learn how to wrangle real data into a data.table and plot it using ggplot.
Get lots of hands-on practice with R, data.table, and ggplot.
In particular, practice using the following functions:
ggplot()
geom_point()
geom_line()
geom_bar()
geom_histogram()
geom_violin()
geom_boxplot()
Complete the experiment located at the following address:
https://run.pavlovia.org/demos/simplertt/
Be sure to enter your name or some other ID that you will remember and can be easily searched for.
Download your data in a .csv file by taking the following steps:
Go here: https://gitlab.pavlovia.org/demos/simplertt
Read about the experiment you just participated in by scrolling to the bottom and reading the README.md file.
Click the data folder to land at the following address: https://gitlab.pavlovia.org/demos/simplertt/tree/master/data
Near the top right of the page, click the Find file button, and search for a file containing the unique ID that you entered at the beginning of the experiment.
Click on the .csv file that pops up.
Finally, click the download button to download your .csv file to your local machine.
data.table
Load the data.table and ggplot2 libraries, and use rm to be sure you are starting with a clean workspace.
library(data.table)
library(ggplot2)
rm(list = ls())
Load your data into a data.table using the fread function from the data.table library.
# You need to replace the path I use here with a path that
# points to wherever you have your data stored.
d <- fread('https://crossley.github.io/book_stats/data/real_data/simpleRTT/data_mjc_simpleRTT_2022-02-26_05h13.22.csv')
We can see that d has quite a few columns by printing it to the console (which I won’t do here because the output is very ugly and I want to keep things clean). You, however, should do it, so you can see what I mean.
When data.table objects have lots of columns, str can be a good summary function to use for basic inspection.
str(d)
## Classes 'data.table' and 'data.frame': 28 obs. of 26 variables:
## $ response.keys : chr "space" "space" "space" "space" ...
## $ response.corr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ response.rt : num 0.306 0.273 0.283 0.317 0.3 0.286 0.272 0.3 0.308 0.284 ...
## $ mouseResp.x : num 0.228 0.228 0.228 0.228 0.228 ...
## $ mouseResp.y : num -0.334 -0.334 -0.334 -0.334 -0.334 ...
## $ mouseResp.leftButton : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mouseResp.midButton : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mouseResp.rightButton : int 0 0 0 0 0 0 0 0 0 0 ...
## $ practiceTrials.thisRepN : int 0 0 0 0 0 0 0 0 NA NA ...
## $ practiceTrials.thisTrialN: int 0 1 2 3 4 5 6 7 NA NA ...
## $ practiceTrials.thisN : int 0 1 2 3 4 5 6 7 NA NA ...
## $ practiceTrials.thisIndex : int 1 7 3 2 5 6 4 0 NA NA ...
## $ practiceTrials.ran : int 1 1 1 1 1 1 1 1 NA NA ...
## $ isi : num 1.22 2.56 1.67 1.44 2.11 ...
## $ participant : chr "mjc" "mjc" "mjc" "mjc" ...
## $ session : int 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "2022-02-26_05h13.22" "2022-02-26_05h13.22" "2022-02-26_05h13.22" "2022-02-26_05h13.22" ...
## $ expName : chr "simpleRTT" "simpleRTT" "simpleRTT" "simpleRTT" ...
## $ psychopyVersion : chr "2021.2.3" "2021.2.3" "2021.2.3" "2021.2.3" ...
## $ OS : chr "MacIntel" "MacIntel" "MacIntel" "MacIntel" ...
## $ frameRate : num 58.8 58.8 58.8 58.8 58.8 ...
## $ mainTrials.thisRepN : int NA NA NA NA NA NA NA NA 0 0 ...
## $ mainTrials.thisTrialN : int NA NA NA NA NA NA NA NA 0 1 ...
## $ mainTrials.thisN : int NA NA NA NA NA NA NA NA 0 1 ...
## $ mainTrials.thisIndex : int NA NA NA NA NA NA NA NA 5 8 ...
## $ mainTrials.ran : int NA NA NA NA NA NA NA NA 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
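When str is still more detail than you want, a few base R helpers give an even shorter overview. The following is a minimal sketch on a small stand-in table (the name toy and the choice of three columns are assumptions made for illustration; the same calls should work on d).

```r
library(data.table)

# Small stand-in for d (toy is a hypothetical name; only three columns are kept)
toy <- data.table(
  response.rt = c(0.306, 0.273, 0.283),
  isi         = c(1.22, 2.56, 1.67),
  participant = "mjc"
)

dim(toy)     # number of rows, then number of columns
names(toy)   # column names only

# One type per column, similar to the types that str reports
col_types <- sapply(toy, class)
col_types
```

Running dim(d) and names(d) on the real data is a quick way to confirm the row and column counts that str reports in its first line.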
It’s certainly difficult to know what all of these columns encode. This is something you will get used to as you build and run your own experiments (e.g., as you will in later COGS units). For now, I’ll just tell you. The data contains a row for every trial completed, including practice trials.
You can tell which rows correspond to practice trials and which correspond to experiment trials by examining the columns named practiceTrials.thisN and mainTrials.thisN.
d[, .(practiceTrials.thisN, mainTrials.thisN)]
## practiceTrials.thisN mainTrials.thisN
## <int> <int>
## 1: 0 NA
## 2: 1 NA
## 3: 2 NA
## 4: 3 NA
## 5: 4 NA
## 6: 5 NA
## 7: 6 NA
## 8: 7 NA
## 9: NA 0
## 10: NA 1
## 11: NA 2
## 12: NA 3
## 13: NA 4
## 14: NA 5
## 15: NA 6
## 16: NA 7
## 17: NA 8
## 18: NA 9
## 19: NA 10
## 20: NA 11
## 21: NA 12
## 22: NA 13
## 23: NA 14
## 24: NA 15
## 25: NA 16
## 26: NA 17
## 27: NA 18
## 28: NA 19
## practiceTrials.thisN mainTrials.thisN
We can see that mainTrials.thisN is NA during practice and practiceTrials.thisN is NA during the main experiment.
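The NA pattern can also be checked by counting rather than by reading the printed rows. Here is a minimal sketch on a toy table (toy and its values are hypothetical stand-ins for d) that counts the rows in each phase and confirms that no row belongs to both phases or to neither.

```r
library(data.table)

# Toy stand-in for d: 3 practice rows followed by 4 main rows (hypothetical values)
toy <- data.table(
  practiceTrials.thisN = c(0L, 1L, 2L, NA, NA, NA, NA),
  mainTrials.thisN     = c(NA, NA, NA, 0L, 1L, 2L, 3L)
)

# is.na() returns TRUE/FALSE, so summing its negation counts non-missing rows
n_practice <- toy[, sum(!is.na(practiceTrials.thisN))]
n_main     <- toy[, sum(!is.na(mainTrials.thisN))]

# A row is suspicious if both columns are NA or both are non-NA
n_ambiguous <- toy[, sum(is.na(practiceTrials.thisN) == is.na(mainTrials.thisN))]
```

Applied to d, the same three expressions should give 8, 20, and 0 if the printed output above is representative.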
We can also see that our reaction time per trial is stored in a column named response.rt and the inter-stimulus interval (isi) is stored in a column named isi.
response.rt is our main dependent variable of interest, and isi is an independent variable.
A dependent variable is a variable or outcome not controlled by the experimenter but rather observed in the experiment.
An independent variable is a variable or experiment factor controlled by the experimenter that may influence observations from a dependent variable.
You can also see that both response.rt and isi are continuous (as opposed to discrete or categorical) observations: numbers bounded between zero and positive infinity.
We will talk about these key terms in more depth coming up in lecture.
Okay, we are now equipped to pull out just the rows and columns that we need for a simple exploration of our performance.
# We begin by just looking at the main trials
d_main <- d[!is.na(mainTrials.thisN), .(response.rt, isi)]
# We don't have enough data points for a histogram to look
# very good, but it is generally a good place to start so
# you get a feel for the shape of your data.
ggplot(d_main, aes(x=response.rt)) +
geom_histogram(bins=30)
# The histogram shows that between 0.2 and 0.3 s our
# observations seem a bit bell-shaped. However, we can
# clearly see that a response time is never negative and is
# sometimes quite far away from the peak of the bell shape
# (where most of the observations are located).
# From the histogram, you can eye-ball a reasonable guess
# about the central tendency and spread of this sample, but
# it would also be very reasonable to dig a bit deeper with
# some additional plots.
ggplot(d_main, aes(x=0, y=response.rt)) +
geom_boxplot() +
geom_point() +
xlim(-1, 1)
# Finally, we report basic descriptive statistics as actual
# numbers for the main experiment
d_main[, .(rt_mean=mean(response.rt),
rt_median=median(response.rt),
rt_sd=sd(response.rt))]
## rt_mean rt_median rt_sd
## <num> <num> <num>
## 1: 0.31745 0.269 0.1530361
# Report basic descriptive statistics for the main
# experiment using `chaining` to avoid creating a new
# data.table
d[!is.na(mainTrials.thisN), .(response.rt, isi)][, .(rt_mean=mean(response.rt),
rt_median=median(response.rt),
rt_sd=sd(response.rt))]
## rt_mean rt_median rt_sd
## <num> <num> <num>
## 1: 0.31745 0.269 0.1530361
# You should check these numbers against the plots you've
# made to ensure they are all consistent with each other.
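One way to make that consistency check concrete is to draw the mean and median directly on top of the histogram. The sketch below assumes the 20 main-trial values of response.rt from this data set, typed in by hand; the names rt_main, rt_stats, and p are hypothetical, and the line types are an arbitrary choice.

```r
library(data.table)
library(ggplot2)

# Main-trial reaction times (in seconds) from this data set, entered by hand
rt_main <- data.table(response.rt = c(
  0.308, 0.284, 0.322, 0.267, 0.284, 0.266, 0.271, 0.251, 0.241, 0.267,
  0.302, 0.260, 0.250, 0.300, 0.334, 0.266, 0.236, 0.440, 0.939, 0.261
))

# Same descriptive statistics as computed above
rt_stats <- rt_main[, .(rt_mean   = mean(response.rt),
                        rt_median = median(response.rt))]

# Histogram with the mean (dashed) and the median (dotted) overlaid
p <- ggplot(rt_main, aes(x = response.rt)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = rt_stats$rt_mean, linetype = "dashed") +
  geom_vline(xintercept = rt_stats$rt_median, linetype = "dotted")
```

Printing p draws the figure. The dashed line should sit to the right of the dotted line, because the single slow trial near 0.94 s pulls the mean upward while leaving the median almost untouched.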
Add a column to d that indicates whether each row corresponds to a practice trial or to a main trial.
d[, phase := "Unknown"]
d[!is.na(practiceTrials.thisN), phase := "practice"]
d[!is.na(mainTrials.thisN), phase := "main"]
d[, .(phase, response.rt)]
## phase response.rt
## <char> <num>
## 1: practice 0.306
## 2: practice 0.273
## 3: practice 0.283
## 4: practice 0.317
## 5: practice 0.300
## 6: practice 0.286
## 7: practice 0.272
## 8: practice 0.300
## 9: main 0.308
## 10: main 0.284
## 11: main 0.322
## 12: main 0.267
## 13: main 0.284
## 14: main 0.266
## 15: main 0.271
## 16: main 0.251
## 17: main 0.241
## 18: main 0.267
## 19: main 0.302
## 20: main 0.260
## 21: main 0.250
## 22: main 0.300
## 23: main 0.334
## 24: main 0.266
## 25: main 0.236
## 26: main 0.440
## 27: main 0.939
## 28: main 0.261
## phase response.rt
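The three assignment steps can also be written as a single expression. This is a sketch of an alternative, not the approach used above: it relies on fcase from data.table (available in version 1.13.0 and later) and runs on a hypothetical toy table whose last row deliberately belongs to neither phase.

```r
library(data.table)

# Toy stand-in for d (hypothetical values); the last row has NA in both columns
toy <- data.table(
  practiceTrials.thisN = c(0L, 1L, NA, NA, NA),
  mainTrials.thisN     = c(NA, NA, 0L, 1L, NA)
)

# fcase checks the conditions in order and uses the first one that is TRUE;
# rows that match no condition receive the default value
toy[, phase := fcase(
  !is.na(practiceTrials.thisN), "practice",
  !is.na(mainTrials.thisN),     "main",
  default = "Unknown"
)]
```

Both versions give the same labels. The step-by-step version is arguably easier to read when you are starting out, while the single expression avoids modifying the column three times.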
We can now group descriptive statistics by phase as follows:
d[, .(rt_mean=mean(response.rt),
rt_median=median(response.rt),
rt_sd=sd(response.rt)),
.(phase)]
## phase rt_mean rt_median rt_sd
## <char> <num> <num> <num>
## 1: practice 0.292125 0.293 0.01615494
## 2: main 0.317450 0.269 0.15303611
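When comparing groups, it also helps to report how many observations each summary is based on. The sketch below rebuilds the phase and response.rt columns by hand from the values printed above (the names rt and rt_summary are hypothetical), spells out the grouping argument as by = phase, and adds a count using .N.

```r
library(data.table)

# phase and response.rt as printed above: 8 practice rows, then 20 main rows
rt <- data.table(
  phase = rep(c("practice", "main"), times = c(8, 20)),
  response.rt = c(
    0.306, 0.273, 0.283, 0.317, 0.300, 0.286, 0.272, 0.300,
    0.308, 0.284, 0.322, 0.267, 0.284, 0.266, 0.271, 0.251, 0.241, 0.267,
    0.302, 0.260, 0.250, 0.300, 0.334, 0.266, 0.236, 0.440, 0.939, 0.261
  )
)

# .N is the number of rows in each group; by = phase names the grouping explicitly
rt_summary <- rt[, .(n         = .N,
                     rt_mean   = mean(response.rt),
                     rt_median = median(response.rt)),
                 by = phase]
rt_summary
```

The counts make it clear that the practice summary rests on only 8 trials, which is worth keeping in mind when comparing the two rows.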
ggplot(d, aes(x=phase, y=response.rt)) +
geom_boxplot()
ggplot(d, aes(x=isi, y=response.rt)) +
geom_point()
It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here
These practice problems use the diamonds data set that is loaded automatically when you load ggplot2. By default it exists as a data.frame. You can convert it to a data.table with the following line:
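A minimal sketch of one way to do the conversion, assuming as.data.table is the intended function; the name d_diamonds is hypothetical, so use whatever name you prefer.

```r
library(data.table)
library(ggplot2)

# Convert the diamonds data set that ships with ggplot2 into a data.table
# (d_diamonds is a hypothetical name chosen for this sketch)
d_diamonds <- as.data.table(diamonds)

# Confirm the conversion worked before starting the practice problems
is.data.table(d_diamonds)
```

After the conversion, the square-bracket syntax used throughout this section, such as grouping by cut or clarity, works on d_diamonds.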
For each practice problem, replicate each ggplot figure and data.table output shown below exactly.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## cut mean_carat range_carat
## <ord> <num> <num>
## 1: Ideal 0.702837 3.3
## cut clarity N
## <ord> <ord> <int>
## 1: Ideal SI2 2598
## 2: Premium SI1 3575
## 3: Good VS1 648
## 4: Premium VS2 3357
## 5: Good SI2 1081
## 6: Very Good VVS2 1235
## 7: Very Good VVS1 789
## 8: Very Good SI1 3240
## 9: Fair VS2 261
## 10: Very Good VS1 1775
## 11: Good SI1 1560
## 12: Ideal VS1 3589
## 13: Premium SI2 2949
## 14: Premium I1 205
## 15: Very Good VS2 2591
## 16: Premium VS1 1989
## 17: Ideal SI1 4282
## 18: Good VS2 978
## 19: Very Good SI2 2100
## 20: Ideal VVS2 2606
## 21: Ideal VVS1 2047
## 22: Premium VVS1 616
## 23: Good VVS1 186
## 24: Premium VVS2 870
## 25: Fair SI2 466
## 26: Ideal VS2 5071
## 27: Good VVS2 286
## 28: Very Good I1 84
## 29: Fair SI1 408
## 30: Ideal IF 1212
## 31: Fair I1 210
## 32: Premium IF 230
## 33: Fair VVS1 17
## 34: Very Good IF 268
## 35: Fair VS1 170
## 36: Ideal I1 146
## 37: Good I1 96
## 38: Fair IF 9
## 39: Fair VVS2 69
## 40: Good IF 71
## cut clarity N
## cut clarity median_cty range_cty
## <ord> <ord> <num> <int>
## 1: Ideal SI2 599.0 670
## 2: Premium SI1 709.0 672
## 3: Good VS1 616.0 662
## 4: Premium VS2 734.0 665
## 5: Good SI2 593.0 662
## 6: Very Good VVS2 605.0 652
## 7: Very Good VVS1 642.0 659
## 8: Very Good SI1 660.5 656
## 9: Fair VS2 813.0 659
## 10: Very Good VS1 640.0 660
## 11: Good SI1 606.0 655
## 12: Ideal VS1 716.0 659
## 13: Premium SI2 631.0 652
## 14: Premium I1 835.0 608
## 15: Very Good VS2 650.0 644
## 16: Premium VS1 734.0 644
## 17: Ideal SI1 660.0 642
## 18: Good VS2 643.0 630
## 19: Very Good SI2 585.0 614
## 20: Ideal VVS2 775.5 586
## 21: Ideal VVS1 815.0 583
## 22: Premium VVS1 803.0 583
## 23: Good VVS1 724.5 591
## 24: Premium VVS2 796.0 554
## 25: Ideal VS2 734.0 632
## 26: Very Good I1 720.0 427
## 27: Ideal IF 886.0 531
## 28: Very Good IF 865.0 630
## 29: Fair VS1 735.0 611
## 30: Good VVS2 645.0 621
## 31: Premium IF 895.0 455
## 32: Fair SI1 775.0 498
## 33: Fair I1 893.0 408
## 34: Fair VVS2 750.0 628
## 35: Ideal I1 530.0 558
## 36: Good IF 827.0 496
## 37: Good I1 497.0 570
## 38: Fair SI2 871.0 439
## 39: Fair VVS1 790.0 135
## cut clarity median_cty range_cty