Learning objectives

  • Learn how to wrangle real data into a data.table and plot it using ggplot2.

  • Get lots of hands-on practice with R, data.table, and ggplot2.

  • In particular, practice using the following functions:

    • ggplot()
    • geom_point()
    • geom_line()
    • geom_bar()
    • geom_histogram()
    • geom_violin()
    • geom_boxplot()

Work through these practice exercises

  • It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here

  • These practice problems use the diamonds data set that is loaded automatically when you load ggplot2. By default it exists as a data.frame. You can convert it to a data.table with the following line:
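One way to do the conversion (the name d is just a suggestion; use whatever name you prefer):

library(ggplot2)     # provides the diamonds data set
library(data.table)  # provides as.data.table() and data.table syntax

# diamonds loads as a data.frame; convert it to a data.table
d <- as.data.table(diamonds)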

  • For each practice problem, replicate each ggplot figure and data.table output shown below exactly.

1.

##      cut mean_carat range_carat
##    <ord>      <num>       <num>
## 1: Ideal   0.702837         3.3

2.

##           cut clarity     N
##         <ord>   <ord> <int>
##  1:     Ideal     SI2  2598
##  2:   Premium     SI1  3575
##  3:      Good     VS1   648
##  4:   Premium     VS2  3357
##  5:      Good     SI2  1081
##  6: Very Good    VVS2  1235
##  7: Very Good    VVS1   789
##  8: Very Good     SI1  3240
##  9:      Fair     VS2   261
## 10: Very Good     VS1  1775
## 11:      Good     SI1  1560
## 12:     Ideal     VS1  3589
## 13:   Premium     SI2  2949
## 14:   Premium      I1   205
## 15: Very Good     VS2  2591
## 16:   Premium     VS1  1989
## 17:     Ideal     SI1  4282
## 18:      Good     VS2   978
## 19: Very Good     SI2  2100
## 20:     Ideal    VVS2  2606
## 21:     Ideal    VVS1  2047
## 22:   Premium    VVS1   616
## 23:      Good    VVS1   186
## 24:   Premium    VVS2   870
## 25:      Fair     SI2   466
## 26:     Ideal     VS2  5071
## 27:      Good    VVS2   286
## 28: Very Good      I1    84
## 29:      Fair     SI1   408
## 30:     Ideal      IF  1212
## 31:      Fair      I1   210
## 32:   Premium      IF   230
## 33:      Fair    VVS1    17
## 34: Very Good      IF   268
## 35:      Fair     VS1   170
## 36:     Ideal      I1   146
## 37:      Good      I1    96
## 38:      Fair      IF     9
## 39:      Fair    VVS2    69
## 40:      Good      IF    71
##           cut clarity     N

3.

##           cut clarity median_cty range_cty
##         <ord>   <ord>      <num>     <int>
##  1:     Ideal     SI2      599.0       670
##  2:   Premium     SI1      709.0       672
##  3:      Good     VS1      616.0       662
##  4:   Premium     VS2      734.0       665
##  5:      Good     SI2      593.0       662
##  6: Very Good    VVS2      605.0       652
##  7: Very Good    VVS1      642.0       659
##  8: Very Good     SI1      660.5       656
##  9:      Fair     VS2      813.0       659
## 10: Very Good     VS1      640.0       660
## 11:      Good     SI1      606.0       655
## 12:     Ideal     VS1      716.0       659
## 13:   Premium     SI2      631.0       652
## 14:   Premium      I1      835.0       608
## 15: Very Good     VS2      650.0       644
## 16:   Premium     VS1      734.0       644
## 17:     Ideal     SI1      660.0       642
## 18:      Good     VS2      643.0       630
## 19: Very Good     SI2      585.0       614
## 20:     Ideal    VVS2      775.5       586
## 21:     Ideal    VVS1      815.0       583
## 22:   Premium    VVS1      803.0       583
## 23:      Good    VVS1      724.5       591
## 24:   Premium    VVS2      796.0       554
## 25:     Ideal     VS2      734.0       632
## 26: Very Good      I1      720.0       427
## 27:     Ideal      IF      886.0       531
## 28: Very Good      IF      865.0       630
## 29:      Fair     VS1      735.0       611
## 30:      Good    VVS2      645.0       621
## 31:   Premium      IF      895.0       455
## 32:      Fair     SI1      775.0       498
## 33:      Fair      I1      893.0       408
## 34:      Fair    VVS2      750.0       628
## 35:     Ideal      I1      530.0       558
## 36:      Good      IF      827.0       496
## 37:      Good      I1      497.0       570
## 38:      Fair     SI2      871.0       439
## 39:      Fair    VVS1      790.0       135
##           cut clarity median_cty range_cty

4.

T-maze learning example

In Tutorial 2 you performed a simple analysis of data from a T-maze learning experiment. We will now use the same dataset again, but this time focus on visualisation.

The data used in this tutorial can be downloaded from:

https://github.com/crossley/cogs2020/tree/main/tutorials

Download the repository (or just the tutorials folder) and make sure the data folder is located in your current working directory before continuing.


a. Repeat the data processing and analysis steps performed in Tutorial 2:

  • Load the data using fread
  • (For now) keep only Experiment 1
  • Keep only the columns cage_context, rat_id, and maze_run_time
  • Compute mean maze_run_time grouped by cage_context and rat_id
  • Store the result in a data.table with columns
cage_context | rat_id | maze_run_time_mean

You should already know how to do this from the previous tutorial. If not, revisit Tutorial 2 before continuing.

For the remainder of this tutorial, assume this processed data exists and is named:

ans_8d
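If you need a refresher, the steps above can be sketched roughly as follows. Note that the file name "data/t_maze.csv" and the column name experiment are placeholders; substitute the actual file name and column names from the data you downloaded.

library(data.table)

# NOTE: "data/t_maze.csv" is a placeholder -- use the actual file
# from the data folder in your working directory
d <- fread("data/t_maze.csv")

# keep only Experiment 1 (assuming a column named `experiment`)
d <- d[experiment == 1]

# keep only the columns of interest
d <- d[, .(cage_context, rat_id, maze_run_time)]

# mean maze_run_time per rat within each cage context
ans_8d <- d[, .(maze_run_time_mean = mean(maze_run_time)),
            by = .(cage_context, rat_id)]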

b. A good first step in any analysis is to examine the distribution of the dependent variable.

Create a histogram of maze_run_time_mean.

library(ggplot2)

ggplot(ans_8d, aes(x = maze_run_time_mean)) +
  geom_histogram(bins = 30)

Questions to consider:

  • Is the distribution symmetric?
  • Are there extreme values?
  • Does the scale – i.e., the range of maze_run_time_mean values – make sense given the task?
  • Are there outliers (i.e., observed maze_run_time_mean values that do not reflect a genuine running or choice process but likely instead reflect some other process such as distraction or recording error)?
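To complement the visual check, you can also answer these questions numerically. A common rule of thumb (one convention among several) flags values more than 1.5 interquartile ranges outside the quartiles as potential outliers:

# five-number summary of the dependent variable
summary(ans_8d$maze_run_time_mean)

# flag rows outside 1.5 IQRs of the quartiles (a common convention)
q <- quantile(ans_8d$maze_run_time_mean, c(0.25, 0.75))
iqr <- q[2] - q[1]
ans_8d[maze_run_time_mean < q[1] - 1.5 * iqr |
       maze_run_time_mean > q[2] + 1.5 * iqr]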

c. We are usually interested in whether conditions differ.

Create a boxplot comparing mean maze-run time across contexts.

ggplot(ans_8d, aes(x = cage_context, y = maze_run_time_mean)) +
  geom_boxplot()

Add points showing individual rats.

ggplot(ans_8d, aes(x = cage_context, y = maze_run_time_mean)) +
  geom_boxplot() +
  geom_jitter(width = 0.1)

Questions to consider:

  • Do the contexts look different? Do they have different medians? Different variability? Different distribution shapes?
  • Is there large variability between rats?
  • Are differences consistent or driven by outliers?

d. Finally, visualise the context means using a bar plot with error bars showing the standard error of the mean. Note that the standard error of the mean is one very common way to quantify variability in a sample. This will soon be covered extensively in lectures. For now, the key point to notice is that where a box plot or histogram explicitly shows the distribution of data, a bar plot with error bars instead shows only a summary of the data (i.e., the mean and standard error of the mean). The bar plot is also far more common in published papers. Is this for better or for worse?

First, create a data.table containing the mean and standard error of maze_run_time_mean for each cage_context. Name the columns mean_run and se_run.

ans_summary <- ans_8d[
  , .(mean_run = mean(maze_run_time_mean),
      se_run = sd(maze_run_time_mean) / sqrt(.N)),
  by = cage_context
]

Next, create the bar plot.

ggplot(ans_summary, aes(x = cage_context, y = mean_run)) +
  geom_bar(stat = "identity") +
  geom_errorbar(
    aes(ymin = mean_run - se_run,
        ymax = mean_run + se_run),
    width = 0.05
  )
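As an aside, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity") and produces the same figure:

library(ggplot2)

# equivalent bar plot using geom_col() instead of
# geom_bar(stat = "identity")
ggplot(ans_summary, aes(x = cage_context, y = mean_run)) +
  geom_col() +
  geom_errorbar(
    aes(ymin = mean_run - se_run,
        ymax = mean_run + se_run),
    width = 0.05
  )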

Questions to consider:

  • Do the contexts look different? Do they have different means?
  • Are any differences in means you might see large relative to the variability in the data (i.e., the size of the error bars)?