Learning objectives

  • Learn how to wrangle real data into a data.table and plot it using ggplot2.

  • Get lots of hands-on practice with R, data.table, and ggplot2.

  • In particular, practice using the following functions:

    • ggplot()
    • geom_point()
    • geom_line()
    • geom_bar()
    • geom_histogram()
    • geom_violin()
    • geom_boxplot()

Work through these practice exercises

  • It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here

  • These practice problems use the diamonds data set that is loaded automatically when you load ggplot2. By default it exists as a data.frame. You can convert it to a data.table with the following line:
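One way to do the conversion (the name d is just a suggestion; use whatever name you prefer):

library(ggplot2)     # provides the diamonds data set
library(data.table)  # provides as.data.table() and data.table syntax

# diamonds loads as a data.frame; convert it to a data.table
d <- as.data.table(diamonds)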

  • For each practice problem, replicate each ggplot figure and data.table output shown below exactly.

1.

##      cut mean_carat range_carat
##    <ord>      <num>       <num>
## 1: Ideal   0.702837         3.3

2.

##           cut clarity     N
##         <ord>   <ord> <int>
##  1:     Ideal     SI2  2598
##  2:   Premium     SI1  3575
##  3:      Good     VS1   648
##  4:   Premium     VS2  3357
##  5:      Good     SI2  1081
##  6: Very Good    VVS2  1235
##  7: Very Good    VVS1   789
##  8: Very Good     SI1  3240
##  9:      Fair     VS2   261
## 10: Very Good     VS1  1775
## 11:      Good     SI1  1560
## 12:     Ideal     VS1  3589
## 13:   Premium     SI2  2949
## 14:   Premium      I1   205
## 15: Very Good     VS2  2591
## 16:   Premium     VS1  1989
## 17:     Ideal     SI1  4282
## 18:      Good     VS2   978
## 19: Very Good     SI2  2100
## 20:     Ideal    VVS2  2606
## 21:     Ideal    VVS1  2047
## 22:   Premium    VVS1   616
## 23:      Good    VVS1   186
## 24:   Premium    VVS2   870
## 25:      Fair     SI2   466
## 26:     Ideal     VS2  5071
## 27:      Good    VVS2   286
## 28: Very Good      I1    84
## 29:      Fair     SI1   408
## 30:     Ideal      IF  1212
## 31:      Fair      I1   210
## 32:   Premium      IF   230
## 33:      Fair    VVS1    17
## 34: Very Good      IF   268
## 35:      Fair     VS1   170
## 36:     Ideal      I1   146
## 37:      Good      I1    96
## 38:      Fair      IF     9
## 39:      Fair    VVS2    69
## 40:      Good      IF    71
##           cut clarity     N

3.

##           cut clarity median_cty range_cty
##         <ord>   <ord>      <num>     <int>
##  1:     Ideal     SI2      599.0       670
##  2:   Premium     SI1      709.0       672
##  3:      Good     VS1      616.0       662
##  4:   Premium     VS2      734.0       665
##  5:      Good     SI2      593.0       662
##  6: Very Good    VVS2      605.0       652
##  7: Very Good    VVS1      642.0       659
##  8: Very Good     SI1      660.5       656
##  9:      Fair     VS2      813.0       659
## 10: Very Good     VS1      640.0       660
## 11:      Good     SI1      606.0       655
## 12:     Ideal     VS1      716.0       659
## 13:   Premium     SI2      631.0       652
## 14:   Premium      I1      835.0       608
## 15: Very Good     VS2      650.0       644
## 16:   Premium     VS1      734.0       644
## 17:     Ideal     SI1      660.0       642
## 18:      Good     VS2      643.0       630
## 19: Very Good     SI2      585.0       614
## 20:     Ideal    VVS2      775.5       586
## 21:     Ideal    VVS1      815.0       583
## 22:   Premium    VVS1      803.0       583
## 23:      Good    VVS1      724.5       591
## 24:   Premium    VVS2      796.0       554
## 25:     Ideal     VS2      734.0       632
## 26: Very Good      I1      720.0       427
## 27:     Ideal      IF      886.0       531
## 28: Very Good      IF      865.0       630
## 29:      Fair     VS1      735.0       611
## 30:      Good    VVS2      645.0       621
## 31:   Premium      IF      895.0       455
## 32:      Fair     SI1      775.0       498
## 33:      Fair      I1      893.0       408
## 34:      Fair    VVS2      750.0       628
## 35:     Ideal      I1      530.0       558
## 36:      Good      IF      827.0       496
## 37:      Good      I1      497.0       570
## 38:      Fair     SI2      871.0       439
## 39:      Fair    VVS1      790.0       135
##           cut clarity median_cty range_cty

4.

T-maze learning example

In Tutorial 2 you performed a simple analysis of data from a T-maze learning experiment. We will now use the same dataset again, but this time focus on visualisation.

The data used in this tutorial can be downloaded from:

https://github.com/crossley/cogs2020/tree/main/tutorials

Download the repository (or just the tutorials folder) and make sure the data folder is located in your current working directory before continuing.


a. Repeat the data processing and analysis steps performed in Tutorial 2:

  • Load the data using fread
  • (For now) keep only Experiment 1
  • Keep only the columns cage_context, rat_id, and maze_run_time
  • Compute mean maze_run_time grouped by cage_context and rat_id
  • Store the result in a data.table with columns
cage_context | rat_id | maze_run_time_mean

You should already know how to do this from the previous tutorial. If not, revisit Tutorial 2 before continuing.

For the remainder of this tutorial, assume this processed data exists and is named:

ans_8d
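If you need a refresher, the steps above can be sketched roughly as follows. Note that the file name "data/t_maze.csv" and the column name experiment are placeholders; substitute the actual file name and column names from the data you downloaded.

library(data.table)

# NOTE: "data/t_maze.csv" is a placeholder -- use the actual file
# from the data folder in your working directory
d <- fread("data/t_maze.csv")

# keep only Experiment 1 (assuming a column named `experiment`)
d <- d[experiment == 1]

# keep only the columns of interest
d <- d[, .(cage_context, rat_id, maze_run_time)]

# mean maze_run_time per rat within each cage context
ans_8d <- d[, .(maze_run_time_mean = mean(maze_run_time)),
            by = .(cage_context, rat_id)]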

b. A good first step in any analysis is to examine the distribution of the dependent variable.

Create a histogram of maze_run_time_mean.

library(ggplot2)

ggplot(ans_8d, aes(x = maze_run_time_mean)) +
  geom_histogram(bins = 30)

Questions to consider:

  • Is the distribution symmetric?
  • Are there extreme values?
  • Does the scale – i.e., the range of maze_run_time_mean values – make sense given the task?
  • Are there outliers (i.e., observed maze_run_time_mean values that do not reflect a genuine running or choice process but likely instead reflect some other process such as distraction or recording error)?
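To complement the visual check, you can also answer these questions numerically. A common rule of thumb (one convention among several) flags values more than 1.5 interquartile ranges outside the quartiles as potential outliers:

# five-number summary of the dependent variable
summary(ans_8d$maze_run_time_mean)

# flag rows outside 1.5 IQRs of the quartiles (a common convention)
q <- quantile(ans_8d$maze_run_time_mean, c(0.25, 0.75))
iqr <- q[2] - q[1]
ans_8d[maze_run_time_mean < q[1] - 1.5 * iqr |
       maze_run_time_mean > q[2] + 1.5 * iqr]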

c. We are usually interested in whether conditions differ.

Create a boxplot comparing mean maze-run time across contexts.

ggplot(ans_8d, aes(x = cage_context, y = maze_run_time_mean)) +
  geom_boxplot()

Add points showing individual rats.

ggplot(ans_8d, aes(x = cage_context, y = maze_run_time_mean)) +
  geom_boxplot() +
  geom_jitter(width = 0.1)

Questions to consider:

  • Do the contexts look different? Do they have different medians? Different variability? Different distribution shapes?
  • Is there large variability between rats?
  • Are differences consistent or driven by outliers?

d. Finally, visualise the context means using a bar plot with error bars showing the standard error of the mean. Note that the standard error of the mean is one very common way to quantify variability in a sample. This will soon be covered extensively in lectures. For now, the key point to notice is that where a box plot or histogram explicitly shows the distribution of data, a bar plot with error bars instead shows only a summary of the data (i.e., the mean and standard error of the mean). The bar plot is also far more common in published papers. Is this for better or for worse?

First, create a data.table containing the mean and standard error of maze_run_time_mean for each cage_context. Name the columns mean_run and se_run.

ans_summary <- ans_8d[
  , .(mean_run = mean(maze_run_time_mean),
      se_run = sd(maze_run_time_mean) / sqrt(.N)),
  by = cage_context
]

Next, create the bar plot.

ggplot(ans_summary, aes(x = cage_context, y = mean_run)) +
  geom_bar(stat = "identity") +
  geom_errorbar(
    aes(ymin = mean_run - se_run,
        ymax = mean_run + se_run),
    width = 0.05
  )
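As an aside, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity") and produces the same figure:

library(ggplot2)

# equivalent bar plot using geom_col() instead of
# geom_bar(stat = "identity")
ggplot(ans_summary, aes(x = cage_context, y = mean_run)) +
  geom_col() +
  geom_errorbar(
    aes(ymin = mean_run - se_run,
        ymax = mean_run + se_run),
    width = 0.05
  )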

Questions to consider:

  • Do the contexts look different? Do they have different means?
  • Are any differences in means you might see large relative to the variability in the data (i.e., the size of the error bars)?