ggplot2
Learn how to wrangle real data into a data.table and plot it using ggplot.
Get lots of hands-on practice with R, data.table, and ggplot.
In particular, practice using the following functions:
- ggplot()
- geom_point()
- geom_line()
- geom_bar()
- geom_histogram()
- geom_violin()
- geom_boxplot()
It’s a good idea to work through these on your own, but if you get very stuck, solutions are available.
These practice problems use the diamonds data set
that is loaded automatically when you load ggplot2. By
default it exists as a data.frame. You can convert it to a
data.table with the following line:
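One way to write that conversion is sketched here. The variable name `d` is an assumption made for illustration; the exercises do not fix a name.

```r
library(data.table)
library(ggplot2)  # loading ggplot2 makes the diamonds data set available

# Copy diamonds into a data.table; `d` is a hypothetical name
d <- as.data.table(diamonds)

# Confirm the class and take a quick look at the first rows
class(d)
head(d)
```

`as.data.table()` returns a new object, so the original `diamonds` data set is left unchanged.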
For each practice problem, replicate each ggplot
figure and data.table output shown below
exactly.
##       cut mean_carat range_carat
##     <ord>      <num>       <num>
## 1:  Ideal   0.702837         3.3
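Output of this shape usually comes from filtering rows in `i` and summarising in `j`. The following is a sketch of that pattern rather than the intended solution; it assumes the diamonds data have been copied into a data.table under the hypothetical name `d`, and the result name `ideal_summary` is also made up.

```r
library(data.table)
library(ggplot2)

d <- as.data.table(diamonds)  # `d` is a hypothetical name

# Keep only Ideal-cut diamonds, then compute the mean carat and the
# difference between the largest and smallest carat within that cut
ideal_summary <- d[cut == "Ideal",
                   .(mean_carat  = mean(carat),
                     range_carat = max(carat) - min(carat)),
                   by = cut]
ideal_summary
```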
##           cut clarity     N
##         <ord>   <ord> <int>
##  1:     Ideal     SI2  2598
##  2:   Premium     SI1  3575
##  3:      Good     VS1   648
##  4:   Premium     VS2  3357
##  5:      Good     SI2  1081
##  6: Very Good    VVS2  1235
##  7: Very Good    VVS1   789
##  8: Very Good     SI1  3240
##  9:      Fair     VS2   261
## 10: Very Good     VS1  1775
## 11:      Good     SI1  1560
## 12:     Ideal     VS1  3589
## 13:   Premium     SI2  2949
## 14:   Premium      I1   205
## 15: Very Good     VS2  2591
## 16:   Premium     VS1  1989
## 17:     Ideal     SI1  4282
## 18:      Good     VS2   978
## 19: Very Good     SI2  2100
## 20:     Ideal    VVS2  2606
## 21:     Ideal    VVS1  2047
## 22:   Premium    VVS1   616
## 23:      Good    VVS1   186
## 24:   Premium    VVS2   870
## 25:      Fair     SI2   466
## 26:     Ideal     VS2  5071
## 27:      Good    VVS2   286
## 28: Very Good      I1    84
## 29:      Fair     SI1   408
## 30:     Ideal      IF  1212
## 31:      Fair      I1   210
## 32:   Premium      IF   230
## 33:      Fair    VVS1    17
## 34: Very Good      IF   268
## 35:      Fair     VS1   170
## 36:     Ideal      I1   146
## 37:      Good      I1    96
## 38:      Fair      IF     9
## 39:      Fair    VVS2    69
## 40:      Good      IF    71
##           cut clarity     N
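A count per combination of two grouping columns can be obtained with `.N`. The sketch below shows the general pattern only; the names `d` and `counts` are assumptions made for illustration.

```r
library(data.table)
library(ggplot2)

d <- as.data.table(diamonds)  # `d` is a hypothetical name

# .N is the number of rows in each group; grouping by two columns
# gives one row per cut-by-clarity combination that occurs in the data
counts <- d[, .N, by = .(cut, clarity)]
counts
```

Groups appear in the order in which they are first encountered in the data, which is why the rows are not sorted by `cut` or `clarity`.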
##           cut clarity median_cty range_cty
##         <ord>   <ord>      <num>     <int>
##  1:     Ideal     SI2      599.0       670
##  2:   Premium     SI1      709.0       672
##  3:      Good     VS1      616.0       662
##  4:   Premium     VS2      734.0       665
##  5:      Good     SI2      593.0       662
##  6: Very Good    VVS2      605.0       652
##  7: Very Good    VVS1      642.0       659
##  8: Very Good     SI1      660.5       656
##  9:      Fair     VS2      813.0       659
## 10: Very Good     VS1      640.0       660
## 11:      Good     SI1      606.0       655
## 12:     Ideal     VS1      716.0       659
## 13:   Premium     SI2      631.0       652
## 14:   Premium      I1      835.0       608
## 15: Very Good     VS2      650.0       644
## 16:   Premium     VS1      734.0       644
## 17:     Ideal     SI1      660.0       642
## 18:      Good     VS2      643.0       630
## 19: Very Good     SI2      585.0       614
## 20:     Ideal    VVS2      775.5       586
## 21:     Ideal    VVS1      815.0       583
## 22:   Premium    VVS1      803.0       583
## 23:      Good    VVS1      724.5       591
## 24:   Premium    VVS2      796.0       554
## 25:     Ideal     VS2      734.0       632
## 26: Very Good      I1      720.0       427
## 27:     Ideal      IF      886.0       531
## 28: Very Good      IF      865.0       630
## 29:      Fair     VS1      735.0       611
## 30:      Good    VVS2      645.0       621
## 31:   Premium      IF      895.0       455
## 32:      Fair     SI1      775.0       498
## 33:      Fair      I1      893.0       408
## 34:      Fair    VVS2      750.0       628
## 35:     Ideal      I1      530.0       558
## 36:      Good      IF      827.0       496
## 37:      Good      I1      497.0       570
## 38:      Fair     SI2      871.0       439
## 39:      Fair    VVS1      790.0       135
##           cut clarity median_cty range_cty
In Tutorial 2 you performed a simple analysis of data from a T-maze learning experiment. We will now use the same dataset again, but this time focus on visualisation.
The data used in this tutorial can be downloaded from:
https://github.com/crossley/cogs2020/tree/main/tutorials
Download the repository (or just the tutorials folder)
and make sure the data folder is located in your current
working directory before continuing.
a. Repeat the data processing and analysis steps performed in Tutorial 2:
- Load the data using fread
- Keep only the columns cage_context, rat_id, and maze_run_time
- Compute the mean maze_run_time grouped by cage_context and rat_id
- Store the result in a data.table with columns cage_context | rat_id | maze_run_time_mean
You should already know how to do this from the previous tutorial. If not, revisit Tutorial 2 before continuing.
For the remainder of this tutorial, assume this processed data exists and is named:
ans_8d
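For orientation, the processing steps in part a can be sketched as follows. This is only an illustration under stated assumptions: the file name is not given here, so the sketch simulates a small stand-in table instead of calling `fread()`, and the context labels, rat IDs, run times, and the extra column are all made up.

```r
library(data.table)

# In the tutorial the raw data would be read with fread(), e.g.
#   raw <- fread("data/<file>.csv")   # <file> is a placeholder, not a real name
# To keep this sketch runnable, simulate a small stand-in instead.
set.seed(1)
raw <- data.table(
  cage_context  = rep(c("context_a", "context_b"), each = 30),  # made-up labels
  rat_id        = rep(1:6, each = 10),                          # made-up IDs
  maze_run_time = rexp(60, rate = 1 / 20),                      # made-up times
  other_column  = rnorm(60)                                     # stands in for unused columns
)

# Keep only the three columns of interest
dat <- raw[, .(cage_context, rat_id, maze_run_time)]

# One mean run time per rat within each context
ans_8d <- dat[, .(maze_run_time_mean = mean(maze_run_time)),
              by = .(cage_context, rat_id)]
ans_8d
```

With the real data, only the `raw <- ...` line would change; the selection and grouping steps stay the same.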
b. A good first step in any analysis is to examine the distribution of the dependent variable.
Create a histogram of maze_run_time_mean.
library(ggplot2)
ggplot(ans_8d, aes(x = maze_run_time_mean)) +
  geom_histogram(bins = 30)
Questions to consider:
- Do the maze_run_time_mean values make sense given the task?
- Are there any maze_run_time_mean values that do not reflect a genuine running or choice process but instead likely reflect some other process, such as distraction or recording error?
c. We are usually interested in whether conditions differ.
Create a boxplot comparing mean maze-run time across contexts.
ggplot(ans_8d, aes(x = cage_context, y = maze_run_time_mean)) +
  geom_boxplot()
Add points showing individual rats.
ggplot(ans_8d, aes(x = cage_context, y = maze_run_time_mean)) +
  geom_boxplot() +
  geom_jitter(width = 0.1)
Questions to consider:
d. Finally, visualise the context means using a bar plot with error bars showing the standard error of the mean. Note that the standard error of the mean is one very common way to quantify variability in a sample. This will soon be covered extensively in lectures. For now, the key point to notice is that where a box plot or histogram explicitly shows the distribution of data, a bar plot with error bars instead shows only a summary of the data (i.e., the mean and standard error of the mean). The bar plot is also far more common in published papers. Is this for better or for worse?
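As a quick illustration of the quantity described above, the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The sketch below uses a made-up toy sample, not the T-maze data, and the names `x`, `n`, and `sem` are chosen only for this example.

```r
# Toy sample (made-up numbers, not the T-maze data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n   <- length(x)
sem <- sd(x) / sqrt(n)  # sample standard deviation divided by sqrt(n)

mean(x)  # the height of the bar
sem      # the distance from the bar top to each end of the error bar
```

The eight individual values are reduced to just two numbers, which is exactly the information a bar plot with error bars displays.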
First, create a data.table containing the mean and
standard error of maze_run_time_mean for each
cage_context. Name the columns mean_run and
se_run.
ans_summary <- ans_8d[
  , .(mean_run = mean(maze_run_time_mean),
      se_run   = sd(maze_run_time_mean) / sqrt(.N)),
  by = cage_context
]
Next, create the bar plot.
# stat = "identity" plots the values in mean_run as given (geom_col() is equivalent)
ggplot(ans_summary, aes(x = cage_context, y = mean_run)) +
  geom_bar(stat = "identity") +
  geom_errorbar(
    aes(ymin = mean_run - se_run,
        ymax = mean_run + se_run),
    width = 0.05
  )
Questions to consider: