Get lots of hands-on practice with ggplot2.
Introduction to data visualisation and interpretation
You should pay very careful attention to learn about:
* Different types of graphs
* Aesthetic/customisation options
library(data.table)
library(ggplot2)
#Load in built-in data as a data.table
iris <- as.data.table(iris)
#Take a look at the data
#Some questions to ask yourself about the data:
# What are our variables?
# How many variables (eg. columns)?
# What type of variables (eg. continuous, factor, etc.)?
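One way to answer these questions is to inspect the table directly. The sketch below is a minimal example; the name iris_dt is chosen here only so that the built-in iris is not overwritten.

```r
library(data.table)

# iris_dt is an illustrative name for the data.table copy of the built-in data
iris_dt <- as.data.table(datasets::iris)

head(iris_dt)           # first six rows
names(iris_dt)          # variable (column) names
dim(iris_dt)            # number of rows and columns
sapply(iris_dt, class)  # type of each variable
```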
A). Create 2 separate histograms to look at the distribution of data for sepal length and sepal width. Set binwidth to 0.05.
#Histogram for sepal length
ggplot(data = iris, aes(x = Sepal.Length)) +
geom_histogram(binwidth = 0.05)
#Histogram for sepal width
ggplot(data = iris, aes(x = Sepal.Width)) +
geom_histogram(binwidth = 0.05)
B). Colour the Sepal Length and Sepal Width histograms based on Species (hint: do we use colour = or fill = ?)
What do you notice about the plots? How do sepal length and width vary depending on Species?
# Histogram for sepal length
ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
geom_histogram(binwidth = 0.05)
#Setosa species overall seem to be shorter in sepal length compared to Versicolor and Virginica
# Histogram for sepal width
ggplot(data = iris, aes(x = Sepal.Width, fill = Species)) +
geom_histogram(binwidth = 0.05)
#Setosa sepals overall seem to be wider than those of versicolor and virginica
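For bars, fill = sets the inside colour and colour = only sets the outline. By default geom_histogram() stacks the species on top of each other, which can make the individual distributions hard to compare. A possible alternative, shown here only as a sketch, is to overlay them with some transparency (p_overlay is an illustrative name):

```r
library(ggplot2)

# position = "identity" overlays the species instead of stacking them;
# alpha makes the overlapping bars semi-transparent
p_overlay <- ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.05, position = "identity", alpha = 0.5)
p_overlay
```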
C). We want to explore if there is any relationship between the sepal length and sepal width in the iris dataset. Make a scatterplot to see whether there is a potential trend between the 2 variables. Does anything in the plot stand out to you?
ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length)) +
geom_point()
#There are 2 main groups of data - could this separation potentially be due to another factor?
D). Colour the plot based on species. What do you notice about how sepal length and width differ between species?
ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length, colour = Species)) +
geom_point()
#Setosa sepals appear to be shorter and wider than those of the other species. However, we cannot be certain whether this difference is significant without running statistical tests.
E). Add a linear trendline for each of the species
ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
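The trendlines drawn by geom_smooth(method = "lm") correspond to a separate linear model per species. As a rough check, the same fits can be computed by hand; the names iris_dt, fits and pooled_slope below are illustrative.

```r
library(data.table)

iris_dt <- as.data.table(datasets::iris)

# Fit Sepal.Length ~ Sepal.Width separately within each species
fits <- iris_dt[, {
  fit <- lm(Sepal.Length ~ Sepal.Width, data = .SD)
  list(intercept = coef(fit)[[1]], slope = coef(fit)[[2]])
}, by = Species]
fits

# For comparison: a single fit that ignores species
pooled_slope <- coef(lm(Sepal.Length ~ Sepal.Width, data = iris_dt))[[2]]
pooled_slope
```

Within each species the slope is positive, while the pooled slope is negative, which is why grouping by species changes the apparent trend.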
A). How many observations do we have for each Species of iris? Create a bar graph to visualise this information.
ggplot(data = iris, aes(x = Species)) +
geom_bar()
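geom_bar() counts the rows in each group for us. The same counts can be checked numerically with data.table; iris_dt and counts are illustrative names in this sketch.

```r
library(data.table)

iris_dt <- as.data.table(datasets::iris)

# .N is the number of rows in each group
counts <- iris_dt[, .N, by = Species]
counts
```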
B). Create a boxplot graph to visualise the petal length of each Species. What do you notice?
ggplot(data = iris, aes(x = Species, y = Petal.Length)) +
geom_boxplot()
#Virginica petals are the longest, followed by Versicolor, then Setosa
#Setosa petals seem to be much shorter compared to the other 2
#But, cannot determine whether differences are significant or not without statistical testing
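As one possible follow-up, a Kruskal-Wallis test compares petal length across the three species without assuming normality. This is only a sketch; which test is appropriate depends on the assumptions you are willing to make, and kw is an illustrative name.

```r
# Kruskal-Wallis rank sum test of Petal.Length across Species
kw <- kruskal.test(Petal.Length ~ Species, data = iris)
kw
kw$p.value  # a small p-value suggests at least one species differs
```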
C). Create a violin graph to visualise the petal length of each Species. What else do you notice?
ggplot(data = iris, aes(x = Species, y = Petal.Length)) +
geom_violin()
#Setosa petal lengths are concentrated in a narrow range, while the other two species are more spread out
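Violin and box plots can also be layered, so the shape of the distribution and the median and quartiles are visible together. A minimal sketch (the width value and the name p_violin_box are arbitrary choices):

```r
library(ggplot2)

# A narrow boxplot drawn on top of each violin
p_violin_box <- ggplot(data = iris, aes(x = Species, y = Petal.Length)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
p_violin_box
```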
A). Create a scatterplot where the x axis represents sepal length, the y axis represents sepal width. Give the points a colour gradient according to petal length, whereby shorter lengths are blue and longer are red. Ensure the graph is labelled appropriately (eg. title, axes labels, legend label). Finally, change the theme of the graph to classic.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Petal.Length)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Sepal Dimensions Colored by Petal Length",
x = "Sepal Length",
y = "Sepal Width",
colour = "Petal Length") +
theme_classic()
B). Using the graph from the previous question, manually adjust the theme: move the legend below the graph and rotate the x axis tick labels by 45 degrees.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Petal.Length)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Sepal Dimensions Colored by Petal Length",
x = "Sepal Length",
y = "Sepal Width",
colour = "Petal Length") +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1)) # hjust = 1 keeps the rotated labels clear of the axis
C). First, create a histogram of sepal length. We then want to zoom in on a specific range of sepal length to focus on data points within that range. Customise the axis limits to zoom in on data that falls within c(6, 8). Adjust the binwidth accordingly. Ensure the graph is labelled appropriately (eg. title, axes labels, legend label).
ggplot(data = iris, aes(x = Sepal.Length)) +
geom_histogram(binwidth = 0.05) +
scale_x_continuous(limits = c(6, 8)) +
labs(title = "Distribution of Sepal Length",
x = "Sepal Length",
y = "Frequency")
## Warning: Removed 83 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale
## range (`geom_bar()`).
# xlim(6, 8) is shorthand for scale_x_continuous(limits = c(6, 8)): both drop observations outside the limits before the bins are computed, which is what the warnings above report
# To zoom in without discarding any data, use coord_cartesian(xlim = c(6, 8)) instead
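A sketch of zooming with coord_cartesian(), which changes only the visible range and leaves the underlying data and bin counts untouched. The binwidth of 0.1 and the name p_zoom are choices made for this example.

```r
library(ggplot2)

# coord_cartesian() zooms the view; no rows are removed before binning
p_zoom <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(6, 8)) +
  labs(title = "Distribution of Sepal Length (6 to 8)",
       x = "Sepal Length",
       y = "Frequency")
p_zoom
```

Because nothing is filtered out, this version should not produce the "Removed ... rows" warnings.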
A). We want to compare the relationship between Petal Length and Petal Width across different species. Using facet_wrap() to visualise, how does the relationship between petal length and width vary depending on species?
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
geom_point() +
facet_wrap(~Species) +
labs(title = "Petal Dimensions by Species",
x = "Petal Length",
y = "Petal Width")
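By default all facets share the same axis ranges, which makes sizes comparable across species but can squash the points within a panel. One possible variation, sketched below, frees the scales per panel and adds a linear trendline to each (p_facets is an illustrative name):

```r
library(ggplot2)

# scales = "free" gives each panel its own x and y range
p_facets <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~Species, scales = "free") +
  labs(title = "Petal Dimensions by Species",
       x = "Petal Length",
       y = "Petal Width")
p_facets
```

With free scales the axes differ between panels, so read the tick values before comparing species.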
Ensure all your code will be displayed in the output (hint: check {r})
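Assuming the document is knitted with knitr (the {r} chunk headers suggest R Markdown), code display is controlled by the echo chunk option. It can be set on a single chunk, for example {r, echo = TRUE}, or for every chunk at once in a setup chunk, as in this sketch:

```r
library(knitr)

# Show the source code of every chunk in the rendered output
opts_chunk$set(echo = TRUE)
```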
A). Create a histogram on petal length only for the versicolor species. Adjust binwidth and add labels accordingly.
ggplot(data = iris[Species == "versicolor"], aes(x = Petal.Length)) +
geom_histogram(binwidth = 0.05) +
labs(title = "Distribution of Petal Length for Versicolor",
x = "Petal Length",
y = "Frequency")
B). Create a boxplot graph on sepal length only for the setosa and virginica species. Add labels accordingly.
ggplot(data = iris[Species %in% c("setosa", "virginica")], aes(x = Species, y = Sepal.Length)) +
geom_boxplot() +
labs(title = "Distribution of Sepal Length for Setosa and Virginica",
x = "Species",
y = "Sepal Length")