COGS2020-WEEK2-TUTORIAL

Learning objectives

Get lots of hands-on practice with data.table :)

You should pay very careful attention to learn about: * data.table chaining * Adding columns using := * Selecting rows with i, operating on columns with j, and grouping with by using data.table syntax [i, j, by]

Work through these practice exercises

# load data.table library
#install.packages("data.table") # only if you have not installed data.table before

library(data.table)

# clean session
rm(list=ls())

# Please include the following line of code as well
set.seed(0)

# create vectors from which to build d
V1 <- rep(month.abb[1:8], 2) #month.abb is a built in vector
V2 <- rep(letters[1:8], 2) #letters is a built in vector
V3 <- rnorm(length(V1))

# create a data.table named d
d <- data.table(month=V1, event=V2, obs=V3)

Reminder on data.table syntax dt[i,j,by] datatable[row, column, grouping]

1. Select rows

A). Use the data.table keyword .N to report how many rows are in d and store the result in a variable named ans_1a.

B). Return a data.table containing only the rows of d for which month=="Feb" and event=="c" and store the result in a variable named ans_1b.

C). Using the %in% operator, return a data.table containing only the rows of d for which month is in the vector c("Jan", "Mar", "May", "Jul"), and store the result in a variable named ans_1c.

2. Select and operate on columns

A). Return a vector containing the month column of d and store the result in a variable named ans_2a.

B). Return a data.table containing only the month and event columns of d and store the result in a variable named ans_2b.

C). Return the mean and standard deviation of the obs column across all rows in d and store the result in a data.table named ans_2c. The resulting data.table should have 1 observation with columns named obs_mean and obs_sd.

datatable[row, column, group] ## 3. Group by

A). Return the mean obs grouped by event and grouped by month and return the result in a data.table named ans_3a. Name the column of ans_3a that contains the mean of the obs column obs_mean.

4. [i, j, by] or alternatively [row, column, grouping]

A). Select the first 8 rows of d and compute the median value of the obs column grouped by month and event and return the result in a data.table named ans_4a. Name the column of ans_4a that contains the median of the obs column obs_median.

5. Add / remove columns, copy, etc.

D <- data.table(EmployeeID = 101:105, 
                    Department = c("HR", "IT", "Finance", "Marketing", "Sales"))

A). Add a new column called Salary with values (50000, 60000, 55000, 52000, 58000)

B). Increase salary by 10%

C). Create a new column called Bonus such that: If Salary is greater than or equal to 55000, Bonus should be “High” Otherwise, Bonus should be “Low”

D). create a new column EmployeesPerDept that stores the number of employees in each Department.

6. Joins in data.table

products <- data.table(ProductID = 101:105, 
                       ProductName = c("Laptop", "Phone", "Tablet", "Monitor", "Keyboard"))
sales <- data.table(ProductID = c(102, 103, 105, 106), 
                    UnitsSold = c(50, 30, 20, 40))

A). Perform a left join to merge products with sales by ProductID, ensuring all rows from products are retained. Store the result in ans_6a.

B). Perform a right join to merge products with sales, ensuring all rows from sales are retained. Store the result in ans_6b.

7. Reshaping data.table

D <- data.table(Student = c("A", "B", "C"), Math = c(85, 78, 92), Science = c(90, 82, 95), English = c(88, 80, 94))

A). Convert d from wide format to long format, where: Student remains as the identifier column (id.vars). The column names Math, Science, and English should be stored in a column called “Subject”. The values should be stored in a column named “Score”. Store the result in ans_7a.

8. Chaining

DT <- data.table(
  ID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Score = c(85, 92, 88, 94, 90)
)

A). Sort DT by Score in descending order. Select the top 3 rows. Compute the maximum score in these rows and store it in a new column/datatable “TopScore”. Store the result in ans_8a.

B). Filter DT to include only rows where Score >= 90. Select only the columns Name and Score. Sort the result in descending order of Score. Store the result in ans_8b

C). Compute the mean Score of all employees, individually. Create a new column AdjustedScore where each score is increased by 5% of the mean score. Store the result in ans_8c.

D). Group DT by Name. Compute the average score for each person. Sort the result in descending order of AverageScore. Store the result in ans_8d