Reshaping Data with data.table

2024

Introduction to Data Reshaping

Reshaping data involves changing how your data is organised in whatever sort of container you are using to store it. For us, this container will almost always be a data.table Luckily, the data.table package in R provides efficient functions for reshaping.

Wide Format data.table Example

## Warning: package 'data.table' was built under R version 4.3.3

##    Subject Congruent Incongruent
##      <num>     <num>       <num>
## 1:       1     0.512       0.710
## 2:       2     0.469       0.689
## 3:       3     0.482       0.725

Each column represents a different variable or condition (e.g., “Congruent” and “Incongruent” response times).

Long Format data.table Example

##    Subject   Condition ResponseTime
##      <num>      <fctr>        <num>
## 1:       1   Congruent        0.512
## 2:       2   Congruent        0.469
## 3:       3   Congruent        0.482
## 4:       1 Incongruent        0.710
## 5:       2 Incongruent        0.689
## 6:       3 Incongruent        0.725

Subject: Identifier for each subject.
Condition: Specifies the condition under which the measurement was taken (e.g., “Congruent” or “Incongruent”).
ResponseTime: The measured response time for the condition.

Why long or wide format?

Wide format can look bad.

##    subject congruent_red_squares congruent_red_circles congruent_green_squares
##      <num>                 <num>                 <num>                   <num>
## 1:       1            0.09036462             0.4576625               0.8395739
## 2:       2            0.90471847             0.2707995               0.4854740
## 3:       3            0.48763414             0.8476232               0.1277650
##    congruent_green_circles incongruent_red_squares incongruent_red_circles
##                      <num>                   <num>                   <num>
## 1:               0.3321140               0.7497325              0.05748557
## 2:               0.5492948               0.3272924              0.54439835
## 3:               0.5230726               0.5345912              0.46163305
##    incongruent_green_squares incongruent_green_circles
##                        <num>                     <num>
## 1:                 0.5594327                 0.3080047
## 2:                 0.6651799                 0.7455875
## 3:                 0.7163730                 0.8415654

Why long or wide format?

Long format looks better.

##     subject                 Condition ResponseTime
##       <num>                    <fctr>        <num>
##  1:       1     congruent_red_squares   0.63172725
##  2:       2     congruent_red_squares   0.77623095
##  3:       3     congruent_red_squares   0.89293911
##  4:       1     congruent_red_circles   0.57121216
##  5:       2     congruent_red_circles   0.68884069
##  6:       3     congruent_red_circles   0.26426613
##  7:       1   congruent_green_squares   0.99682401
##  8:       2   congruent_green_squares   0.88973048
##  9:       3   congruent_green_squares   0.60057390
## 10:       1   congruent_green_circles   0.63776503
## 11:       2   congruent_green_circles   0.19114320
## 12:       3   congruent_green_circles   0.95576341
## 13:       1   incongruent_red_squares   0.59653883
## 14:       2   incongruent_red_squares   0.72418131
## 15:       3   incongruent_red_squares   0.07083491
## 16:       1   incongruent_red_circles   0.38905937
## 17:       2   incongruent_red_circles   0.09588949
## 18:       3   incongruent_red_circles   0.30960923
## 19:       1 incongruent_green_squares   0.11316197
## 20:       2 incongruent_green_squares   0.66240201
## 21:       3 incongruent_green_squares   0.07481482
## 22:       1 incongruent_green_circles   0.14808223
## 23:       2 incongruent_green_circles   0.85760981
## 24:       3 incongruent_green_circles   0.89932586
##     subject                 Condition ResponseTime

Why long or wide format?

Long format makes grouping / aggregation much easier and scalable.

# compute mean ResponseTime per Condition from long format
long_dt[, mean(ResponseTime), .(Condition)]

# compute mean ResponseTime per Condition from wide format
wide_dt[, .( mean(congruent_red_squares), 
            mean(congruent_red_circles), 
            mean(congruent_green_squares), 
            mean(congruent_green_circles), 
            mean(incongruent_red_squares), 
            mean(incongruent_red_circles), 
            mean(incongruent_green_squares), 
            mean(incongruent_green_circles)
            )]

`melt()` for wide to long

library(data.table)

wide_dt <- data.table(
  Subject = c(1, 2, 3),
  Congruent = c(0.512, 0.469, 0.482),
  Incongruent = c(0.710, 0.689, 0.725)
)

# Convert from wide to long format
long_dt <- melt(wide_dt, 
                id.vars = "Subject", 
                variable.name = "Condition",
                value.name = "ResponseTime")

The id.vars parameter specifies columns that should remain unchanged, while variable.name and value.name parameters rename the melted columns.

`dcast()` for long to wide

# Convert back to wide format for comparison
wide_dt_again <- dcast(long_dt, 
                       Subject ~ Condition, 
                       value.var = "ResponseTime")
print(wide_dt_again)

## Key: <Subject>
##    Subject Congruent Incongruent
##      <num>     <num>       <num>
## 1:       1     0.512       0.710
## 2:       2     0.469       0.689
## 3:       3     0.482       0.725

long_dt: The long format input data.table
Subject ~ Condition: Group by Subject and spread columns out named after each Condition
value.var = "ResponseTime": Specifies which column to fill values from.

Question 1

Given dt_sales below, which line of code correctly converts dt_sales into a long format?

   Product Q1_Sales Q2_Sales
1:       A      150      200
2:       B      100      150

melt(dt_sales, id.vars = "Product", measure.vars = c("Q1_Sales", "Q2_Sales"))
dcast(dt_sales, Product ~ Q1_Sales + Q2_Sales, value.var = "Sales")
dt_sales[, .(Product, Q1_Sales, Q2_Sales)]
melt(dt_sales, id.vars = "Q1_Sales", measure.vars = "Product")

Click here for the answer

  melt(dt_sales, id.vars = "Product", measure.vars = c("Q1_Sales", "Q2_Sales"))`

Question 2

Given dt_long_sales in long format which line of code correctly converts it to wide format, where each row represents a product and each quarter’s sales are in separate columns?

   Product   Quarter Sales
1:       A   Q1_Sales  150
2:       A   Q2_Sales  200
3:       B   Q1_Sales  100
4:       B   Q2_Sales  150

dcast(dt_long_sales, Product ~ Sales, value.var = "Quarter")
dcast(dt_long_sales, Product ~ Quarter, value.var = "Sales")
dt_long_sales[, .(Product, Quarter, Sales)]
dcast(dt_long_sales, Quarter ~ Product, value.var = "Sales")

Click here for the answer

  dcast(dt_long_sales, Product ~ Quarter, value.var = "Sales")

Introduction to Data Reshaping

Wide Format data.table Example

Long Format data.table Example

Why long or wide format?

Why long or wide format?

Why long or wide format?

melt() for wide to long

dcast() for long to wide

Question 1

Question 2

`melt()` for wide to long

`dcast()` for long to wide