2024

Introduction to Data Reshaping

Reshaping data involves changing how your data is organised in whatever sort of container you are using to store it. For us, this container will almost always be a data.table Luckily, the data.table package in R provides efficient functions for reshaping.

Wide Format data.table Example

##    Subject Congruent Incongruent
##      <num>     <num>       <num>
## 1:       1     0.512       0.710
## 2:       2     0.469       0.689
## 3:       3     0.482       0.725
  • Each column represents a different variable or condition (e.g., “Congruent” and “Incongruent” response times).

Long Format data.table Example

##    Subject   Condition ResponseTime
##      <num>      <fctr>        <num>
## 1:       1   Congruent        0.512
## 2:       2   Congruent        0.469
## 3:       3   Congruent        0.482
## 4:       1 Incongruent        0.710
## 5:       2 Incongruent        0.689
## 6:       3 Incongruent        0.725
  • Subject: Identifier for each subject.
  • Condition: Specifies the condition under which the measurement was taken (e.g., “Congruent” or “Incongruent”).
  • ResponseTime: The measured response time for the condition.

Why long or wide format?

Wide format can look bad.

##    subject congruent_red_squares congruent_red_circles congruent_green_squares
##      <num>                 <num>                 <num>                   <num>
## 1:       1             0.5639383             0.1690465               0.3205858
## 2:       2             0.3716122             0.6094975               0.2782768
## 3:       3             0.9054757             0.4053170               0.2706870
##    congruent_green_circles incongruent_red_squares incongruent_red_circles
##                      <num>                   <num>                   <num>
## 1:              0.83545595               0.4734518               0.3437299
## 2:              0.04074369               0.3332120               0.3652781
## 3:              0.20165058               0.9635328               0.6204369
##    incongruent_green_squares incongruent_green_circles
##                        <num>                     <num>
## 1:                 0.1981469                 0.1624775
## 2:                 0.2732871                 0.1931977
## 3:                 0.9353818                 0.4349281

Why long or wide format?

Long format looks better.

##     subject                 Condition ResponseTime
##       <num>                    <fctr>        <num>
##  1:       1     congruent_red_squares   0.57028445
##  2:       2     congruent_red_squares   0.76218085
##  3:       3     congruent_red_squares   0.05483344
##  4:       1     congruent_red_circles   0.83374341
##  5:       2     congruent_red_circles   0.42364944
##  6:       3     congruent_red_circles   0.43459494
##  7:       1   congruent_green_squares   0.61919366
##  8:       2   congruent_green_squares   0.37556849
##  9:       3   congruent_green_squares   0.60765499
## 10:       1   congruent_green_circles   0.03361356
## 11:       2   congruent_green_circles   0.29743561
## 12:       3   congruent_green_circles   0.56835194
## 13:       1   incongruent_red_squares   0.08571122
## 14:       2   incongruent_red_squares   0.79735083
## 15:       3   incongruent_red_squares   0.43901876
## 16:       1   incongruent_red_circles   0.01041793
## 17:       2   incongruent_red_circles   0.84409133
## 18:       3   incongruent_red_circles   0.32319304
## 19:       1 incongruent_green_squares   0.86643055
## 20:       2 incongruent_green_squares   0.72170326
## 21:       3 incongruent_green_squares   0.26434592
## 22:       1 incongruent_green_circles   0.75671073
## 23:       2 incongruent_green_circles   0.89298880
## 24:       3 incongruent_green_circles   0.21561802
##     subject                 Condition ResponseTime

Why long or wide format?

Long format makes grouping / aggregation much easier and scalable.

# compute mean ResponseTime per Condition from long format
long_dt[, mean(ResponseTime), .(Condition)]

# compute mean ResponseTime per Condition from wide format
wide_dt[, .( mean(congruent_red_squares), 
            mean(congruent_red_circles), 
            mean(congruent_green_squares), 
            mean(congruent_green_circles), 
            mean(incongruent_red_squares), 
            mean(incongruent_red_circles), 
            mean(incongruent_green_squares), 
            mean(incongruent_green_circles)
            )]

melt() for wide to long

library(data.table)

wide_dt <- data.table(
  Subject = c(1, 2, 3),
  Congruent = c(0.512, 0.469, 0.482),
  Incongruent = c(0.710, 0.689, 0.725)
)

# Convert from wide to long format
long_dt <- melt(wide_dt, 
                id.vars = "Subject", 
                variable.name = "Condition",
                value.name = "ResponseTime")
  • The id.vars parameter specifies columns that should remain unchanged, while variable.name and value.name parameters rename the melted columns.

dcast() for long to wide

# Convert back to wide format for comparison
wide_dt_again <- dcast(long_dt, 
                       Subject ~ Condition, 
                       value.var = "ResponseTime")
print(wide_dt_again)
## Key: <Subject>
##    Subject Congruent Incongruent
##      <num>     <num>       <num>
## 1:       1     0.512       0.710
## 2:       2     0.469       0.689
## 3:       3     0.482       0.725
  • long_dt: The long format input data.table

  • Subject ~ Condition: Group by Subject and spread columns out named after each Condition

  • value.var = "ResponseTime": Specifies which column to fill values from.

Question 1

Given dt_sales below, which line of code correctly converts dt_sales into a long format?

   Product Q1_Sales Q2_Sales
1:       A      150      200
2:       B      100      150
melt(dt_sales, id.vars = "Product", measure.vars = c("Q1_Sales", "Q2_Sales"))
dcast(dt_sales, Product ~ Q1_Sales + Q2_Sales, value.var = "Sales")
dt_sales[, .(Product, Q1_Sales, Q2_Sales)]
melt(dt_sales, id.vars = "Q1_Sales", measure.vars = "Product")
Click here for the answer
  melt(dt_sales, id.vars = "Product", measure.vars = c("Q1_Sales", "Q2_Sales"))`

Question 2

Given dt_long_sales in long format which line of code correctly converts it to wide format, where each row represents a product and each quarter’s sales are in separate columns?

   Product   Quarter Sales
1:       A   Q1_Sales  150
2:       A   Q2_Sales  200
3:       B   Q1_Sales  100
4:       B   Q2_Sales  150
dcast(dt_long_sales, Product ~ Sales, value.var = "Quarter")
dcast(dt_long_sales, Product ~ Quarter, value.var = "Sales")
dt_long_sales[, .(Product, Quarter, Sales)]
dcast(dt_long_sales, Quarter ~ Product, value.var = "Sales")
Click here for the answer
  dcast(dt_long_sales, Product ~ Quarter, value.var = "Sales")