dcasting a data table

Making a presentation table with data.table and gt R packages

Been a while since I blogged here. Where does time go? On Twitter, I realized it’s just about time for spooky plots:

Maybe too early?

In this post, I want to share a simple R code pattern that’s been useful for me. Using data.table’s dcast function to create a nice table. Together with gt you can make some decent stuff.

library(gt)
library(data.table)
library(tidyverse)

Get some data

Let’s get some data. How about house prices? Let’s get the Freddie Mac House Price Index.

# load data
dt <- fread("http://www.freddiemac.com/fmac-resources/research/docs/fmhpi_master_file.csv")

Create 12-month house price growth rates.

dt[,hpa_yoy:=Index_SA/shift(Index_SA,12)-1,.(GEO_Type,GEO_Code,GEO_Name)]

Make a table

Let’s focus on one area, Virginia (GEO_Name=“VA” in the file), and chart the annual house price growth rates since 2011.

dt_va<- dt[GEO_Name=="VA" & Year>2010,]
head(dt_va)
##    Year Month GEO_Type GEO_Name GEO_Code Index_NSA Index_SA     hpa_yoy
## 1: 2011     1    State       VA        .  151.2373 153.2461 -0.04462630
## 2: 2011     2    State       VA        .  151.2157 153.1026 -0.04435817
## 3: 2011     3    State       VA        .  152.4072 153.0405 -0.04148806
## 4: 2011     4    State       VA        .  154.2866 153.1896 -0.03705511
## 5: 2011     5    State       VA        .  155.9039 153.2191 -0.03329706
## 6: 2011     6    State       VA        .  156.6307 152.9682 -0.02927971

Now we could use dplyr::pivot_wider to make our table, but recently I’ve been using data.table.

dcast(dt_va,Year~Month,value.var="hpa_yoy")
##     Year            1            2            3            4            5
##  1: 2011 -0.044626301 -0.044358166 -0.041488056 -0.037055110 -0.033297061
##  2: 2012 -0.009463327 -0.007026538 -0.004857802 -0.002297582  0.001727123
##  3: 2013  0.044484003  0.050000433  0.053948845  0.055360465  0.056554423
##  4: 2014  0.053463682  0.049736967  0.046012042  0.041707616  0.036186832
##  5: 2015  0.019442867  0.015968104  0.013413606  0.013505313  0.015992844
##  6: 2016  0.024926148  0.028235691  0.031062403  0.032352299  0.032912416
##  7: 2017  0.032755180  0.033205562  0.035454685  0.040142128  0.042344583
##  8: 2018  0.045999614  0.049474007  0.045810536  0.039358855  0.033843593
##  9: 2019  0.038207896  0.036018656  0.036824415  0.039998350  0.043124212
## 10: 2020  0.051758901  0.054343291  0.051047192  0.045608782  0.041815160
##                6           7           8           9          10          11
##  1: -0.029279712 -0.02398674 -0.01685799 -0.01191744 -0.01109453 -0.01141400
##  2:  0.007268559  0.01405302  0.01969788  0.02615740  0.03565749  0.04115821
##  3:  0.058064410  0.05914242  0.05899473  0.05455433  0.04878094  0.04745076
##  4:  0.031178462  0.02682621  0.02406790  0.02385549  0.02417547  0.02486742
##  5:  0.018995642  0.02105085  0.02229313  0.02494160  0.02594623  0.02510597
##  6:  0.032605494  0.03054026  0.02928032  0.02968501  0.03149942  0.03314164
##  7:  0.042195974  0.04341698  0.04417151  0.04145725  0.03836104  0.03776823
##  8:  0.032785103  0.03358091  0.03739476  0.04082979  0.04374959  0.04432131
##  9:  0.044731015  0.04121706  0.04009124  0.03962240  0.03900515  0.04099818
## 10:  0.045235922  0.04844981          NA          NA          NA          NA
##              12
##  1: -0.01172155
##  2:  0.04241783
##  3:  0.05182596
##  4:  0.02285520
##  5:  0.02399759
##  6:  0.03312838
##  7:  0.04090315
##  8:  0.04198894
##  9:  0.04579517
## 10:          NA

Let’s clean it up a bit, converting months to month names, and rounding the values.

dt_va[,mname:=factor(month.abb[Month],levels=month.abb)]
dt_tab <- dcast(dt_va,Year~mname,value.var="hpa_yoy")
dt_tab[]
##     Year          Jan          Feb          Mar          Apr          May
##  1: 2011 -0.044626301 -0.044358166 -0.041488056 -0.037055110 -0.033297061
##  2: 2012 -0.009463327 -0.007026538 -0.004857802 -0.002297582  0.001727123
##  3: 2013  0.044484003  0.050000433  0.053948845  0.055360465  0.056554423
##  4: 2014  0.053463682  0.049736967  0.046012042  0.041707616  0.036186832
##  5: 2015  0.019442867  0.015968104  0.013413606  0.013505313  0.015992844
##  6: 2016  0.024926148  0.028235691  0.031062403  0.032352299  0.032912416
##  7: 2017  0.032755180  0.033205562  0.035454685  0.040142128  0.042344583
##  8: 2018  0.045999614  0.049474007  0.045810536  0.039358855  0.033843593
##  9: 2019  0.038207896  0.036018656  0.036824415  0.039998350  0.043124212
## 10: 2020  0.051758901  0.054343291  0.051047192  0.045608782  0.041815160
##              Jun         Jul         Aug         Sep         Oct         Nov
##  1: -0.029279712 -0.02398674 -0.01685799 -0.01191744 -0.01109453 -0.01141400
##  2:  0.007268559  0.01405302  0.01969788  0.02615740  0.03565749  0.04115821
##  3:  0.058064410  0.05914242  0.05899473  0.05455433  0.04878094  0.04745076
##  4:  0.031178462  0.02682621  0.02406790  0.02385549  0.02417547  0.02486742
##  5:  0.018995642  0.02105085  0.02229313  0.02494160  0.02594623  0.02510597
##  6:  0.032605494  0.03054026  0.02928032  0.02968501  0.03149942  0.03314164
##  7:  0.042195974  0.04341698  0.04417151  0.04145725  0.03836104  0.03776823
##  8:  0.032785103  0.03358091  0.03739476  0.04082979  0.04374959  0.04432131
##  9:  0.044731015  0.04121706  0.04009124  0.03962240  0.03900515  0.04099818
## 10:  0.045235922  0.04844981          NA          NA          NA          NA
##             Dec
##  1: -0.01172155
##  2:  0.04241783
##  3:  0.05182596
##  4:  0.02285520
##  5:  0.02399759
##  6:  0.03312838
##  7:  0.04090315
##  8:  0.04198894
##  9:  0.04579517
## 10:          NA

Now we can touch it up a bit with the gt package.

gt(dt_tab)
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 -0.044626301 -0.044358166 -0.041488056 -0.037055110 -0.033297061 -0.029279712 -0.02398674 -0.01685799 -0.01191744 -0.01109453 -0.01141400 -0.01172155
2012 -0.009463327 -0.007026538 -0.004857802 -0.002297582 0.001727123 0.007268559 0.01405302 0.01969788 0.02615740 0.03565749 0.04115821 0.04241783
2013 0.044484003 0.050000433 0.053948845 0.055360465 0.056554423 0.058064410 0.05914242 0.05899473 0.05455433 0.04878094 0.04745076 0.05182596
2014 0.053463682 0.049736967 0.046012042 0.041707616 0.036186832 0.031178462 0.02682621 0.02406790 0.02385549 0.02417547 0.02486742 0.02285520
2015 0.019442867 0.015968104 0.013413606 0.013505313 0.015992844 0.018995642 0.02105085 0.02229313 0.02494160 0.02594623 0.02510597 0.02399759
2016 0.024926148 0.028235691 0.031062403 0.032352299 0.032912416 0.032605494 0.03054026 0.02928032 0.02968501 0.03149942 0.03314164 0.03312838
2017 0.032755180 0.033205562 0.035454685 0.040142128 0.042344583 0.042195974 0.04341698 0.04417151 0.04145725 0.03836104 0.03776823 0.04090315
2018 0.045999614 0.049474007 0.045810536 0.039358855 0.033843593 0.032785103 0.03358091 0.03739476 0.04082979 0.04374959 0.04432131 0.04198894
2019 0.038207896 0.036018656 0.036824415 0.039998350 0.043124212 0.044731015 0.04121706 0.04009124 0.03962240 0.03900515 0.04099818 0.04579517
2020 0.051758901 0.054343291 0.051047192 0.045608782 0.041815160 0.045235922 0.04844981 NA NA NA NA NA

We can use gt::fmt_percent to format the percentages and tab_header to add a title and subtitle.

mygt <-
  gt(dt_tab) %>%
  fmt_percent(
    columns = vars(month.abb),
    decimals = 1,
    use_seps = FALSE
  ) %>%
  tab_header(title="Virginia House Price Growth",
             subtitle="12-month % change in Freddie Mac House Price Index")

mygt
Virginia House Price Growth
12-month % change in Freddie Mac House Price Index
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 −4.5% −4.4% −4.1% −3.7% −3.3% −2.9% −2.4% −1.7% −1.2% −1.1% −1.1% −1.2%
2012 −0.9% −0.7% −0.5% −0.2% 0.2% 0.7% 1.4% 2.0% 2.6% 3.6% 4.1% 4.2%
2013 4.4% 5.0% 5.4% 5.5% 5.7% 5.8% 5.9% 5.9% 5.5% 4.9% 4.7% 5.2%
2014 5.3% 5.0% 4.6% 4.2% 3.6% 3.1% 2.7% 2.4% 2.4% 2.4% 2.5% 2.3%
2015 1.9% 1.6% 1.3% 1.4% 1.6% 1.9% 2.1% 2.2% 2.5% 2.6% 2.5% 2.4%
2016 2.5% 2.8% 3.1% 3.2% 3.3% 3.3% 3.1% 2.9% 3.0% 3.1% 3.3% 3.3%
2017 3.3% 3.3% 3.5% 4.0% 4.2% 4.2% 4.3% 4.4% 4.1% 3.8% 3.8% 4.1%
2018 4.6% 4.9% 4.6% 3.9% 3.4% 3.3% 3.4% 3.7% 4.1% 4.4% 4.4% 4.2%
2019 3.8% 3.6% 3.7% 4.0% 4.3% 4.5% 4.1% 4.0% 4.0% 3.9% 4.1% 4.6%
2020 5.2% 5.4% 5.1% 4.6% 4.2% 4.5% 4.8% NA NA NA NA NA
mygt %>%  opt_row_striping()
Virginia House Price Growth
12-month % change in Freddie Mac House Price Index
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 −4.5% −4.4% −4.1% −3.7% −3.3% −2.9% −2.4% −1.7% −1.2% −1.1% −1.1% −1.2%
2012 −0.9% −0.7% −0.5% −0.2% 0.2% 0.7% 1.4% 2.0% 2.6% 3.6% 4.1% 4.2%
2013 4.4% 5.0% 5.4% 5.5% 5.7% 5.8% 5.9% 5.9% 5.5% 4.9% 4.7% 5.2%
2014 5.3% 5.0% 4.6% 4.2% 3.6% 3.1% 2.7% 2.4% 2.4% 2.4% 2.5% 2.3%
2015 1.9% 1.6% 1.3% 1.4% 1.6% 1.9% 2.1% 2.2% 2.5% 2.6% 2.5% 2.4%
2016 2.5% 2.8% 3.1% 3.2% 3.3% 3.3% 3.1% 2.9% 3.0% 3.1% 3.3% 3.3%
2017 3.3% 3.3% 3.5% 4.0% 4.2% 4.2% 4.3% 4.4% 4.1% 3.8% 3.8% 4.1%
2018 4.6% 4.9% 4.6% 3.9% 3.4% 3.3% 3.4% 3.7% 4.1% 4.4% 4.4% 4.2%
2019 3.8% 3.6% 3.7% 4.0% 4.3% 4.5% 4.1% 4.0% 4.0% 3.9% 4.1% 4.6%
2020 5.2% 5.4% 5.1% 4.6% 4.2% 4.5% 4.8% NA NA NA NA NA

Will this win me a table competition (https://blog.rstudio.com/2020/09/15/announcing-the-2020-rstudio-table-contest/)? Probably not.

But it’s a useful little pattern.

Computing summary stastics by group with data.table

Another thing I find useful is to compute by group summary statistics like quantiles.

Let’s compute the mean, 25th, 50th (median), and 75th percentiles of annual HPA across states and metro areas in July of each year. July being the last month we have house price data in 2020.

First we’ll need a helper function.

my_summary = function(x){list(n=NROW(x),
                              mean=mean(x,na.rm=TRUE),
                              q25=unname(quantile(x,0.25,na.rm=TRUE)),
                              q50=unname(quantile(x,0.5,na.rm=TRUE)),
                              q75=unname(quantile(x,0.75,na.rm=TRUE))
                              )}

Now we can apply it using .SD notation with data.table (you’re welcome future me).

dt2 <- 
  dt[GEO_Type!="US" & Year>2010,
   ][,as.list(unlist(lapply(.SD,my_summary))),
     .(GEO_Type,Year,Month),
     .SDcols="hpa_yoy"
     ]

# take a look
dt2[Month==7]
##     GEO_Type Year Month hpa_yoy.n hpa_yoy.mean   hpa_yoy.q25  hpa_yoy.q50
##  1:    State 2011     7        51  -0.03482701 -5.316122e-02 -0.029839886
##  2:    State 2012     7        51   0.01830705  7.397581e-05  0.013366741
##  3:    State 2013     7        51   0.07053574  3.668966e-02  0.052478057
##  4:    State 2014     7        51   0.04148700  2.694736e-02  0.035764061
##  5:    State 2015     7        51   0.04485992  2.795387e-02  0.040659438
##  6:    State 2016     7        51   0.04761339  3.135579e-02  0.045991220
##  7:    State 2017     7        51   0.05675597  4.176833e-02  0.056043898
##  8:    State 2018     7        51   0.05634974  4.033939e-02  0.057227738
##  9:    State 2019     7        51   0.04195195  3.497603e-02  0.041217061
## 10:    State 2020     7        51   0.05152229  3.615695e-02  0.049935410
## 11:     CBSA 2011     7       382  -0.03757777 -5.908591e-02 -0.031512591
## 12:     CBSA 2012     7       382   0.01310780 -5.698265e-03  0.009515327
## 13:     CBSA 2013     7       382   0.06462398  2.609328e-02  0.045835414
## 14:     CBSA 2014     7       382   0.03592216  1.285296e-02  0.027820075
## 15:     CBSA 2015     7       382   0.04182938  1.894401e-02  0.036154296
## 16:     CBSA 2016     7       382   0.04616276  2.545555e-02  0.041371023
## 17:     CBSA 2017     7       382   0.05430462  3.405826e-02  0.053600271
## 18:     CBSA 2018     7       382   0.05742927  4.025364e-02  0.056538657
## 19:     CBSA 2019     7       382   0.04494198  3.186373e-02  0.043979345
## 20:     CBSA 2020     7       382   0.05005214  3.377371e-02  0.048743020
##     hpa_yoy.q75
##  1: -0.01545427
##  2:  0.03129702
##  3:  0.09518173
##  4:  0.05176107
##  5:  0.05237482
##  6:  0.06242702
##  7:  0.06958177
##  8:  0.06643227
##  9:  0.04989990
## 10:  0.06552478
## 11: -0.01305736
## 12:  0.02752671
## 13:  0.08001134
## 14:  0.05244303
## 15:  0.05708949
## 16:  0.06520285
## 17:  0.07370181
## 18:  0.07256919
## 19:  0.05834908
## 20:  0.06624088

Now we’ll melt the columns to convert it to long format.

dt3 <-
  melt(dt2, id.vars=c("GEO_Type","Year","Month"))

dt3[]
##       GEO_Type Year Month    variable       value
##    1:    State 2011     1   hpa_yoy.n 51.00000000
##    2:    State 2011     2   hpa_yoy.n 51.00000000
##    3:    State 2011     3   hpa_yoy.n 51.00000000
##    4:    State 2011     4   hpa_yoy.n 51.00000000
##    5:    State 2011     5   hpa_yoy.n 51.00000000
##   ---                                            
## 1146:     CBSA 2020     3 hpa_yoy.q75  0.06599967
## 1147:     CBSA 2020     4 hpa_yoy.q75  0.06335815
## 1148:     CBSA 2020     5 hpa_yoy.q75  0.06084890
## 1149:     CBSA 2020     6 hpa_yoy.q75  0.06381227
## 1150:     CBSA 2020     7 hpa_yoy.q75  0.06624088

Now we’ve got our variables stored as variable. But we want to split them. There’s the handy data.table function tstrsplit() to help us.

dt3[,c("var","stat") :=tstrsplit(variable,".",fixed=TRUE)]
dt3
##       GEO_Type Year Month    variable       value     var stat
##    1:    State 2011     1   hpa_yoy.n 51.00000000 hpa_yoy    n
##    2:    State 2011     2   hpa_yoy.n 51.00000000 hpa_yoy    n
##    3:    State 2011     3   hpa_yoy.n 51.00000000 hpa_yoy    n
##    4:    State 2011     4   hpa_yoy.n 51.00000000 hpa_yoy    n
##    5:    State 2011     5   hpa_yoy.n 51.00000000 hpa_yoy    n
##   ---                                                         
## 1146:     CBSA 2020     3 hpa_yoy.q75  0.06599967 hpa_yoy  q75
## 1147:     CBSA 2020     4 hpa_yoy.q75  0.06335815 hpa_yoy  q75
## 1148:     CBSA 2020     5 hpa_yoy.q75  0.06084890 hpa_yoy  q75
## 1149:     CBSA 2020     6 hpa_yoy.q75  0.06381227 hpa_yoy  q75
## 1150:     CBSA 2020     7 hpa_yoy.q75  0.06624088 hpa_yoy  q75
dt4<- dcast(dt3[Month==7], formula=Year+GEO_Type~stat, value.var="value")
setcolorder(dt4, c("GEO_Type","Year","n","mean","q25","q50","q75"))
  gt(dt4) %>%
  fmt_percent(
    columns = vars("mean","q25","q50","q75"),
    decimals = 1,
    use_seps = FALSE
  ) %>%
  tab_header(title="State and Metro (CBSA) house price growth",
             subtitle="12-month % change in Freddie Mac House Price Index (July)") %>%  
  opt_row_striping()
State and Metro (CBSA) house price growth
12-month % change in Freddie Mac House Price Index (July)
GEO_Type Year n mean q25 q50 q75
CBSA 2011 382 −3.8% −5.9% −3.2% −1.3%
State 2011 51 −3.5% −5.3% −3.0% −1.5%
CBSA 2012 382 1.3% −0.6% 1.0% 2.8%
State 2012 51 1.8% 0.0% 1.3% 3.1%
CBSA 2013 382 6.5% 2.6% 4.6% 8.0%
State 2013 51 7.1% 3.7% 5.2% 9.5%
CBSA 2014 382 3.6% 1.3% 2.8% 5.2%
State 2014 51 4.1% 2.7% 3.6% 5.2%
CBSA 2015 382 4.2% 1.9% 3.6% 5.7%
State 2015 51 4.5% 2.8% 4.1% 5.2%
CBSA 2016 382 4.6% 2.5% 4.1% 6.5%
State 2016 51 4.8% 3.1% 4.6% 6.2%
CBSA 2017 382 5.4% 3.4% 5.4% 7.4%
State 2017 51 5.7% 4.2% 5.6% 7.0%
CBSA 2018 382 5.7% 4.0% 5.7% 7.3%
State 2018 51 5.6% 4.0% 5.7% 6.6%
CBSA 2019 382 4.5% 3.2% 4.4% 5.8%
State 2019 51 4.2% 3.5% 4.1% 5.0%
CBSA 2020 382 5.0% 3.4% 4.9% 6.6%
State 2020 51 5.2% 3.6% 5.0% 6.6%