2020/09/22

R / data wrangling

Been a while since I blogged here. Where does time go? On Twitter, I realized it’s just about time for spooky plots:

U.S. existing home sales hit a seasonally adjusted annual rate of 6 million in August 2020, first time at 6 million since 2006 pic.twitter.com/4ZKrO2d0zN
— 📈 Len Kiefer 📊 (@lenkiefer) September 22, 2020

Maybe too early?

In this post, I want to share a simple R code pattern that’s been useful for me. Using data.table’s dcast function to create a nice table. Together with gt you can make some decent stuff.

library(gt)
library(data.table)
library(tidyverse)

Get some data

Let’s get some data. How about house prices? Let’s get the Freddie Mac House Price Index.

# load data
dt <- fread("http://www.freddiemac.com/fmac-resources/research/docs/fmhpi_master_file.csv")

Create 12-month house price growth rates.

dt[,hpa_yoy:=Index_SA/shift(Index_SA,12)-1,.(GEO_Type,GEO_Code,GEO_Name)]

Make a table

Let’s focus on one area, Virginia (GEO_Name=“VA” in the file), and chart the annual house price growth rates since 2011.

dt_va<- dt[GEO_Name=="VA" & Year>2010,]
head(dt_va)

##    Year Month GEO_Type GEO_Name GEO_Code Index_NSA Index_SA     hpa_yoy
## 1: 2011     1    State       VA        .  151.2373 153.2461 -0.04462630
## 2: 2011     2    State       VA        .  151.2157 153.1026 -0.04435817
## 3: 2011     3    State       VA        .  152.4072 153.0405 -0.04148806
## 4: 2011     4    State       VA        .  154.2866 153.1896 -0.03705511
## 5: 2011     5    State       VA        .  155.9039 153.2191 -0.03329706
## 6: 2011     6    State       VA        .  156.6307 152.9682 -0.02927971

Now we could use dplyr::pivot_wider to make our table, but recently I’ve been using data.table.

dcast(dt_va,Year~Month,value.var="hpa_yoy")

##     Year            1            2            3            4            5
##  1: 2011 -0.044626301 -0.044358166 -0.041488056 -0.037055110 -0.033297061
##  2: 2012 -0.009463327 -0.007026538 -0.004857802 -0.002297582  0.001727123
##  3: 2013  0.044484003  0.050000433  0.053948845  0.055360465  0.056554423
##  4: 2014  0.053463682  0.049736967  0.046012042  0.041707616  0.036186832
##  5: 2015  0.019442867  0.015968104  0.013413606  0.013505313  0.015992844
##  6: 2016  0.024926148  0.028235691  0.031062403  0.032352299  0.032912416
##  7: 2017  0.032755180  0.033205562  0.035454685  0.040142128  0.042344583
##  8: 2018  0.045999614  0.049474007  0.045810536  0.039358855  0.033843593
##  9: 2019  0.038207896  0.036018656  0.036824415  0.039998350  0.043124212
## 10: 2020  0.051758901  0.054343291  0.051047192  0.045608782  0.041815160
##                6           7           8           9          10          11
##  1: -0.029279712 -0.02398674 -0.01685799 -0.01191744 -0.01109453 -0.01141400
##  2:  0.007268559  0.01405302  0.01969788  0.02615740  0.03565749  0.04115821
##  3:  0.058064410  0.05914242  0.05899473  0.05455433  0.04878094  0.04745076
##  4:  0.031178462  0.02682621  0.02406790  0.02385549  0.02417547  0.02486742
##  5:  0.018995642  0.02105085  0.02229313  0.02494160  0.02594623  0.02510597
##  6:  0.032605494  0.03054026  0.02928032  0.02968501  0.03149942  0.03314164
##  7:  0.042195974  0.04341698  0.04417151  0.04145725  0.03836104  0.03776823
##  8:  0.032785103  0.03358091  0.03739476  0.04082979  0.04374959  0.04432131
##  9:  0.044731015  0.04121706  0.04009124  0.03962240  0.03900515  0.04099818
## 10:  0.045235922  0.04844981          NA          NA          NA          NA
##              12
##  1: -0.01172155
##  2:  0.04241783
##  3:  0.05182596
##  4:  0.02285520
##  5:  0.02399759
##  6:  0.03312838
##  7:  0.04090315
##  8:  0.04198894
##  9:  0.04579517
## 10:          NA

Let’s clean it up a bit, converting months to month names, and rounding the values.

dt_va[,mname:=factor(month.abb[Month],levels=month.abb)]
dt_tab <- dcast(dt_va,Year~mname,value.var="hpa_yoy")
dt_tab[]

##     Year          Jan          Feb          Mar          Apr          May
##  1: 2011 -0.044626301 -0.044358166 -0.041488056 -0.037055110 -0.033297061
##  2: 2012 -0.009463327 -0.007026538 -0.004857802 -0.002297582  0.001727123
##  3: 2013  0.044484003  0.050000433  0.053948845  0.055360465  0.056554423
##  4: 2014  0.053463682  0.049736967  0.046012042  0.041707616  0.036186832
##  5: 2015  0.019442867  0.015968104  0.013413606  0.013505313  0.015992844
##  6: 2016  0.024926148  0.028235691  0.031062403  0.032352299  0.032912416
##  7: 2017  0.032755180  0.033205562  0.035454685  0.040142128  0.042344583
##  8: 2018  0.045999614  0.049474007  0.045810536  0.039358855  0.033843593
##  9: 2019  0.038207896  0.036018656  0.036824415  0.039998350  0.043124212
## 10: 2020  0.051758901  0.054343291  0.051047192  0.045608782  0.041815160
##              Jun         Jul         Aug         Sep         Oct         Nov
##  1: -0.029279712 -0.02398674 -0.01685799 -0.01191744 -0.01109453 -0.01141400
##  2:  0.007268559  0.01405302  0.01969788  0.02615740  0.03565749  0.04115821
##  3:  0.058064410  0.05914242  0.05899473  0.05455433  0.04878094  0.04745076
##  4:  0.031178462  0.02682621  0.02406790  0.02385549  0.02417547  0.02486742
##  5:  0.018995642  0.02105085  0.02229313  0.02494160  0.02594623  0.02510597
##  6:  0.032605494  0.03054026  0.02928032  0.02968501  0.03149942  0.03314164
##  7:  0.042195974  0.04341698  0.04417151  0.04145725  0.03836104  0.03776823
##  8:  0.032785103  0.03358091  0.03739476  0.04082979  0.04374959  0.04432131
##  9:  0.044731015  0.04121706  0.04009124  0.03962240  0.03900515  0.04099818
## 10:  0.045235922  0.04844981          NA          NA          NA          NA
##             Dec
##  1: -0.01172155
##  2:  0.04241783
##  3:  0.05182596
##  4:  0.02285520
##  5:  0.02399759
##  6:  0.03312838
##  7:  0.04090315
##  8:  0.04198894
##  9:  0.04579517
## 10:          NA

Now we can touch it up a bit with the gt package.

gt(dt_tab)

Year	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2011	-0.044626301	-0.044358166	-0.041488056	-0.037055110	-0.033297061	-0.029279712	-0.02398674	-0.01685799	-0.01191744	-0.01109453	-0.01141400	-0.01172155
2012	-0.009463327	-0.007026538	-0.004857802	-0.002297582	0.001727123	0.007268559	0.01405302	0.01969788	0.02615740	0.03565749	0.04115821	0.04241783
2013	0.044484003	0.050000433	0.053948845	0.055360465	0.056554423	0.058064410	0.05914242	0.05899473	0.05455433	0.04878094	0.04745076	0.05182596
2014	0.053463682	0.049736967	0.046012042	0.041707616	0.036186832	0.031178462	0.02682621	0.02406790	0.02385549	0.02417547	0.02486742	0.02285520
2015	0.019442867	0.015968104	0.013413606	0.013505313	0.015992844	0.018995642	0.02105085	0.02229313	0.02494160	0.02594623	0.02510597	0.02399759
2016	0.024926148	0.028235691	0.031062403	0.032352299	0.032912416	0.032605494	0.03054026	0.02928032	0.02968501	0.03149942	0.03314164	0.03312838
2017	0.032755180	0.033205562	0.035454685	0.040142128	0.042344583	0.042195974	0.04341698	0.04417151	0.04145725	0.03836104	0.03776823	0.04090315
2018	0.045999614	0.049474007	0.045810536	0.039358855	0.033843593	0.032785103	0.03358091	0.03739476	0.04082979	0.04374959	0.04432131	0.04198894
2019	0.038207896	0.036018656	0.036824415	0.039998350	0.043124212	0.044731015	0.04121706	0.04009124	0.03962240	0.03900515	0.04099818	0.04579517
2020	0.051758901	0.054343291	0.051047192	0.045608782	0.041815160	0.045235922	0.04844981	NA	NA	NA	NA	NA

We can use gt::fmt_percent to format the percentages and tab_header to add a title and subtitle.

mygt <-
  gt(dt_tab) %>%
  fmt_percent(
    columns = vars(month.abb),
    decimals = 1,
    use_seps = FALSE
  ) %>%
  tab_header(title="Virginia House Price Growth",
             subtitle="12-month % change in Freddie Mac House Price Index")

mygt

Virginia House Price Growth
12-month % change in Freddie Mac House Price Index
Year	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2011	−4.5%	−4.4%	−4.1%	−3.7%	−3.3%	−2.9%	−2.4%	−1.7%	−1.2%	−1.1%	−1.1%	−1.2%
2012	−0.9%	−0.7%	−0.5%	−0.2%	0.2%	0.7%	1.4%	2.0%	2.6%	3.6%	4.1%	4.2%
2013	4.4%	5.0%	5.4%	5.5%	5.7%	5.8%	5.9%	5.9%	5.5%	4.9%	4.7%	5.2%
2014	5.3%	5.0%	4.6%	4.2%	3.6%	3.1%	2.7%	2.4%	2.4%	2.4%	2.5%	2.3%
2015	1.9%	1.6%	1.3%	1.4%	1.6%	1.9%	2.1%	2.2%	2.5%	2.6%	2.5%	2.4%
2016	2.5%	2.8%	3.1%	3.2%	3.3%	3.3%	3.1%	2.9%	3.0%	3.1%	3.3%	3.3%
2017	3.3%	3.3%	3.5%	4.0%	4.2%	4.2%	4.3%	4.4%	4.1%	3.8%	3.8%	4.1%
2018	4.6%	4.9%	4.6%	3.9%	3.4%	3.3%	3.4%	3.7%	4.1%	4.4%	4.4%	4.2%
2019	3.8%	3.6%	3.7%	4.0%	4.3%	4.5%	4.1%	4.0%	4.0%	3.9%	4.1%	4.6%
2020	5.2%	5.4%	5.1%	4.6%	4.2%	4.5%	4.8%	NA	NA	NA	NA	NA

mygt %>%  opt_row_striping()

Virginia House Price Growth
12-month % change in Freddie Mac House Price Index
Year	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2011	−4.5%	−4.4%	−4.1%	−3.7%	−3.3%	−2.9%	−2.4%	−1.7%	−1.2%	−1.1%	−1.1%	−1.2%
2012	−0.9%	−0.7%	−0.5%	−0.2%	0.2%	0.7%	1.4%	2.0%	2.6%	3.6%	4.1%	4.2%
2013	4.4%	5.0%	5.4%	5.5%	5.7%	5.8%	5.9%	5.9%	5.5%	4.9%	4.7%	5.2%
2014	5.3%	5.0%	4.6%	4.2%	3.6%	3.1%	2.7%	2.4%	2.4%	2.4%	2.5%	2.3%
2015	1.9%	1.6%	1.3%	1.4%	1.6%	1.9%	2.1%	2.2%	2.5%	2.6%	2.5%	2.4%
2016	2.5%	2.8%	3.1%	3.2%	3.3%	3.3%	3.1%	2.9%	3.0%	3.1%	3.3%	3.3%
2017	3.3%	3.3%	3.5%	4.0%	4.2%	4.2%	4.3%	4.4%	4.1%	3.8%	3.8%	4.1%
2018	4.6%	4.9%	4.6%	3.9%	3.4%	3.3%	3.4%	3.7%	4.1%	4.4%	4.4%	4.2%
2019	3.8%	3.6%	3.7%	4.0%	4.3%	4.5%	4.1%	4.0%	4.0%	3.9%	4.1%	4.6%
2020	5.2%	5.4%	5.1%	4.6%	4.2%	4.5%	4.8%	NA	NA	NA	NA	NA

Will this win me a table competition (https://blog.rstudio.com/2020/09/15/announcing-the-2020-rstudio-table-contest/)? Probably not.

But it’s a useful little pattern.

Computing summary stastics by group with data.table

Another thing I find useful is to compute by group summary statistics like quantiles.

Let’s compute the mean, 25th, 50th (median), and 75th percentiles of annual HPA across states and metro areas in July of each year. July being the last month we have house price data in 2020.

First we’ll need a helper function.

my_summary = function(x){list(n=NROW(x),
                              mean=mean(x,na.rm=TRUE),
                              q25=unname(quantile(x,0.25,na.rm=TRUE)),
                              q50=unname(quantile(x,0.5,na.rm=TRUE)),
                              q75=unname(quantile(x,0.75,na.rm=TRUE))
                              )}

Now we can apply it using .SD notation with data.table (you’re welcome future me).

dt2 <- 
  dt[GEO_Type!="US" & Year>2010,
   ][,as.list(unlist(lapply(.SD,my_summary))),
     .(GEO_Type,Year,Month),
     .SDcols="hpa_yoy"
     ]

# take a look
dt2[Month==7]

##     GEO_Type Year Month hpa_yoy.n hpa_yoy.mean   hpa_yoy.q25  hpa_yoy.q50
##  1:    State 2011     7        51  -0.03482701 -5.316122e-02 -0.029839886
##  2:    State 2012     7        51   0.01830705  7.397581e-05  0.013366741
##  3:    State 2013     7        51   0.07053574  3.668966e-02  0.052478057
##  4:    State 2014     7        51   0.04148700  2.694736e-02  0.035764061
##  5:    State 2015     7        51   0.04485992  2.795387e-02  0.040659438
##  6:    State 2016     7        51   0.04761339  3.135579e-02  0.045991220
##  7:    State 2017     7        51   0.05675597  4.176833e-02  0.056043898
##  8:    State 2018     7        51   0.05634974  4.033939e-02  0.057227738
##  9:    State 2019     7        51   0.04195195  3.497603e-02  0.041217061
## 10:    State 2020     7        51   0.05152229  3.615695e-02  0.049935410
## 11:     CBSA 2011     7       382  -0.03757777 -5.908591e-02 -0.031512591
## 12:     CBSA 2012     7       382   0.01310780 -5.698265e-03  0.009515327
## 13:     CBSA 2013     7       382   0.06462398  2.609328e-02  0.045835414
## 14:     CBSA 2014     7       382   0.03592216  1.285296e-02  0.027820075
## 15:     CBSA 2015     7       382   0.04182938  1.894401e-02  0.036154296
## 16:     CBSA 2016     7       382   0.04616276  2.545555e-02  0.041371023
## 17:     CBSA 2017     7       382   0.05430462  3.405826e-02  0.053600271
## 18:     CBSA 2018     7       382   0.05742927  4.025364e-02  0.056538657
## 19:     CBSA 2019     7       382   0.04494198  3.186373e-02  0.043979345
## 20:     CBSA 2020     7       382   0.05005214  3.377371e-02  0.048743020
##     hpa_yoy.q75
##  1: -0.01545427
##  2:  0.03129702
##  3:  0.09518173
##  4:  0.05176107
##  5:  0.05237482
##  6:  0.06242702
##  7:  0.06958177
##  8:  0.06643227
##  9:  0.04989990
## 10:  0.06552478
## 11: -0.01305736
## 12:  0.02752671
## 13:  0.08001134
## 14:  0.05244303
## 15:  0.05708949
## 16:  0.06520285
## 17:  0.07370181
## 18:  0.07256919
## 19:  0.05834908
## 20:  0.06624088

Now we’ll melt the columns to convert it to long format.

dt3 <-
  melt(dt2, id.vars=c("GEO_Type","Year","Month"))

dt3[]

##       GEO_Type Year Month    variable       value
##    1:    State 2011     1   hpa_yoy.n 51.00000000
##    2:    State 2011     2   hpa_yoy.n 51.00000000
##    3:    State 2011     3   hpa_yoy.n 51.00000000
##    4:    State 2011     4   hpa_yoy.n 51.00000000
##    5:    State 2011     5   hpa_yoy.n 51.00000000
##   ---                                            
## 1146:     CBSA 2020     3 hpa_yoy.q75  0.06599967
## 1147:     CBSA 2020     4 hpa_yoy.q75  0.06335815
## 1148:     CBSA 2020     5 hpa_yoy.q75  0.06084890
## 1149:     CBSA 2020     6 hpa_yoy.q75  0.06381227
## 1150:     CBSA 2020     7 hpa_yoy.q75  0.06624088

Now we’ve got our variables stored as variable. But we want to split them. There’s the handy data.table function tstrsplit() to help us.

dt3[,c("var","stat") :=tstrsplit(variable,".",fixed=TRUE)]
dt3

##       GEO_Type Year Month    variable       value     var stat
##    1:    State 2011     1   hpa_yoy.n 51.00000000 hpa_yoy    n
##    2:    State 2011     2   hpa_yoy.n 51.00000000 hpa_yoy    n
##    3:    State 2011     3   hpa_yoy.n 51.00000000 hpa_yoy    n
##    4:    State 2011     4   hpa_yoy.n 51.00000000 hpa_yoy    n
##    5:    State 2011     5   hpa_yoy.n 51.00000000 hpa_yoy    n
##   ---                                                         
## 1146:     CBSA 2020     3 hpa_yoy.q75  0.06599967 hpa_yoy  q75
## 1147:     CBSA 2020     4 hpa_yoy.q75  0.06335815 hpa_yoy  q75
## 1148:     CBSA 2020     5 hpa_yoy.q75  0.06084890 hpa_yoy  q75
## 1149:     CBSA 2020     6 hpa_yoy.q75  0.06381227 hpa_yoy  q75
## 1150:     CBSA 2020     7 hpa_yoy.q75  0.06624088 hpa_yoy  q75

dt4<- dcast(dt3[Month==7], formula=Year+GEO_Type~stat, value.var="value")
setcolorder(dt4, c("GEO_Type","Year","n","mean","q25","q50","q75"))

  gt(dt4) %>%
  fmt_percent(
    columns = vars("mean","q25","q50","q75"),
    decimals = 1,
    use_seps = FALSE
  ) %>%
  tab_header(title="State and Metro (CBSA) house price growth",
             subtitle="12-month % change in Freddie Mac House Price Index (July)") %>%  
  opt_row_striping()

State and Metro (CBSA) house price growth
12-month % change in Freddie Mac House Price Index (July)
GEO_Type	Year	n	mean	q25	q50	q75
CBSA	2011	382	−3.8%	−5.9%	−3.2%	−1.3%
State	2011	51	−3.5%	−5.3%	−3.0%	−1.5%
CBSA	2012	382	1.3%	−0.6%	1.0%	2.8%
State	2012	51	1.8%	0.0%	1.3%	3.1%
CBSA	2013	382	6.5%	2.6%	4.6%	8.0%
State	2013	51	7.1%	3.7%	5.2%	9.5%
CBSA	2014	382	3.6%	1.3%	2.8%	5.2%
State	2014	51	4.1%	2.7%	3.6%	5.2%
CBSA	2015	382	4.2%	1.9%	3.6%	5.7%
State	2015	51	4.5%	2.8%	4.1%	5.2%
CBSA	2016	382	4.6%	2.5%	4.1%	6.5%
State	2016	51	4.8%	3.1%	4.6%	6.2%
CBSA	2017	382	5.4%	3.4%	5.4%	7.4%
State	2017	51	5.7%	4.2%	5.6%	7.0%
CBSA	2018	382	5.7%	4.0%	5.7%	7.3%
State	2018	51	5.6%	4.0%	5.7%	6.6%
CBSA	2019	382	4.5%	3.2%	4.4%	5.8%
State	2019	51	4.2%	3.5%	4.1%	5.0%
CBSA	2020	382	5.0%	3.4%	4.9%	6.6%
State	2020	51	5.2%	3.6%	5.0%	6.6%

dcasting a data table

Making a presentation table with data.table and gt R packages

Get some data

Make a table

Computing summary stastics by group with data.table

Share!