Been a while since I blogged here. Where does time go? On Twitter, I realized it’s just about time for spooky plots:
U.S. existing home sales hit a seasonally adjusted annual rate of 6 million in August 2020, first time at 6 million since 2006 pic.twitter.com/4ZKrO2d0zN
— 📈 Len Kiefer 📊 (@lenkiefer) September 22, 2020
Maybe too early?
In this post, I want to share a simple R code pattern that’s been useful for me. Using data.table’s dcast
function to create a nice table. Together with gt you can make some decent stuff.
library(gt)
library(data.table)
library(tidyverse)
Get some data
Let’s get some data. How about house prices? Let’s get the Freddie Mac House Price Index.
# load data
dt <- fread("http://www.freddiemac.com/fmac-resources/research/docs/fmhpi_master_file.csv")
Create 12-month house price growth rates.
dt[,hpa_yoy:=Index_SA/shift(Index_SA,12)-1,.(GEO_Type,GEO_Code,GEO_Name)]
Make a table
Let’s focus on one area, Virginia (GEO_Name=“VA” in the file), and chart the annual house price growth rates since 2011.
dt_va<- dt[GEO_Name=="VA" & Year>2010,]
head(dt_va)
## Year Month GEO_Type GEO_Name GEO_Code Index_NSA Index_SA hpa_yoy
## 1: 2011 1 State VA . 151.2373 153.2461 -0.04462630
## 2: 2011 2 State VA . 151.2157 153.1026 -0.04435817
## 3: 2011 3 State VA . 152.4072 153.0405 -0.04148806
## 4: 2011 4 State VA . 154.2866 153.1896 -0.03705511
## 5: 2011 5 State VA . 155.9039 153.2191 -0.03329706
## 6: 2011 6 State VA . 156.6307 152.9682 -0.02927971
Now we could use dplyr::pivot_wider to make our table, but recently I’ve been using data.table.
dcast(dt_va,Year~Month,value.var="hpa_yoy")
## Year 1 2 3 4 5
## 1: 2011 -0.044626301 -0.044358166 -0.041488056 -0.037055110 -0.033297061
## 2: 2012 -0.009463327 -0.007026538 -0.004857802 -0.002297582 0.001727123
## 3: 2013 0.044484003 0.050000433 0.053948845 0.055360465 0.056554423
## 4: 2014 0.053463682 0.049736967 0.046012042 0.041707616 0.036186832
## 5: 2015 0.019442867 0.015968104 0.013413606 0.013505313 0.015992844
## 6: 2016 0.024926148 0.028235691 0.031062403 0.032352299 0.032912416
## 7: 2017 0.032755180 0.033205562 0.035454685 0.040142128 0.042344583
## 8: 2018 0.045999614 0.049474007 0.045810536 0.039358855 0.033843593
## 9: 2019 0.038207896 0.036018656 0.036824415 0.039998350 0.043124212
## 10: 2020 0.051758901 0.054343291 0.051047192 0.045608782 0.041815160
## 6 7 8 9 10 11
## 1: -0.029279712 -0.02398674 -0.01685799 -0.01191744 -0.01109453 -0.01141400
## 2: 0.007268559 0.01405302 0.01969788 0.02615740 0.03565749 0.04115821
## 3: 0.058064410 0.05914242 0.05899473 0.05455433 0.04878094 0.04745076
## 4: 0.031178462 0.02682621 0.02406790 0.02385549 0.02417547 0.02486742
## 5: 0.018995642 0.02105085 0.02229313 0.02494160 0.02594623 0.02510597
## 6: 0.032605494 0.03054026 0.02928032 0.02968501 0.03149942 0.03314164
## 7: 0.042195974 0.04341698 0.04417151 0.04145725 0.03836104 0.03776823
## 8: 0.032785103 0.03358091 0.03739476 0.04082979 0.04374959 0.04432131
## 9: 0.044731015 0.04121706 0.04009124 0.03962240 0.03900515 0.04099818
## 10: 0.045235922 0.04844981 NA NA NA NA
## 12
## 1: -0.01172155
## 2: 0.04241783
## 3: 0.05182596
## 4: 0.02285520
## 5: 0.02399759
## 6: 0.03312838
## 7: 0.04090315
## 8: 0.04198894
## 9: 0.04579517
## 10: NA
Let’s clean it up a bit, converting months to month names, and rounding the values.
dt_va[,mname:=factor(month.abb[Month],levels=month.abb)]
dt_tab <- dcast(dt_va,Year~mname,value.var="hpa_yoy")
dt_tab[]
## Year Jan Feb Mar Apr May
## 1: 2011 -0.044626301 -0.044358166 -0.041488056 -0.037055110 -0.033297061
## 2: 2012 -0.009463327 -0.007026538 -0.004857802 -0.002297582 0.001727123
## 3: 2013 0.044484003 0.050000433 0.053948845 0.055360465 0.056554423
## 4: 2014 0.053463682 0.049736967 0.046012042 0.041707616 0.036186832
## 5: 2015 0.019442867 0.015968104 0.013413606 0.013505313 0.015992844
## 6: 2016 0.024926148 0.028235691 0.031062403 0.032352299 0.032912416
## 7: 2017 0.032755180 0.033205562 0.035454685 0.040142128 0.042344583
## 8: 2018 0.045999614 0.049474007 0.045810536 0.039358855 0.033843593
## 9: 2019 0.038207896 0.036018656 0.036824415 0.039998350 0.043124212
## 10: 2020 0.051758901 0.054343291 0.051047192 0.045608782 0.041815160
## Jun Jul Aug Sep Oct Nov
## 1: -0.029279712 -0.02398674 -0.01685799 -0.01191744 -0.01109453 -0.01141400
## 2: 0.007268559 0.01405302 0.01969788 0.02615740 0.03565749 0.04115821
## 3: 0.058064410 0.05914242 0.05899473 0.05455433 0.04878094 0.04745076
## 4: 0.031178462 0.02682621 0.02406790 0.02385549 0.02417547 0.02486742
## 5: 0.018995642 0.02105085 0.02229313 0.02494160 0.02594623 0.02510597
## 6: 0.032605494 0.03054026 0.02928032 0.02968501 0.03149942 0.03314164
## 7: 0.042195974 0.04341698 0.04417151 0.04145725 0.03836104 0.03776823
## 8: 0.032785103 0.03358091 0.03739476 0.04082979 0.04374959 0.04432131
## 9: 0.044731015 0.04121706 0.04009124 0.03962240 0.03900515 0.04099818
## 10: 0.045235922 0.04844981 NA NA NA NA
## Dec
## 1: -0.01172155
## 2: 0.04241783
## 3: 0.05182596
## 4: 0.02285520
## 5: 0.02399759
## 6: 0.03312838
## 7: 0.04090315
## 8: 0.04198894
## 9: 0.04579517
## 10: NA
Now we can touch it up a bit with the gt package.
gt(dt_tab)
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2011 | -0.044626301 | -0.044358166 | -0.041488056 | -0.037055110 | -0.033297061 | -0.029279712 | -0.02398674 | -0.01685799 | -0.01191744 | -0.01109453 | -0.01141400 | -0.01172155 |
2012 | -0.009463327 | -0.007026538 | -0.004857802 | -0.002297582 | 0.001727123 | 0.007268559 | 0.01405302 | 0.01969788 | 0.02615740 | 0.03565749 | 0.04115821 | 0.04241783 |
2013 | 0.044484003 | 0.050000433 | 0.053948845 | 0.055360465 | 0.056554423 | 0.058064410 | 0.05914242 | 0.05899473 | 0.05455433 | 0.04878094 | 0.04745076 | 0.05182596 |
2014 | 0.053463682 | 0.049736967 | 0.046012042 | 0.041707616 | 0.036186832 | 0.031178462 | 0.02682621 | 0.02406790 | 0.02385549 | 0.02417547 | 0.02486742 | 0.02285520 |
2015 | 0.019442867 | 0.015968104 | 0.013413606 | 0.013505313 | 0.015992844 | 0.018995642 | 0.02105085 | 0.02229313 | 0.02494160 | 0.02594623 | 0.02510597 | 0.02399759 |
2016 | 0.024926148 | 0.028235691 | 0.031062403 | 0.032352299 | 0.032912416 | 0.032605494 | 0.03054026 | 0.02928032 | 0.02968501 | 0.03149942 | 0.03314164 | 0.03312838 |
2017 | 0.032755180 | 0.033205562 | 0.035454685 | 0.040142128 | 0.042344583 | 0.042195974 | 0.04341698 | 0.04417151 | 0.04145725 | 0.03836104 | 0.03776823 | 0.04090315 |
2018 | 0.045999614 | 0.049474007 | 0.045810536 | 0.039358855 | 0.033843593 | 0.032785103 | 0.03358091 | 0.03739476 | 0.04082979 | 0.04374959 | 0.04432131 | 0.04198894 |
2019 | 0.038207896 | 0.036018656 | 0.036824415 | 0.039998350 | 0.043124212 | 0.044731015 | 0.04121706 | 0.04009124 | 0.03962240 | 0.03900515 | 0.04099818 | 0.04579517 |
2020 | 0.051758901 | 0.054343291 | 0.051047192 | 0.045608782 | 0.041815160 | 0.045235922 | 0.04844981 | NA | NA | NA | NA | NA |
We can use gt::fmt_percent
to format the percentages and tab_header
to add a title and subtitle.
mygt <-
gt(dt_tab) %>%
fmt_percent(
columns = vars(month.abb),
decimals = 1,
use_seps = FALSE
) %>%
tab_header(title="Virginia House Price Growth",
subtitle="12-month % change in Freddie Mac House Price Index")
mygt
Virginia House Price Growth | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
12-month % change in Freddie Mac House Price Index | ||||||||||||
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
2011 | −4.5% | −4.4% | −4.1% | −3.7% | −3.3% | −2.9% | −2.4% | −1.7% | −1.2% | −1.1% | −1.1% | −1.2% |
2012 | −0.9% | −0.7% | −0.5% | −0.2% | 0.2% | 0.7% | 1.4% | 2.0% | 2.6% | 3.6% | 4.1% | 4.2% |
2013 | 4.4% | 5.0% | 5.4% | 5.5% | 5.7% | 5.8% | 5.9% | 5.9% | 5.5% | 4.9% | 4.7% | 5.2% |
2014 | 5.3% | 5.0% | 4.6% | 4.2% | 3.6% | 3.1% | 2.7% | 2.4% | 2.4% | 2.4% | 2.5% | 2.3% |
2015 | 1.9% | 1.6% | 1.3% | 1.4% | 1.6% | 1.9% | 2.1% | 2.2% | 2.5% | 2.6% | 2.5% | 2.4% |
2016 | 2.5% | 2.8% | 3.1% | 3.2% | 3.3% | 3.3% | 3.1% | 2.9% | 3.0% | 3.1% | 3.3% | 3.3% |
2017 | 3.3% | 3.3% | 3.5% | 4.0% | 4.2% | 4.2% | 4.3% | 4.4% | 4.1% | 3.8% | 3.8% | 4.1% |
2018 | 4.6% | 4.9% | 4.6% | 3.9% | 3.4% | 3.3% | 3.4% | 3.7% | 4.1% | 4.4% | 4.4% | 4.2% |
2019 | 3.8% | 3.6% | 3.7% | 4.0% | 4.3% | 4.5% | 4.1% | 4.0% | 4.0% | 3.9% | 4.1% | 4.6% |
2020 | 5.2% | 5.4% | 5.1% | 4.6% | 4.2% | 4.5% | 4.8% | NA | NA | NA | NA | NA |
mygt %>% opt_row_striping()
Virginia House Price Growth | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
12-month % change in Freddie Mac House Price Index | ||||||||||||
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
2011 | −4.5% | −4.4% | −4.1% | −3.7% | −3.3% | −2.9% | −2.4% | −1.7% | −1.2% | −1.1% | −1.1% | −1.2% |
2012 | −0.9% | −0.7% | −0.5% | −0.2% | 0.2% | 0.7% | 1.4% | 2.0% | 2.6% | 3.6% | 4.1% | 4.2% |
2013 | 4.4% | 5.0% | 5.4% | 5.5% | 5.7% | 5.8% | 5.9% | 5.9% | 5.5% | 4.9% | 4.7% | 5.2% |
2014 | 5.3% | 5.0% | 4.6% | 4.2% | 3.6% | 3.1% | 2.7% | 2.4% | 2.4% | 2.4% | 2.5% | 2.3% |
2015 | 1.9% | 1.6% | 1.3% | 1.4% | 1.6% | 1.9% | 2.1% | 2.2% | 2.5% | 2.6% | 2.5% | 2.4% |
2016 | 2.5% | 2.8% | 3.1% | 3.2% | 3.3% | 3.3% | 3.1% | 2.9% | 3.0% | 3.1% | 3.3% | 3.3% |
2017 | 3.3% | 3.3% | 3.5% | 4.0% | 4.2% | 4.2% | 4.3% | 4.4% | 4.1% | 3.8% | 3.8% | 4.1% |
2018 | 4.6% | 4.9% | 4.6% | 3.9% | 3.4% | 3.3% | 3.4% | 3.7% | 4.1% | 4.4% | 4.4% | 4.2% |
2019 | 3.8% | 3.6% | 3.7% | 4.0% | 4.3% | 4.5% | 4.1% | 4.0% | 4.0% | 3.9% | 4.1% | 4.6% |
2020 | 5.2% | 5.4% | 5.1% | 4.6% | 4.2% | 4.5% | 4.8% | NA | NA | NA | NA | NA |
Will this win me a table competition (https://blog.rstudio.com/2020/09/15/announcing-the-2020-rstudio-table-contest/)? Probably not.
But it’s a useful little pattern.
Computing summary stastics by group with data.table
Another thing I find useful is to compute by group summary statistics like quantiles.
Let’s compute the mean, 25th, 50th (median), and 75th percentiles of annual HPA across states and metro areas in July of each year. July being the last month we have house price data in 2020.
First we’ll need a helper function.
my_summary = function(x){list(n=NROW(x),
mean=mean(x,na.rm=TRUE),
q25=unname(quantile(x,0.25,na.rm=TRUE)),
q50=unname(quantile(x,0.5,na.rm=TRUE)),
q75=unname(quantile(x,0.75,na.rm=TRUE))
)}
Now we can apply it using .SD notation with data.table (you’re welcome future me).
dt2 <-
dt[GEO_Type!="US" & Year>2010,
][,as.list(unlist(lapply(.SD,my_summary))),
.(GEO_Type,Year,Month),
.SDcols="hpa_yoy"
]
# take a look
dt2[Month==7]
## GEO_Type Year Month hpa_yoy.n hpa_yoy.mean hpa_yoy.q25 hpa_yoy.q50
## 1: State 2011 7 51 -0.03482701 -5.316122e-02 -0.029839886
## 2: State 2012 7 51 0.01830705 7.397581e-05 0.013366741
## 3: State 2013 7 51 0.07053574 3.668966e-02 0.052478057
## 4: State 2014 7 51 0.04148700 2.694736e-02 0.035764061
## 5: State 2015 7 51 0.04485992 2.795387e-02 0.040659438
## 6: State 2016 7 51 0.04761339 3.135579e-02 0.045991220
## 7: State 2017 7 51 0.05675597 4.176833e-02 0.056043898
## 8: State 2018 7 51 0.05634974 4.033939e-02 0.057227738
## 9: State 2019 7 51 0.04195195 3.497603e-02 0.041217061
## 10: State 2020 7 51 0.05152229 3.615695e-02 0.049935410
## 11: CBSA 2011 7 382 -0.03757777 -5.908591e-02 -0.031512591
## 12: CBSA 2012 7 382 0.01310780 -5.698265e-03 0.009515327
## 13: CBSA 2013 7 382 0.06462398 2.609328e-02 0.045835414
## 14: CBSA 2014 7 382 0.03592216 1.285296e-02 0.027820075
## 15: CBSA 2015 7 382 0.04182938 1.894401e-02 0.036154296
## 16: CBSA 2016 7 382 0.04616276 2.545555e-02 0.041371023
## 17: CBSA 2017 7 382 0.05430462 3.405826e-02 0.053600271
## 18: CBSA 2018 7 382 0.05742927 4.025364e-02 0.056538657
## 19: CBSA 2019 7 382 0.04494198 3.186373e-02 0.043979345
## 20: CBSA 2020 7 382 0.05005214 3.377371e-02 0.048743020
## hpa_yoy.q75
## 1: -0.01545427
## 2: 0.03129702
## 3: 0.09518173
## 4: 0.05176107
## 5: 0.05237482
## 6: 0.06242702
## 7: 0.06958177
## 8: 0.06643227
## 9: 0.04989990
## 10: 0.06552478
## 11: -0.01305736
## 12: 0.02752671
## 13: 0.08001134
## 14: 0.05244303
## 15: 0.05708949
## 16: 0.06520285
## 17: 0.07370181
## 18: 0.07256919
## 19: 0.05834908
## 20: 0.06624088
Now we’ll melt the columns to convert it to long format.
dt3 <-
melt(dt2, id.vars=c("GEO_Type","Year","Month"))
dt3[]
## GEO_Type Year Month variable value
## 1: State 2011 1 hpa_yoy.n 51.00000000
## 2: State 2011 2 hpa_yoy.n 51.00000000
## 3: State 2011 3 hpa_yoy.n 51.00000000
## 4: State 2011 4 hpa_yoy.n 51.00000000
## 5: State 2011 5 hpa_yoy.n 51.00000000
## ---
## 1146: CBSA 2020 3 hpa_yoy.q75 0.06599967
## 1147: CBSA 2020 4 hpa_yoy.q75 0.06335815
## 1148: CBSA 2020 5 hpa_yoy.q75 0.06084890
## 1149: CBSA 2020 6 hpa_yoy.q75 0.06381227
## 1150: CBSA 2020 7 hpa_yoy.q75 0.06624088
Now we’ve got our variables stored as variable. But we want to split them. There’s the handy data.table function tstrsplit()
to help us.
dt3[,c("var","stat") :=tstrsplit(variable,".",fixed=TRUE)]
dt3
## GEO_Type Year Month variable value var stat
## 1: State 2011 1 hpa_yoy.n 51.00000000 hpa_yoy n
## 2: State 2011 2 hpa_yoy.n 51.00000000 hpa_yoy n
## 3: State 2011 3 hpa_yoy.n 51.00000000 hpa_yoy n
## 4: State 2011 4 hpa_yoy.n 51.00000000 hpa_yoy n
## 5: State 2011 5 hpa_yoy.n 51.00000000 hpa_yoy n
## ---
## 1146: CBSA 2020 3 hpa_yoy.q75 0.06599967 hpa_yoy q75
## 1147: CBSA 2020 4 hpa_yoy.q75 0.06335815 hpa_yoy q75
## 1148: CBSA 2020 5 hpa_yoy.q75 0.06084890 hpa_yoy q75
## 1149: CBSA 2020 6 hpa_yoy.q75 0.06381227 hpa_yoy q75
## 1150: CBSA 2020 7 hpa_yoy.q75 0.06624088 hpa_yoy q75
dt4<- dcast(dt3[Month==7], formula=Year+GEO_Type~stat, value.var="value")
setcolorder(dt4, c("GEO_Type","Year","n","mean","q25","q50","q75"))
gt(dt4) %>%
fmt_percent(
columns = vars("mean","q25","q50","q75"),
decimals = 1,
use_seps = FALSE
) %>%
tab_header(title="State and Metro (CBSA) house price growth",
subtitle="12-month % change in Freddie Mac House Price Index (July)") %>%
opt_row_striping()
State and Metro (CBSA) house price growth | ||||||
---|---|---|---|---|---|---|
12-month % change in Freddie Mac House Price Index (July) | ||||||
GEO_Type | Year | n | mean | q25 | q50 | q75 |
CBSA | 2011 | 382 | −3.8% | −5.9% | −3.2% | −1.3% |
State | 2011 | 51 | −3.5% | −5.3% | −3.0% | −1.5% |
CBSA | 2012 | 382 | 1.3% | −0.6% | 1.0% | 2.8% |
State | 2012 | 51 | 1.8% | 0.0% | 1.3% | 3.1% |
CBSA | 2013 | 382 | 6.5% | 2.6% | 4.6% | 8.0% |
State | 2013 | 51 | 7.1% | 3.7% | 5.2% | 9.5% |
CBSA | 2014 | 382 | 3.6% | 1.3% | 2.8% | 5.2% |
State | 2014 | 51 | 4.1% | 2.7% | 3.6% | 5.2% |
CBSA | 2015 | 382 | 4.2% | 1.9% | 3.6% | 5.7% |
State | 2015 | 51 | 4.5% | 2.8% | 4.1% | 5.2% |
CBSA | 2016 | 382 | 4.6% | 2.5% | 4.1% | 6.5% |
State | 2016 | 51 | 4.8% | 3.1% | 4.6% | 6.2% |
CBSA | 2017 | 382 | 5.4% | 3.4% | 5.4% | 7.4% |
State | 2017 | 51 | 5.7% | 4.2% | 5.6% | 7.0% |
CBSA | 2018 | 382 | 5.7% | 4.0% | 5.7% | 7.3% |
State | 2018 | 51 | 5.6% | 4.0% | 5.7% | 6.6% |
CBSA | 2019 | 382 | 4.5% | 3.2% | 4.4% | 5.8% |
State | 2019 | 51 | 4.2% | 3.5% | 4.1% | 5.0% |
CBSA | 2020 | 382 | 5.0% | 3.4% | 4.9% | 6.6% |
State | 2020 | 51 | 5.2% | 3.6% | 5.0% | 6.6% |