2017/10/09

dataviz / data wrangling / R / mortgage / housing

WHAT TIME IS IT? TIME FOR TIBBLETIME! In this post, I’m going to take the tibbletime package out for a spin. Turns out this package is quite useful for things I tend to do.

We’ll use the tibbletime package to write some R code to extend our ongoing analysis of trends in the U.S. mortgage market (see here for example).

Davis Vaughan (on Twitter), one of the authors of the tibbletime package, suggested I take a look:

@lenkiefer if you get some time and are willing, I'd love for you to check out the tibbletime package. https://t.co/pVLCvctouz
— Davis Vaughan (@dvaughan32) October 6, 2017

Here’s the description of the package from CRAN:

Built on top of the ‘tibble’ package, ‘tibbletime’ is an extension that allows for the creation of time aware tibbles. Some immediate advantages of this include: the ability to perform time based subsetting on tibbles, quickly summarising and aggregating results by time periods, and calling functions similar in spirit to the map family from ‘purrr’ on time based tibbles.

Sounds good to me. Let’s try it out.

Getting some time series data

I’ve been exploring mortgage market data from the recently released Home Mortgage Disclosure Act (HMDA) data. Last time we looked at the 2016 data, but at the end of that post I alluded to some time series data that Federal Reserve researchers had made available. Let’s look at those data.

In the Federal Reserve Bulletin article by Neil Bhutta, Steven Laufer, and Daniel R. Ringo, there’s an online appendix with HMDA data by county and month 1994-2016 (link to .csv).

We’re going to use that data, but we’ll need to skip the first three rows because they have (useful) summary information.

#####################################################################################
## Load data.table and read in data (via Federal Reserve Webpage) ##
#####################################################################################
library(data.table)
df<-fread("https://www.federalreserve.gov/publications/files/HMDA_data_94-16.CSV",skip=3)
str(df)

## Classes 'data.table' and 'data.frame':   304152 obs. of  16 variables:
##  $ date_application  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ date_action       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ state             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ county            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ year              : int  1994 1994 1994 1994 1994 1994 1994 1994 1994 1994 ...
##  $ month             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ purchase_app      : int  5448 6576 7652 6578 6319 6024 5826 6671 5541 5499 ...
##  $ refi_app          : int  3812 4857 4103 2410 1973 1936 1724 2196 1723 1704 ...
##  $ purchase_orig     : int  3265 3950 4885 4404 4025 3809 3642 4210 3438 3424 ...
##  $ purchase_dol_orig : num  240 291 372 323 302 ...
##  $ refi_orig         : int  2932 3711 2901 1633 1301 1268 1030 1391 1102 1064 ...
##  $ refi_dol_orig     : num  214.8 261.6 200 105 81.7 ...
##  $ purchase_firstlien: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ purchase_ownerocc : num  0.915 0.915 0.915 0.915 0.915 ...
##  $ refi_firstlien    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ refi_ownerocc     : num  0.924 0.924 0.924 0.924 0.924 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The rows we skipped have some important information about these variables. You might want to go ahead and download the .csv file and take a look. Row three contains summary information about the variables. Let’s take a look:

var	description
date_application	data aggregated at the month of application if date_application=1
date_action	data aggregated at the month of action if date_action=1
state	state code (missing for county-level observations)
county	Federal Information Processing Standards (FIPS) county code (missing for state-level observations)
year	year
month	month
purchase_app	home-purchase applications
refi_app	refinance applications
purchase_orig	home-purchase originations
purchase_dol_orig	home-purchase originations, total dollars (millions)
refi_orig	refinance originations
refi_dol_orig	refinance originations, total dollars (millions)
purchase_firstlien	annual first-lien share of home-purchase loans (NA before 2004)
purchase_ownerocc	annual owner-occupied share of home-purchase loans
refi_firstlien	annual first-lien share of refinance loans (NA before 2004)
refi_ownerocc	annual owner-occupied share of refinance loans

There are year and month identifiers but no dates. Let’s create a date variable.

library(tidyverse)
df<- df[,date:=as.Date(ISOdate(year,month,1))]
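As a quick check on the date construction above, here is a minimal sketch on a few made-up year/month values (the toy vectors yr and mo are illustrative, not taken from the HMDA file). ISOdate() builds a date-time at noon GMT, and as.Date() drops the time, leaving the first day of each month.

```r
# Toy year/month vectors (hypothetical values, not HMDA data)
yr <- c(1994, 1994, 2016)
mo <- c(1, 12, 6)

# ISOdate() returns a POSIXct at 12:00 GMT; as.Date() keeps only the date part
d <- as.Date(ISOdate(yr, mo, 1))
print(d)
```

Using day 1 for every month is just a convention; any fixed day would work as long as it is used consistently.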

Now that we have a bunch of observations, let’s get a sense of what we’re looking at by plotting a time series of state-level purchase originations for Virginia. The data come with FIPS codes for states. There’s a nice file in the maps package that gives us names and abbreviations for state and county FIPS. We’ll use it to append state names onto the data.

We’ll also restrict ourselves to origination trends (rather than applications), so we’ll keep only the rows where date_action==1, that is, data aggregated by month of action.

library(maps)
df.state<-data.table(df)[date_action==1 & ! is.na(state),]
data(state.fips)
df.state<-left_join(df.state,select(state.fips,-polyname) %>% 
                      unique(), by=c("state"="fips")) %>% data.table()

# data(state.fips) seems to be missing Alaska (FIPS==2) and Hawaii (FIPS==15)
# so we'll add those back on
df.state[ , abb2:=ifelse(state==2, "AK",
                         ifelse(state==15,"HI",abb))]

Now that we have state abbreviations in our variable abb2 we can more naturally filter the data using state codes. Let’s look at purchase mortgage originations in Virginia.

library(scales)
ggplot(data=filter(df.state,abb2=="VA"), aes(x=date,y=purchase_orig))+
  theme_minimal()+
  geom_line(color="royalblue")+
  labs(x="",y="",title="Purchase Originations by Month in Virginia",
       caption="@lenkiefer Source: HMDA (as reported in 2017, not adjusted for coverage)\n'Residential Mortgage Lending in 2016: Evidence from the Home Mortgage Disclosure Act Data'\nFederal Reserve Bulletin (2017) by Neil Bhutta, Steven Laufer, and Daniel R. Ringo")+
  scale_y_continuous(labels=comma)+
  theme(plot.caption=element_text(hjust=0),
        plot.title=element_text(face="bold",size=18))

We can see a clear seasonal pattern in home purchase activity. It would be nice to smooth that out, say by taking a 12-month rolling average. Enter tibbletime.

Tibbletime makes computing a 12-month rolling average (by group using dplyr’s group_by) quite easy. This vignette has some useful information.

Let’s do it! (If you don’t yet have tibbletime run: install.packages("tibbletime"))

# convert df.state into a tibbletime object
library(tibbletime)
df.state <- as_tbl_time(df.state,index=date)

# Compute rolling 12-month mean
# The function to use at each step is `mean`.
# The window size is 12
rolling_mean <- rollify(mean, window = 12)

# compute rolling means by group

df.state %>% group_by(state) %>% mutate(purch_orig_m12=rolling_mean(purchase_orig)) %>%
  ungroup()->df.state
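To see what a trailing 12-month average computes, here is a base-R sketch on a made-up series. It is not how tibbletime implements rollify(); it is only a cross-check, assuming the window covers the current month and the 11 months before it, so the first 11 values are NA. The names x and m12 are illustrative.

```r
# Toy monthly series: 24 made-up values (not HMDA data)
x <- 1:24

# Trailing 12-month mean: equal weights of 1/12 over the current and the
# previous 11 observations (sides = 1 means the filter only looks backward)
m12 <- as.numeric(stats::filter(x, rep(1 / 12, 12), sides = 1))

# The first 11 entries are NA because a full 12-month window is not yet available
print(head(m12, 13))
```

The same thing shows up in the plots: the smoothed line starts 11 months after the monthly line.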

ggplot(data=filter(df.state,abb2=="VA"), aes(x=date,y=purch_orig_m12,linetype="Rolling 12-month Average"))+
  theme_minimal()+
  geom_line(color="royalblue",size=1.05)+
  # add original line as dotted line
  geom_line(color="royalblue", alpha=0.5,size=0.75,
            aes(y=purchase_orig,linetype="Monthly values"))+
  scale_linetype_manual(name="",values=c(2,1))+
  labs(x="",y="",title="Purchase Originations by Month in Virginia",
       caption="@lenkiefer Source: HMDA (as reported in 2017, not adjusted for coverage)\n'Residential Mortgage Lending in 2016: Evidence from the Home Mortgage Disclosure Act Data'\nFederal Reserve Bulletin (2017) by Neil Bhutta, Steven Laufer, and Daniel R. Ringo")+
  scale_y_continuous(labels=comma)+
  theme(plot.caption=element_text(hjust=0),legend.position="top",
        plot.title=element_text(face="bold",size=18))

Now we can exploit the convenience of the group_by function to create a geo_faceted small multiple version of this plot.

# load geofacet to arrange the facets in a map-like grid of states
library(geofacet)

ggplot(data=df.state, aes(x=date,y=purch_orig_m12,linetype="Rolling 12-month Average"))+
  geom_line(color="royalblue",size=1.05)+
  # add original line as dotted line
  geom_line(color="royalblue", alpha=0.5,size=0.75,
            aes(y=purchase_orig,linetype="Monthly values"))+
  scale_linetype_manual(name="",values=c(2,1))+
  labs(x="",y="",title="Purchase Originations by Month and State",
       subtitle="independent scale for each state",
       caption="@lenkiefer Source: HMDA (as reported in 2017, not adjusted for coverage)\n'Residential Mortgage Lending in 2016: Evidence from the Home Mortgage Disclosure Act Data'\nFederal Reserve Bulletin (2017) by Neil Bhutta, Steven Laufer, and Daniel R. Ringo")+
  scale_y_continuous(labels=comma)+
  scale_x_date(date_breaks="10 years",date_labels="%y")+
  theme(plot.caption=element_text(hjust=0),legend.position="top",
        axis.text=element_text(size=6),
        plot.title=element_text(face="bold",size=18))+
  facet_geo(~abb2,scales="free_y")

Summarizing data

Tibbletime also has some nice utilities for summarizing data. Let’s take our (noisy) monthly data and convert it to annual totals. We’ll compute total refinance originations by state and year.

df.state %>% group_by(abb2) %>% 
  time_summarize(period="y",
                 refi=sum(refi_orig)) %>% 
  ungroup() -> df.state.y

ggplot(data=df.state.y, aes(x=date,y=refi)) +
  geom_col(fill="forestgreen")+
  labs(x="",y="",title="Annual Refinance Originations by State",
       subtitle="independent scale for each state",
       caption="@lenkiefer Source: HMDA (as reported in 2017, not adjusted for coverage)\n'Residential Mortgage Lending in 2016: Evidence from the Home Mortgage Disclosure Act Data'\nFederal Reserve Bulletin (2017) by Neil Bhutta, Steven Laufer, and Daniel R. Ringo")+
  scale_y_continuous(labels=comma)+
  scale_x_date(date_breaks="10 years",date_labels="%y")+
  theme(plot.caption=element_text(hjust=0),legend.position="top",
        axis.text=element_text(size=6),
        plot.title=element_text(face="bold",size=18))+
  facet_geo(~abb2,scales="free_y")

Cool! That’s pretty easy. We can also use other functions inside the time_summarize function, like last() to get the last observation.
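For comparison, here is a rough sketch of the same kind of yearly summary written with plain dplyr on a made-up data frame. This is not the tibbletime API; toy, refi_total, and refi_last are hypothetical names. It shows sum() giving an annual total next to last() picking the December value.

```r
library(dplyr)

# Toy monthly data for two years (made-up values, not HMDA data)
toy <- data.frame(
  year  = rep(c(2015, 2016), each = 12),
  month = rep(1:12, times = 2),
  refi  = 1:24
)

# Sort by time first so that last() really is the final month of each year
toy_y <- toy %>%
  arrange(year, month) %>%
  group_by(year) %>%
  summarise(refi_total = sum(refi), refi_last = last(refi))

print(toy_y)
```

Note that last() depends on row order, which is why the sketch sorts by year and month before summarizing.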

Other things to do

We just took the tibbletime package out for a very brief spin, and I can already see it’s quite useful. There’s much more we can do with it, such as integrating purrr with tibbletime for some interesting results. In a follow-up post, I’ll share some other ideas I have.
