2017/10/08

mortgage / housing / dataviz / R / data wrangling

TIME FOR ANOTHER DATA WRANGLING AND VISUALIZATION EXTRAVAGANZA. This time we are going to work hard to turn some big data into little data. That is, we’re going to work hard to aggregate several million loan level records into useful summary graphics to tell us about the U.S. mortgage market in 2016.

I’ve been working on a lot of different ways to visualize trends in the mortgage market (see here and here for examples). Let’s add some more visualizations and some R code.

The data

We’ll use the Home Mortgage Disclosure (HMDA) data.

These data provide the closest thing to a publicly-available comprehensive summary of U.S. mortgage market activity that we’ll get (for right now). The recently released data is for 2016 and provides a detailed view of mortgage market activity across the country.

We can load our data from a .csv file we get from U.S. Consumer Financial Protection Bureau’s website. Using their filters, I restricted my data to first-lien mortgage loans originated in 2016 for home purchase and refinance of 1-4 family properties. That should give you just over 7.5 million records. This link will take you to the CFPB webpage page with the filters I used. If you go to that page and click download (could take a while) you’ll end up with a file called hmda_lar.csv, which I have saved in a data directory.

Then, if you run this code you can prepare the data for our analysis.

Load libraries

#####################################################################################
## Load Libraries ##
#####################################################################################
library(tidyverse)
library(cowplot)
library(data.table)
library(ggridges)
library(viridis)
library(ggbeeswarm)
library(geofacet)
library(ggthemes)
library(scales)

Prepare the data

As these data are somewhat lage, we’ll use the data.table() package to wrangle our data.

#####################################################################################
## prepare data ##
#####################################################################################
# Read in csv
mydata<-fread("data/hmda_lar.csv")

# drop unused columns (keep the useful ones)
mydata<-mydata[,list(state_name,state_abbr,county_name,owner_occupancy_name,
                     loan_amount_000s,loan_purpose_name,loan_type_name, 
                     agency_abbr,respondent_id,population, 
                     applicant_income_000s,msamd_name)]

# create some numeric variables
mydata <- mydata[,":="(upb=as.numeric(loan_amount_000s),
                       inc=as.numeric(applicant_income_000s))]
mydata<- mydata[, ratio:=upb/inc]

# create conventional flag
mydata <- mydata[, conv:=ifelse(loan_type_name=="Conventional","Conventional","Nonconventional")]
# create state flag for large states 
mydata <- mydata[, st2:=ifelse(state_abbr %in% 
                       c("CA","TX","FL","NY","IL","CO",
                         "GA","NC","OH","WA","MI", "VA"), 
                     state_abbr,"other")][
                       ,.(upb=sum(upb),loans=.N), by=.(st2,loan_purpose_name)
                       ]

# create summary data

# state totals
sdata<- mydata[ # drop missing income
                .(loans=.N,
                   upb=sum(upb),
                   inc=sum(inc,na.rm=T),
                   medinc=median(inc,na.rm=T),
                   medupb=median(upb,na.rm=T),
                   medrat=median(ratio, na.rm=T)),
               by=state_abbr]

sdata2<-mydata[, # drop missing income, 
               .(loans=.N,
                   upb=sum(upb),
                   inc=sum(inc,na.rm=T),
                   medinc=median(inc,na.rm=T),
                   medupb=median(upb,na.rm=T),
                   medrat=median(ratio, na.rm=T)),
               by=.(state_abbr,loan_purpose_name)
               ]

Examine data

Let’s take a look at the data.

# Header of mydata (large panel of HMDA loan level data)
knitr::kable(head(mydata))

state_name	state_abbr	county_name	owner_occupancy_name	loan_amount_000s	loan_purpose_name	loan_type_name	agency_abbr	respondent_id	population	applicant_income_000s	msamd_name	upb	inc	ratio	conv	st2
Texas	TX	Collin County	Owner-occupied as a principal dwelling	251	Home purchase	Conventional	HUD	0542409990	7724	110	Dallas, Plano, Irving - TX	251	110	2.281818	Conventional	TX
California	CA	Los Angeles County	Owner-occupied as a principal dwelling	542	Refinancing	Conventional	HUD	13-3654244	2397	85	Los Angeles, Long Beach, Glendale - CA	542	85	6.376471	Conventional	CA
Georgia	GA	Forsyth County	Owner-occupied as a principal dwelling	65	Refinancing	Conventional	CFPB	0000480228	2524	38	Atlanta, Sandy Springs, Roswell - GA	65	38	1.710526	Conventional	GA
Florida	FL	Miami-Dade County	Owner-occupied as a principal dwelling	340	Refinancing	Conventional	HUD	7197000003	6252	138	Miami, Miami Beach, Kendall - FL	340	138	2.463768	Conventional	FL
Pennsylvania	PA	Philadelphia County	Owner-occupied as a principal dwelling	221	Refinancing	FHA-insured	HUD	75-2921540	2326		Philadelphia - PA	221	NA	NA	Nonconventional	other
New York	NY	Monroe County	Owner-occupied as a principal dwelling	128	Home purchase	Conventional	HUD	16-1566654	4501	71	Rochester - NY	128	71	1.802817	Conventional	NY

The data are read in as character variables, so I created some numeric versions and dropped some of the columns that we won’t be using here. The CFPB’s data file contains column headers and variable values that are pretty easy to work with.

We also created some summary data sets that will be useful later. Let’s take a look at one of them, sdata.

# state level summary)
knitr::kable(head(sdata[order(-loans)],5))

state_abbr	loans	upb	inc	medinc	medupb	medrat
CA	1079884	445349475	150196718	111	350	3.224138
TX	554140	122835110	60301853	93	187	2.156863
FL	466932	107319889	49193894	79	191	2.484663
IL	293000	67423713	33548109	91	187	2.120690
CO	246136	70200047	25976865	88	255	2.939024

These data show some useful summary state level data. The variable loans is the total number of first lien purchase and refinance loans in the 2016 HMDA data for that state. upb is the sum of all loan amounts (in $1000s), while inc is the sum of all borrower income (in $1000s). The variables medinc and medupb are the median borrower income and median loan amounts, respectively. The variable medrat is the median loan amount-to-income ratio across first-lien home purchase and refinance borrowers in 2016.

Little data

These are largish data, big for me, but small in comparison to some of the other datasets that are out there these days. We’re going to explore these data in detail. Let’s start with a Little Data display showing loan counts.

#####################################################################################
## Plot summary data ##
#####################################################################################
g1<-
  ggplot(data=mydata[ ,.(loans=.N),loan_purpose_name], 
       aes(y=loans,x=loan_purpose_name,fill=loan_purpose_name,
           label=paste0(comma(round(loans/1000000,1)),"M")))+
  geom_col()+
  geom_col(data=mydata[,.(loans=.N,
                          loan_purpose_name="All"),], fill="darkgray")+
  geom_text(data=mydata[,.(loans=.N,
                          loan_purpose_name="All"),],
            vjust=1,size=6,color="white")+
  
  geom_text(vjust=1,size=6,color="white")+
  
  scale_fill_manual(name="  ",values=c("darkgray","royalblue","forestgreen"))+
  theme_minimal()+
  labs(x="loan purpose",y="",
       title="U.S. mortgage loans in 2016",
       subtitle="first-lien home purchase and refinance loans on 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data (not adjusted for coverage)")+
  theme(plot.caption=element_text(hjust=0),
        legend.position="none",
        axis.text.y=element_blank(),
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))

g1

In 2016 HMDA data there were about 7.6 million first-lien home purchase and refinance loans on 1-4 unit properties. NOTE: these numbers are not adjusted for coverage. HMDA does not cover all of the U.S. mortgage market, but the coverage is likely quite high (90% or more) in recent years.

About 3.9 million were home purchase and about 3.7 million were refinance loans. In the U.S. mortgage market it’s common to distinguish between conventional and nonconventional loans. Nonconventional loans are loans insured by the Federal Housing Administration (FHA), guaranteed by the Department of Veterans Affairs (VA) and loans backed by the Farm Service Agency (FSA) or Rural Housing Service (RHS).

# g2
g2<-
  ggplot(data=mydata[ ,.(loans=.N), .(loan_purpose_name,conv)],
         aes(y=loans,x=loan_purpose_name,fill=loan_purpose_name,
             label=comma(round(loans/1000,0))))+
  geom_col()+
  geom_text(vjust=1,color="white")+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  theme_minimal()+
  labs(x="loan purpose",y="loans (in 1000s)",
       title="U.S. mortgage loans in 2016",
       subtitle="first-lien home purchase and refinance loans on 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data (not adjusted for coverage)")+
  theme(plot.caption=element_text(hjust=0),
        legend.position="none",
        axis.text.y=element_blank(),
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))+
  facet_wrap(~conv)
g2

We see here that while the total mortgage market had slightly more home purchase than refinance loans, conventional mortgages were skewed toward refiance loans.

Let’s look at these same data, but weighted by the size of loans. The chart below shows the aggregate dollar volume of home purchase and refinance loans brokend out by coventional and nonconventional loans.

g3<-
  ggplot(data=mydata[ ,.(upb=sum(upb)), .(loan_purpose_name,conv)],
         aes(y=upb,x=loan_purpose_name,fill=loan_purpose_name,
               label=comma(round(upb/1000000,0))))+
  geom_col()+
  geom_text(vjust=1,color="white")+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  theme_minimal()+
  labs(x="loan purpose",y="$ in billions",
       title="U.S. mortgage loans in 2016",
       subtitle="first-lien home purchase and refinance loans on 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data (not adjusted for coverage)")+
  theme(plot.caption=element_text(hjust=0),
        legend.position="none",
        axis.text.y=element_blank(),
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))+
  facet_wrap(~conv)
  
g3

By state

It’s interesting to consider how mortgage activity varies by geography.

Let’s look at a breakdown by state. Get ready for a loooong plot.

g4<-
  ggplot(data=sdata2[order(loans),],
         aes(x=reorder(state_abbr,loans),
             y=loans,fill=loan_purpose_name))+
  geom_col()+coord_flip()+theme_minimal()+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="loans",title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (not adjusted for coverage)",
       caption="@lenkiefer Source: @lenkiefer Source: FFIEC, CFPB HMDA data")+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  theme(plot.caption=element_text(hjust=0),
        legend.position="top",
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))
g4

Another way to look at state data is to use geo_facets. The plot below arranges the state data using geo faceting.

g5<-
  ggplot(data=sdata2[state_abbr !="PR"], aes(x=loan_purpose_name,y=loans,
                        fill=loan_purpose_name,label=round(loans/1000,0)))+
  geom_col(position="dodge",alpha=0.75)+
  coord_flip()+
  geom_text(hjust=0,size=2)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma,limits=c(0,9e5))+
  labs(x="",y="loans in thousands",title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (1000s, not adjusted for coverage)",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
  theme(legend.position="top",
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=16,face="bold",color="black",hjust=0),
        plot.caption=element_text(hjust=0))+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g5

The U.S. mortgage market is dominated by several large states, California in particular. Let’s just look at the top 12 states and group the smaller states into an other category. We’ll also use cowplot::plot_grid() to arrange the plots.

sdata4<-mydata[, st2:=ifelse(state_abbr %in% 
                               c("CA","TX","FL","NY","IL","CO",
                                 "GA","NC","OH","WA","MI", "VA"), 
                     state_abbr,"other")][
                       ,.(upb=sum(upb),loans=.N), by=.(st2,loan_purpose_name)
                     ]


g6a<-
  ggplot(data=sdata4[st2 !="other",], aes(x=loan_purpose_name,y=loans,
                        fill=paste(loan_purpose_name," "),label=round(loans/1000,0)))+
  geom_col(position="dodge",alpha=0.75)+
  coord_flip()+
  geom_text(hjust=0,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma,limit=c(0,9e5))+
  labs(x="",y="loans in thousands",title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (1000s, not adjusted for coverage)")+
  facet_wrap(~st2)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="bottom")


g6b<-
  ggplot(data=sdata4[st2 =="other",], aes(x=loan_purpose_name,y=loans,
                                        fill=loan_purpose_name,
                                        label=comma(round(loans/1000,0))))+
  geom_col(position="dodge",alpha=0.75)+
  coord_flip()+
  geom_text(hjust=0,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma,limit=c(0,2e6))+
  labs(x="",y="loans in thousands",
       title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (1000s, not adjusted for coverage)")+
  theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")

# use cowplot::plot_grid to arrange plots
plot_grid(g6a,g6b+theme(legend.position="none")+labs(title="All other states")+
            labs(caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
            theme(plot.caption=element_text(hjust=0)),
          ncol=1,rel_heights=c(3,1))

Boxplots

Let’s try to move a litle beyond little data displays. Let’s use boxplots to look at the distribution of mortgage loan size by states.

mydata<-mydata[,st2:=ifelse(state_abbr %in% 
                               c("CA","TX","FL","NY","IL","CO",
                                 "GA","NC","OH","WA","MI", "VA"), 
                     state_abbr,"other")]

g7<-
  ggplot(data=mydata[upb !=9999,], aes(y=upb,
                                    x=forcats::fct_reorder(st2,upb),
                                    fill=upb))+
  geom_boxplot(outlier.alpha=0)+coord_flip(ylim=c(0,1.25e3))+
  labs(y="Mortgage loan amount ($1000s)",
       x="",
       title="Distribution of mortgage loan amounts in 2016",
       subtitle="first-lien home purchase and refinance loans for 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
  theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=16,face="bold",color="black",hjust=0),
        plot.caption=element_text(hjust=0))

g7

g8<-
  ggplot(data=mydata[upb !=9999,], aes(y=upb,
                                    x=forcats::fct_reorder(st2,upb),
                                    fill=loan_purpose_name))+
  geom_boxplot(outlier.alpha=0)+coord_flip(ylim=c(0,1.2e3))+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  labs(y="Mortgage loan amount ($1000s)",
       x="") +labs(title="Distribution of mortgage loan amounts in 2016",
            subtitle="first-lien home purchase and refinance loans for 1-4 unit properties",
            caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
            theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
                  legend.position="top",
                  plot.title=element_text(size=16,face="bold",color="black",hjust=0),
                  plot.caption=element_text(hjust=0))
g8

Maybe the boxplots miss something

The boxplots are great, and in many cases work best. See for example this thread based on a question I asked on Twitter:

here seems boxplots are the better #dataviz.
What do you think? pic.twitter.com/X6xNZnzUHk
— 📈 Len Kiefer 📊 (@lenkiefer) October 8, 2017

The example distributions in my tweet-mortgage loan amount-to-income ratios-were pretty smooth. In that case the consensus seemed to favor the boxplot. But what if the distributions aren’t so smooth? It’s not hard to find such an example.

g.ridge<-
    ggplot(data=mydata[upb !=9999 & upb<2e3,],  # leave off very large loan balances 
           aes(x=upb,
               y=forcats::fct_reorder(st2,upb),
               fill=log(..x..)))+
  geom_density_ridges_gradient(rel_min_height=.005)+
  scale_fill_viridis()+
  coord_cartesian(xlim=c(0,1.2e3))+
  labs(x="Mortgage loan amount ($1000s)",
       y="") +labs(title="Distribution of mortgage loan amounts in 2016",
            subtitle="first-lien home purchase and refinance loans for 1-4 unit properties")+
            theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
                  legend.position="none",
                  plot.title=element_text(size=16,face="bold",color="black",hjust=0),
                  plot.caption=element_text(hjust=0))


plot_grid(g.ridge+labs(title="Distribution of mortgage loan amounts ($1000s) in 2016",
                       subtitle="home purchase and refinance loans for 1-4 unit properties")+
            theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
                  plot.title=element_text(size=16,face="bold",color="black",hjust=0)),
          g7+labs(title="",subtitle="",caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
            theme(plot.caption=element_text(hjust=0)),
          ncol=1)

In this case the density plot picks up on an important nonlinearity, the clustering of loan amounts around the conforming loan limits. Whether or not a boxplot is sufficient to summarize your data is going to depend. But explore your data, a datasaurus may lurk within.

More geo facets

Let’s try out some more geofaceted plots.

g10<-
  ggplot(data=sdata2[state_abbr !="PR"], 
         aes(x=loan_purpose_name,y=medupb,label=round(medupb,0),
                        fill=loan_purpose_name))+
  geom_col(position="dodge",alpha=0.75)+
  geom_text(vjust=1,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="median upb ($1000s)",title="Median loan size ($1000s) by state in 2016",
       subtitle="first-lien home purchase and refinance loans",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g10

This plot shows the median loan size by state and loan purpose. Loans tend to be larger on the coasts where home prices also tend to be higher.

We can also look at how the median income for borrowers varies by state.

g11<-
  ggplot(data=sdata2, 
       aes(x=loan_purpose_name,y=medinc,label=round(medinc,0),
                        fill=loan_purpose_name))+
  geom_col(position="dodge",alpha=0.75)+
  geom_text(vjust=1,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="median income ($1000s)",title="Median income ($1000s) by state in 2016",
       subtitle="first-lien home purchase and refinance borrowers",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g11

This shows a similar, but less pronounced pattern than the loan amount plot. What if we compared the income of borrowers to their loan size?

The next plot shows the median loan amount-to-income ratio by state and loan purpose.

g12<-
  ggplot(data=sdata2,
               aes(x=loan_purpose_name,y=medrat,label=round(medrat,1),
                        fill=loan_purpose_name))+
  geom_col(position="dodge",alpha=0.75)+#coord_flip()+
  geom_text(vjust=1,size=2)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="loan size-to-income ratio",
       title="Median loan size-to-income ratio by state in 2016",
       subtitle="first-lien home purchase and refinance loans",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g12

Two clear patterns emerge here. First, home purchase loans tend to be a larger percent of borrower income than refinance loans. We also see a tendency for coastal areas to have higher loan amount-to-income ratios. So despite higher average incomes in higher cost markets, borrowers tend to borrow more as a percent of income in those markets. A scatterplot might express our point.

g13<-
  ggplot(data=sdata2[state_abbr !="PR"],
       aes(y=medupb,x=medinc,label=state_abbr,
           color=loan_purpose_name))+
  geom_text(size=4)+
  stat_smooth(fill=NA,method="lm",alpha=0.5,linetype=2,size=0.75)+
  scale_color_manual(name="  ",values=c("royalblue","forestgreen"))+
  labs(x="median income ($1000)",
       y="median loan amount ($1000)",
       title="Median loan size-to-income ratio by state in 2016",
       subtitle="first-lien home purchase and refinance loans",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
  facet_wrap(~loan_purpose_name)+theme_minimal()+
  theme(legend.position="none",plot.caption=element_text(hjust=0))
g13

More?

Oh yeah, there’s a bunch more. But that’s enough for today.

We can explore these data in greater detail. There’s much finer geographic detail in these data (down to the county and even Census Tract). Also, in a separate file that Federal Reserve researchers released we have an aggregated state/county time series going back to 1994. I’ve been exploring those data too, and will have more to say about them in a future post. Check back later.

We gotta eventually get to this business:

Analyzing mortgage data with R