Analyzing mortgage data with R

TIME FOR ANOTHER DATA WRANGLING AND VISUALIZATION EXTRAVAGANZA. This time we are going to work hard to turn some big data into little data. That is, we’re going to work hard to aggregate several million loan level records into useful summary graphics to tell us about the U.S. mortgage market in 2016.

I’ve been working on a lot of different ways to visualize trends in the mortgage market (see here and here for examples). Let’s add some more visualizations and some R code.

The data

We’ll use the Home Mortgage Disclosure (HMDA) data.

These data provide the closest thing to a publicly-available comprehensive summary of U.S. mortgage market activity that we’ll get (for right now). The recently released data is for 2016 and provides a detailed view of mortgage market activity across the country.

We can load our data from a .csv file we get from U.S. Consumer Financial Protection Bureau’s website. Using their filters, I restricted my data to first-lien mortgage loans originated in 2016 for home purchase and refinance of 1-4 family properties. That should give you just over 7.5 million records. This link will take you to the CFPB webpage page with the filters I used. If you go to that page and click download (could take a while) you’ll end up with a file called hmda_lar.csv, which I have saved in a data directory.

Then, if you run this code you can prepare the data for our analysis.

Load libraries

#####################################################################################
## Load Libraries ##
#####################################################################################
library(tidyverse)
library(cowplot)
library(data.table)
library(ggridges)
library(viridis)
library(ggbeeswarm)
library(geofacet)
library(ggthemes)
library(scales)

Prepare the data

As these data are somewhat lage, we’ll use the data.table() package to wrangle our data.

#####################################################################################
## prepare data ##
#####################################################################################
# Read in csv
mydata<-fread("data/hmda_lar.csv")

# drop unused columns (keep the useful ones)
mydata<-mydata[,list(state_name,state_abbr,county_name,owner_occupancy_name,
                     loan_amount_000s,loan_purpose_name,loan_type_name, 
                     agency_abbr,respondent_id,population, 
                     applicant_income_000s,msamd_name)]

# create some numeric variables
mydata <- mydata[,":="(upb=as.numeric(loan_amount_000s),
                       inc=as.numeric(applicant_income_000s))]
mydata<- mydata[, ratio:=upb/inc]

# create conventional flag
mydata <- mydata[, conv:=ifelse(loan_type_name=="Conventional","Conventional","Nonconventional")]
# create state flag for large states 
mydata <- mydata[, st2:=ifelse(state_abbr %in% 
                       c("CA","TX","FL","NY","IL","CO",
                         "GA","NC","OH","WA","MI", "VA"), 
                     state_abbr,"other")][
                       ,.(upb=sum(upb),loans=.N), by=.(st2,loan_purpose_name)
                       ]

# create summary data

# state totals
sdata<- mydata[ # drop missing income
                .(loans=.N,
                   upb=sum(upb),
                   inc=sum(inc,na.rm=T),
                   medinc=median(inc,na.rm=T),
                   medupb=median(upb,na.rm=T),
                   medrat=median(ratio, na.rm=T)),
               by=state_abbr]

sdata2<-mydata[, # drop missing income, 
               .(loans=.N,
                   upb=sum(upb),
                   inc=sum(inc,na.rm=T),
                   medinc=median(inc,na.rm=T),
                   medupb=median(upb,na.rm=T),
                   medrat=median(ratio, na.rm=T)),
               by=.(state_abbr,loan_purpose_name)
               ]

Examine data

Let’s take a look at the data.

# Header of mydata (large panel of HMDA loan level data)
knitr::kable(head(mydata))
state_name state_abbr county_name owner_occupancy_name loan_amount_000s loan_purpose_name loan_type_name agency_abbr respondent_id population applicant_income_000s msamd_name upb inc ratio conv st2
Texas TX Collin County Owner-occupied as a principal dwelling 251 Home purchase Conventional HUD 0542409990 7724 110 Dallas, Plano, Irving - TX 251 110 2.281818 Conventional TX
California CA Los Angeles County Owner-occupied as a principal dwelling 542 Refinancing Conventional HUD 13-3654244 2397 85 Los Angeles, Long Beach, Glendale - CA 542 85 6.376471 Conventional CA
Georgia GA Forsyth County Owner-occupied as a principal dwelling 65 Refinancing Conventional CFPB 0000480228 2524 38 Atlanta, Sandy Springs, Roswell - GA 65 38 1.710526 Conventional GA
Florida FL Miami-Dade County Owner-occupied as a principal dwelling 340 Refinancing Conventional HUD 7197000003 6252 138 Miami, Miami Beach, Kendall - FL 340 138 2.463768 Conventional FL
Pennsylvania PA Philadelphia County Owner-occupied as a principal dwelling 221 Refinancing FHA-insured HUD 75-2921540 2326 Philadelphia - PA 221 NA NA Nonconventional other
New York NY Monroe County Owner-occupied as a principal dwelling 128 Home purchase Conventional HUD 16-1566654 4501 71 Rochester - NY 128 71 1.802817 Conventional NY

The data are read in as character variables, so I created some numeric versions and dropped some of the columns that we won’t be using here. The CFPB’s data file contains column headers and variable values that are pretty easy to work with.

We also created some summary data sets that will be useful later. Let’s take a look at one of them, sdata.

# state level summary)
knitr::kable(head(sdata[order(-loans)],5))
state_abbr loans upb inc medinc medupb medrat
CA 1079884 445349475 150196718 111 350 3.224138
TX 554140 122835110 60301853 93 187 2.156863
FL 466932 107319889 49193894 79 191 2.484663
IL 293000 67423713 33548109 91 187 2.120690
CO 246136 70200047 25976865 88 255 2.939024

These data show some useful summary state level data. The variable loans is the total number of first lien purchase and refinance loans in the 2016 HMDA data for that state. upb is the sum of all loan amounts (in $1000s), while inc is the sum of all borrower income (in $1000s). The variables medinc and medupb are the median borrower income and median loan amounts, respectively. The variable medrat is the median loan amount-to-income ratio across first-lien home purchase and refinance borrowers in 2016.

Little data

These are largish data, big for me, but small in comparison to some of the other datasets that are out there these days. We’re going to explore these data in detail. Let’s start with a Little Data display showing loan counts.

#####################################################################################
## Plot summary data ##
#####################################################################################
g1<-
  ggplot(data=mydata[ ,.(loans=.N),loan_purpose_name], 
       aes(y=loans,x=loan_purpose_name,fill=loan_purpose_name,
           label=paste0(comma(round(loans/1000000,1)),"M")))+
  geom_col()+
  geom_col(data=mydata[,.(loans=.N,
                          loan_purpose_name="All"),], fill="darkgray")+
  geom_text(data=mydata[,.(loans=.N,
                          loan_purpose_name="All"),],
            vjust=1,size=6,color="white")+
  
  geom_text(vjust=1,size=6,color="white")+
  
  scale_fill_manual(name="  ",values=c("darkgray","royalblue","forestgreen"))+
  theme_minimal()+
  labs(x="loan purpose",y="",
       title="U.S. mortgage loans in 2016",
       subtitle="first-lien home purchase and refinance loans on 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data (not adjusted for coverage)")+
  theme(plot.caption=element_text(hjust=0),
        legend.position="none",
        axis.text.y=element_blank(),
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))

g1

In 2016 HMDA data there were about 7.6 million first-lien home purchase and refinance loans on 1-4 unit properties. NOTE: these numbers are not adjusted for coverage. HMDA does not cover all of the U.S. mortgage market, but the coverage is likely quite high (90% or more) in recent years.

About 3.9 million were home purchase and about 3.7 million were refinance loans. In the U.S. mortgage market it’s common to distinguish between conventional and nonconventional loans. Nonconventional loans are loans insured by the Federal Housing Administration (FHA), guaranteed by the Department of Veterans Affairs (VA) and loans backed by the Farm Service Agency (FSA) or Rural Housing Service (RHS).

# g2
g2<-
  ggplot(data=mydata[ ,.(loans=.N), .(loan_purpose_name,conv)],
         aes(y=loans,x=loan_purpose_name,fill=loan_purpose_name,
             label=comma(round(loans/1000,0))))+
  geom_col()+
  geom_text(vjust=1,color="white")+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  theme_minimal()+
  labs(x="loan purpose",y="loans (in 1000s)",
       title="U.S. mortgage loans in 2016",
       subtitle="first-lien home purchase and refinance loans on 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data (not adjusted for coverage)")+
  theme(plot.caption=element_text(hjust=0),
        legend.position="none",
        axis.text.y=element_blank(),
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))+
  facet_wrap(~conv)
g2

We see here that while the total mortgage market had slightly more home purchase than refinance loans, conventional mortgages were skewed toward refiance loans.

Let’s look at these same data, but weighted by the size of loans. The chart below shows the aggregate dollar volume of home purchase and refinance loans brokend out by coventional and nonconventional loans.

g3<-
  ggplot(data=mydata[ ,.(upb=sum(upb)), .(loan_purpose_name,conv)],
         aes(y=upb,x=loan_purpose_name,fill=loan_purpose_name,
               label=comma(round(upb/1000000,0))))+
  geom_col()+
  geom_text(vjust=1,color="white")+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  theme_minimal()+
  labs(x="loan purpose",y="$ in billions",
       title="U.S. mortgage loans in 2016",
       subtitle="first-lien home purchase and refinance loans on 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data (not adjusted for coverage)")+
  theme(plot.caption=element_text(hjust=0),
        legend.position="none",
        axis.text.y=element_blank(),
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))+
  facet_wrap(~conv)
  
g3

By state

It’s interesting to consider how mortgage activity varies by geography.

Let’s look at a breakdown by state. Get ready for a loooong plot.

g4<-
  ggplot(data=sdata2[order(loans),],
         aes(x=reorder(state_abbr,loans),
             y=loans,fill=loan_purpose_name))+
  geom_col()+coord_flip()+theme_minimal()+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="loans",title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (not adjusted for coverage)",
       caption="@lenkiefer Source: @lenkiefer Source: FFIEC, CFPB HMDA data")+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  theme(plot.caption=element_text(hjust=0),
        legend.position="top",
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=14,face="bold",color="black"))
g4

Another way to look at state data is to use geo_facets. The plot below arranges the state data using geo faceting.

g5<-
  ggplot(data=sdata2[state_abbr !="PR"], aes(x=loan_purpose_name,y=loans,
                        fill=loan_purpose_name,label=round(loans/1000,0)))+
  geom_col(position="dodge",alpha=0.75)+
  coord_flip()+
  geom_text(hjust=0,size=2)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma,limits=c(0,9e5))+
  labs(x="",y="loans in thousands",title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (1000s, not adjusted for coverage)",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
  theme(legend.position="top",
        plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=16,face="bold",color="black",hjust=0),
        plot.caption=element_text(hjust=0))+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g5

The U.S. mortgage market is dominated by several large states, California in particular. Let’s just look at the top 12 states and group the smaller states into an other category. We’ll also use cowplot::plot_grid() to arrange the plots.

sdata4<-mydata[, st2:=ifelse(state_abbr %in% 
                               c("CA","TX","FL","NY","IL","CO",
                                 "GA","NC","OH","WA","MI", "VA"), 
                     state_abbr,"other")][
                       ,.(upb=sum(upb),loans=.N), by=.(st2,loan_purpose_name)
                     ]


g6a<-
  ggplot(data=sdata4[st2 !="other",], aes(x=loan_purpose_name,y=loans,
                        fill=paste(loan_purpose_name," "),label=round(loans/1000,0)))+
  geom_col(position="dodge",alpha=0.75)+
  coord_flip()+
  geom_text(hjust=0,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma,limit=c(0,9e5))+
  labs(x="",y="loans in thousands",title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (1000s, not adjusted for coverage)")+
  facet_wrap(~st2)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="bottom")


g6b<-
  ggplot(data=sdata4[st2 =="other",], aes(x=loan_purpose_name,y=loans,
                                        fill=loan_purpose_name,
                                        label=comma(round(loans/1000,0))))+
  geom_col(position="dodge",alpha=0.75)+
  coord_flip()+
  geom_text(hjust=0,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma,limit=c(0,2e6))+
  labs(x="",y="loans in thousands",
       title="Single family mortgage loans by state in 2016",
       subtitle="first-lien home purchase and refinance loans (1000s, not adjusted for coverage)")+
  theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")

# use cowplot::plot_grid to arrange plots
plot_grid(g6a,g6b+theme(legend.position="none")+labs(title="All other states")+
            labs(caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
            theme(plot.caption=element_text(hjust=0)),
          ncol=1,rel_heights=c(3,1))

Boxplots

Let’s try to move a litle beyond little data displays. Let’s use boxplots to look at the distribution of mortgage loan size by states.

mydata<-mydata[,st2:=ifelse(state_abbr %in% 
                               c("CA","TX","FL","NY","IL","CO",
                                 "GA","NC","OH","WA","MI", "VA"), 
                     state_abbr,"other")]

g7<-
  ggplot(data=mydata[upb !=9999,], aes(y=upb,
                                    x=forcats::fct_reorder(st2,upb),
                                    fill=upb))+
  geom_boxplot(outlier.alpha=0)+coord_flip(ylim=c(0,1.25e3))+
  labs(y="Mortgage loan amount ($1000s)",
       x="",
       title="Distribution of mortgage loan amounts in 2016",
       subtitle="first-lien home purchase and refinance loans for 1-4 unit properties",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
  theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
        plot.title=element_text(size=16,face="bold",color="black",hjust=0),
        plot.caption=element_text(hjust=0))

g7

g8<-
  ggplot(data=mydata[upb !=9999,], aes(y=upb,
                                    x=forcats::fct_reorder(st2,upb),
                                    fill=loan_purpose_name))+
  geom_boxplot(outlier.alpha=0)+coord_flip(ylim=c(0,1.2e3))+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  labs(y="Mortgage loan amount ($1000s)",
       x="") +labs(title="Distribution of mortgage loan amounts in 2016",
            subtitle="first-lien home purchase and refinance loans for 1-4 unit properties",
            caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
            theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
                  legend.position="top",
                  plot.title=element_text(size=16,face="bold",color="black",hjust=0),
                  plot.caption=element_text(hjust=0))
g8

Maybe the boxplots miss something

The boxplots are great, and in many cases work best. See for example this thread based on a question I asked on Twitter:

The example distributions in my tweet-mortgage loan amount-to-income ratios-were pretty smooth. In that case the consensus seemed to favor the boxplot. But what if the distributions aren’t so smooth? It’s not hard to find such an example.

g.ridge<-
    ggplot(data=mydata[upb !=9999 & upb<2e3,],  # leave off very large loan balances 
           aes(x=upb,
               y=forcats::fct_reorder(st2,upb),
               fill=log(..x..)))+
  geom_density_ridges_gradient(rel_min_height=.005)+
  scale_fill_viridis()+
  coord_cartesian(xlim=c(0,1.2e3))+
  labs(x="Mortgage loan amount ($1000s)",
       y="") +labs(title="Distribution of mortgage loan amounts in 2016",
            subtitle="first-lien home purchase and refinance loans for 1-4 unit properties")+
            theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
                  legend.position="none",
                  plot.title=element_text(size=16,face="bold",color="black",hjust=0),
                  plot.caption=element_text(hjust=0))


plot_grid(g.ridge+labs(title="Distribution of mortgage loan amounts ($1000s) in 2016",
                       subtitle="home purchase and refinance loans for 1-4 unit properties")+
            theme(plot.subtitle=element_text(size=14,face="italic",color="darkgray"),
                  plot.title=element_text(size=16,face="bold",color="black",hjust=0)),
          g7+labs(title="",subtitle="",caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
            theme(plot.caption=element_text(hjust=0)),
          ncol=1)

In this case the density plot picks up on an important nonlinearity, the clustering of loan amounts around the conforming loan limits. Whether or not a boxplot is sufficient to summarize your data is going to depend. But explore your data, a datasaurus may lurk within.

More geo facets

Let’s try out some more geofaceted plots.

g10<-
  ggplot(data=sdata2[state_abbr !="PR"], 
         aes(x=loan_purpose_name,y=medupb,label=round(medupb,0),
                        fill=loan_purpose_name))+
  geom_col(position="dodge",alpha=0.75)+
  geom_text(vjust=1,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="median upb ($1000s)",title="Median loan size ($1000s) by state in 2016",
       subtitle="first-lien home purchase and refinance loans",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g10

This plot shows the median loan size by state and loan purpose. Loans tend to be larger on the coasts where home prices also tend to be higher.

We can also look at how the median income for borrowers varies by state.

g11<-
  ggplot(data=sdata2, 
       aes(x=loan_purpose_name,y=medinc,label=round(medinc,0),
                        fill=loan_purpose_name))+
  geom_col(position="dodge",alpha=0.75)+
  geom_text(vjust=1,size=2.5)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="median income ($1000s)",title="Median income ($1000s) by state in 2016",
       subtitle="first-lien home purchase and refinance borrowers",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data" )+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g11

This shows a similar, but less pronounced pattern than the loan amount plot. What if we compared the income of borrowers to their loan size?

The next plot shows the median loan amount-to-income ratio by state and loan purpose.

g12<-
  ggplot(data=sdata2,
               aes(x=loan_purpose_name,y=medrat,label=round(medrat,1),
                        fill=loan_purpose_name))+
  geom_col(position="dodge",alpha=0.75)+#coord_flip()+
  geom_text(vjust=1,size=2)+
  scale_fill_manual(name="  ",values=c("royalblue","forestgreen"))+
  scale_y_continuous(expand=c(0,0),labels=comma)+
  labs(x="",y="loan size-to-income ratio",
       title="Median loan size-to-income ratio by state in 2016",
       subtitle="first-lien home purchase and refinance loans",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
  facet_geo(~state_abbr)+theme_minimal()+
  theme(axis.text=element_blank(),
        legend.position="top")
g12

Two clear patterns emerge here. First, home purchase loans tend to be a larger percent of borrower income than refinance loans. We also see a tendency for coastal areas to have higher loan amount-to-income ratios. So despite higher average incomes in higher cost markets, borrowers tend to borrow more as a percent of income in those markets. A scatterplot might express our point.

g13<-
  ggplot(data=sdata2[state_abbr !="PR"],
       aes(y=medupb,x=medinc,label=state_abbr,
           color=loan_purpose_name))+
  geom_text(size=4)+
  stat_smooth(fill=NA,method="lm",alpha=0.5,linetype=2,size=0.75)+
  scale_color_manual(name="  ",values=c("royalblue","forestgreen"))+
  labs(x="median income ($1000)",
       y="median loan amount ($1000)",
       title="Median loan size-to-income ratio by state in 2016",
       subtitle="first-lien home purchase and refinance loans",
       caption="@lenkiefer Source: FFIEC, CFPB HMDA data")+
  facet_wrap(~loan_purpose_name)+theme_minimal()+
  theme(legend.position="none",plot.caption=element_text(hjust=0))
g13

More?

Oh yeah, there’s a bunch more. But that’s enough for today.

We can explore these data in greater detail. There’s much finer geographic detail in these data (down to the county and even Census Tract). Also, in a separate file that Federal Reserve researchers released we have an aggregated state/county time series going back to 1994. I’ve been exploring those data too, and will have more to say about them in a future post. Check back later.

We gotta eventually get to this business:

 Share!