Visual meditations on house prices, Part 1: data wrangling


THIS POST CONSIDERS recent trends in house prices. Thinking a great deal about house prices, I like to look at many different DATA VISUALIZATIONS of house prices, at the national, state, and metro levels of aggregation. Also, in this post, I will include descriptions of how I built my visualizations, including code. These meditations as I call them, required some data wrangling and some thinking as to how to construct the charts.

As the post was getting long, I decided to split it up. In this post, I’ll handle data wrangling. We’ll prepare the data for analysis both in Excel and R. In following post(s), we’ll get into the details of creating the data visualizations. When we’re through we’ll build something like this:


Let us begin by gathering data.

The data

We’re going to be using the Freddie Mac House Price Index (FMHPI), which is available to the public on Freddie Mac’s webpage. This index is available at monthly frequencies for the nation, all 50 states plus the District of Columbia, and over 300 metro areas (CBSA). The national index is available both seasonally-adjusted (SA), and non-seasonally adjusted (NSA), while the state and metro indices are only available NSA.

Disclosure my team at Freddie Mac helped assemble the Tableau data visualization on the Freddie Mac webpage. This blog posts represents my own view and does not necessarily represent the views of Freddie Mac. We’re using the FMHPI because it’s convenient to work with, has broad coverage, and is publicly available.

FMHPI webpage:


We’ll need to navigate to the archive page to download the data we want. The data are in two excel spreadsheets : one for the national and state indices and : one for the metro indices. We’ll need them both for what follows.

FMHPI data download:


There are R packages that can manipulate data in .xls or .xlsx file formats. I’m not planning on getting proficient at that, instead I’m going to open up my excel bag of tricks to handle some data preparation. First, let’s consider what we get when we download those house price spreadsheets.

State file layout


Metro file layout


Comments on file layouts

Now these are not tidy data. Fortunately, the structure can be pretty quickly wrangled within excel to make our R life easier. My strategy is to employ some Excel tricks to tidy this data. I’ve used the gather tools in the tidyr pacakge and they do work well, but as I’m going to write out a .txt file anyway so I can avoid using a .xls file I might as well use my excel trick to tidy these data.

An excel trick to tidy data

Our goals is to convert the files to a tidy dataset.


And we’re going to do it in about 10 seconds:


This is one of my favorite Excel tricks. You could of course, achieve this just as fast using the gather function in tidyr. But in case you are an Excel user, you can see the details of what I did in the this youtube video.

Because the metro data is laid out in two worksheets, you’ll have to do a little more leg work to get the metro data laid out (and you might have convert the files to .xlsx to handle the extra rows). Also, I’m going to convert the date field, which isn’t a date in the Excel file to a year, month, and date column. Ultimately, we’ll want a file that looks like this:


and this for the metros:


For the metros I included variables called “states” and “state1”, which were created by parsing the metro name. The variable “states” contains a list of all states that the metro area includes, while “state1” includes the primary state (first one listed by OMB). For example, the New York City metro area (New York-Newark-Jersey City, NY-NJ-PA) includes parts of New York, New Jersey, and Pennsylvania, but the primary state is New York. This field could be helpful if we want to group the metros.

Then I save these files as .txt files to read into R. You can download the .txt files here (note I only included the NSA national index)

  1. state and national called fmhpi.txt
  2. metro called fmhpi2.txt
#Load some packages
library(data.table, warn.conflicts = FALSE, quietly=TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly=TRUE)
library(dplyr, warn.conflicts = FALSE, quietly=TRUE)
## Warning: package 'dplyr' was built under R version 3.4.2
library(zoo, warn.conflicts = FALSE, quietly=TRUE)
library(ggrepel, warn.conflicts = FALSE, quietly=TRUE)
## Warning: package 'ggrepel' was built under R version 3.4.2
library(ggthemes, warn.conflicts = FALSE, quietly=TRUE)
library(scales, warn.conflicts = FALSE, quietly=TRUE)
library(animation, warn.conflicts = FALSE, quietly=TRUE)
library(grid, warn.conflicts = FALSE, quietly=TRUE)
library(tidyr, warn.conflicts = FALSE, quietly=TRUE)
## Warning: package 'tidyr' was built under R version 3.4.2

Now we’ll need to load and prepare some data. We’ll do this using the data.table package to set up our data.

#load national & state data
statedata <- fread("data/fmhpi.txt")
statedata$date<-as.Date(statedata$date, format="%m/%d/%Y")

#Now uses some data table caclulations to compute percent changes in house prices by state/metro

#create lags of state
statedata<-statedata[, hpi12 :=  shift(hpi,12), by=state]

#compute rolling min/max
statedata<-statedata[, hpi12min:=rollapply(hpi, 12, min,fill=NA, na.rm=FALSE,align='right'), by=state]
statedata<-statedata[, hpi12max:=rollapply(hpi, 12, max,fill=NA, na.rm=FALSE,align='right'), by=state]

#Do the same for metro data:

#load metro data:
metrodata <- fread("data/fmhpi2.txt")
metrodata$date<-as.Date(metrodata$date, format="%m/%d/%Y")


#create lags of metros
metrodata<-metrodata[, hpi12 :=  shift(hpi,12), by=metro]

#compute 12-month rolling min/max
metrodata<-metrodata[, hpi12min:=rollapply(hpi, 12, min,fill=NA, na.rm=FALSE,align='right'), by=metro]
metrodata<-metrodata[, hpi12max:=rollapply(hpi, 12, max,fill=NA, na.rm=FALSE,align='right'), by=metro]

In the code above we take advantage of the data table structure and some functions to compute some time series calculations that will be helpful for the plots we’ll make later. For example, the diff function is used together with some exponentiation to calculate the monthly, quarterly, and annual house price appreciation (all annualized). I also used the rollapply function to caclulate a rolling 12-month min and max (backward looking by using align=‘right’).

Before we close out this data preparation blog post, let’s at least make one plot from the data. The code below generates a time series plot for the national index from January 2000 to March 2016.

The chart shows the rise, fall, and recovery of national house prices. In this index house prices are still a little below their nominal peak, but trending higher. In the follow-up posts, we’ll get creative with the data visualizations.

Click here for Part 2: sparklines and dots (animated)