20 February 2017

LET US BUILD ON YESTERDAY’S POST (LINK) and construct more VISUALIZATIONS of house prices. In this post, I’ll include some R code so you can play along.

We are going to construct our own GRAND TOUR (Wikipedia) except instead of touring European antiquity, we will explore recent trends in house prices around the continental United States. But we will perhaps still pick up some culture, or at least a new ggplot2 theme.

A new theme

For this post, we’ll most ignore the ticks out business from an earlier post and instead try a different theme. We will use the newly released hrbrthemes package.

Data carpentry

This morning, via twitter I came across the following essay (LINK) arguing that instead of data wrangling, munging and the like, we should describe data manipulation attempts as Data Carpentry. A good idea, and it inspired me to tweet the following poem:

Data carpentry
Some are expert artisans crafting custom design,
Others cogs on efficient assembly line,
Me: Chair wobbling. Seems just fine!

So let us construct our wobbly chair, full of loving care.

Get data

We are going to combine data from two sources. First, we will get house price data from the publicly available Freddie Mac House Price Index and second we’ll use employment data from the Bureau of Labor Statistics (BLS).

For more details on these data see respectively:

Following the house price post, we’ll begin with a text file with metro area house prices, and following the second we’ll get employment data directly from BLS. To make things easier, I’ve post a link to the updated house price text file below, and copied the employment data code below that.

House price data

In order to make our map, we’ll want to merge on the latitude and longitude of the principal city for each metro.

We are going to use the us.cities data that comes with the maps library. To get the cbsa locations, we need to merge on the principal city of each metro area to the us.cities data. The us.cities file has the latitude and longitude of many U.S. cities.

The U.S. Census Bureau has convenient files here that will allows us to map U.S. cities to metro areas. We can grab a mapping of principal cities to CBSA and merge to the us.cities data. I’ve also added metro population (in 2015), which will be useful for sorting later.

I’ve also created a utility file that has the principal state and Census Region and Census Division for that principal state for each metro area. I’ve called that region.txt.

In summary, we’ll need these three files:

Armed with these files we can run the following data preparation code:

################################################
## Load libraries
################################################
library(tidyverse)
library(data.table)
library(viridis)  #we want pretty colors later
library(maps)
library(scales)

#### HRBRTHEMES ################################

### Run to install #############################
#devtools::install_github("hrbrmstr/hrbrthemes")
library(hrbrthemes)
library(gcookbook)
library(extrafont)
################################################

data(us.cities) # from the package maps
cbsa.data <-fread("data/cbsa.city.txt") #our first utility file
cbsa.metro<-cbsa.data[metro.micro=="Metropolitan Statistical Area"]

#create lowercase names
cbsa.metro[,nameL:=tolower(name)]
us.cities<-data.table(us.cities)[,nameL:=tolower(name)]

d<-merge(cbsa.metro,us.cities,by="nameL")
#get rid of duplicates
# see: http://stackoverflow.com/questions/15776064/r-first-observation-by-group-using-data-table-self-join
d<-d[order(-pop)]
d<-d[d[,list(row1 = .I[1]), by = list(cbsa)][,row1]]

# load house price data
dm<-fread("data/fmhpi2016q4metro.txt")
dm$date<-as.Date(dm$date, format="%m/%d/%Y")
#compute year-over-year house price growth
dm[,hpa12:=hpi/shift(hpi,12,fill=NA)-1,by=metro]

# for merging:
setkey(d,cbsa.name)
setkey(dm,metro)
dm2<-merge(dm,d,by.y="cbsa.name",by.x="metro",all.x=T)

#load regions
regions<-fread("data/region.txt")
dm2<-merge(dm2,regions,by.x="state",by.y="statecode")

# Only keep necessary columns
dm2<-dm2[,c("state","statename","division","region","metro","cbsa","lat","long","date","year","month","hpi","hpa12","metro.pop"), with=F]
# Check data
library("htmlTable")
library(lubridate)
# make tables for viewing formatting dates with purr %>% operations
htmlTable(head(dm2 %>% map_if(is.Date, as.character,format="%b %d,%Y") %>% map_if(is.numeric, round,0) %>%as.data.frame() ,10), col.rgroup = c("none", "#F7F7F7"),caption="Merged house price data",
          tfoot="Source: Freddie Mac House Price Index, Census")
Merged house price data
state statename division region metro cbsa lat long date year month hpi hpa12 metro.pop
1 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Jan 01,1975 1975 1 36 399790
2 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Feb 01,1975 1975 2 36 399790
3 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Mar 01,1975 1975 3 37 399790
4 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Apr 01,1975 1975 4 37 399790
5 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 May 01,1975 1975 5 38 399790
6 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Jun 01,1975 1975 6 39 399790
7 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Jul 01,1975 1975 7 39 399790
8 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Aug 01,1975 1975 8 40 399790
9 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Sep 01,1975 1975 9 40 399790
10 AK Alaska Pacific Division West Region Anchorage, AK 11260 61 -149 Oct 01,1975 1975 10 41 399790
Source: Freddie Mac House Price Index, Census

These data have geographic identifies (state=Principal state, metro=name,cbsa=cbsa code, statename=State long name, division=Census division,region=Census region, lat=latitude,long=longtitutde), and data variables (date, year, month), house price information (hpi=house price index, hpa12=year-over-year percent change in index,metro.pop=metro population in 2015).

Testing…

We could just let that data chill for a second, but let’s give that new theme a spin. Let’s make a simple line chart showing the house price index for Columbus, Ohio.

ggplot(data=dm2[metro=="Columbus, OH"],aes(x=date,y=hpi))+
  geom_line()+
  theme_ipsum()+  # try new theme
  labs(x="",y="House Price Index, (NSA, Dec 2000=100)",
       subtitle="Testing a line plot",
       title="House prices in Columbus, Ohio",
       caption="@lenkiefer Source: Freddie Mac House Price Index" )

plot of chunk feb-20-2017-graph-1

For whatever reason, I think the gridlines are a little too dark, so I’d rather they be lightgray. Let’s tweak the theme and redraw with annual house price appreciation for Miami, Florida.

ggplot(data=dm2[metro=="Miami-Fort Lauderdale-West Palm Beach, FL"],aes(x=date,y=hpa12))+
  geom_line()+
  theme_ipsum()+  # try new theme
  scale_y_percent()+  #use scale_y_percent function from hrbrtheme
  labs(x="",y="Annual Percentage Change in House Price Index, (NSA, Dec 2000=100)",
       subtitle="Testing a line plot",
       title="House price appreciation in Miami-Fort Lauderdale-West Palm Beach, FL",
       caption="@lenkiefer Source: Freddie Mac House Price Index" )+
  theme(panel.grid.major=element_line(color="lightgray"),
        panel.grid.minor=element_line(color="lightgray"))+
  geom_hline(yintercept=0) #darken the x axis at 0

plot of chunk feb-20-2017-graph-2

Employment data

Getting the employment data is exactly as in my wrangling employment data post, but we’ll recreate it here for completeness.

emp.data<-fread("https://download.bls.gov/pub/time.series/sm/sm.data.54.TotalNonFarm.All")
emp.series<-fread("https://download.bls.gov/pub/time.series/sm/sm.series")

emp.list<-emp.series[industry_code==0 # get all employment
                     & data_type_code==1 # get employment in thousands
                     & seasonal=="U",]  # get seasonally adjusted data]

emp.area<-fread("https://download.bls.gov/pub/time.series/sm/sm.area",
                col.names=c("area_code","area_name","drop"))[,c("area_code","area_name"),with=F]

emp.st<-fread("https://download.bls.gov/pub/time.series/sm/sm.state",
              col.names=c("state_code","state_name","drop"))[,c("state_code","state_name"),with=F]

# merge data
emp.dt<-merge(emp.data,emp.list,by="series_id",all.y=T)

#create month variable
emp.dt=emp.dt[,month:=as.numeric(substr(emp.dt$period,2,3))]
# (this assignment is to get around knitr/data table printing error)
# see e.g. http://stackoverflow.com/questions/15267018/knitr-gets-tricked-by-data-table-assignment

# M13 = Annual average, drop it:
emp.dt<-emp.dt[month<13,]

#create date variable
emp.dt$date<- as.Date(ISOdate(emp.dt$year,emp.dt$month,1) ) 

# merge on area and state codes
emp.dt<-merge(emp.dt,emp.area,by="area_code")
emp.dt<-merge(emp.dt,emp.st,by="state_code")
emp.dt=emp.dt[,c("state_name","area_name","date","year","month","value"),with=F]

emp.dt=emp.dt[,emp:=as.numeric(value)] #convert value to numeric
# Compute year-over-year change in employment and year-over-year percent change
emp.dt=emp.dt[,emp.yoy:=emp-shift(emp,12,fill=NA),by=c("area_name","state_name")]
emp.dt=emp.dt[,emp.pc:=(emp-shift(emp,12,fill=NA))/shift(emp,12,fill=NA),by=c("area_name","state_name")]
emp.dt=emp.dt[,max.emp.st:=max(emp),by=c("state_name")]
emp.dt=emp.dt[,type:=ifelse(area_name=="Statewide","State","Metro")]

# drop states in c("Puerto Rico","Virgin Islands")
emp.dt=emp.dt[year>1989  &!(state_name %in% c("Puerto Rico","Virgin Islands")),]

Now that we have those data, let’s make a quick plot, replicating our employment plot for Ohio metros, but using the new theme:

#set up recessions:
recessions.df = read.table(textConnection(
  "Peak, Trough
  1973-11-01, 1975-03-01
  1980-01-01, 1980-07-01
  1981-07-01, 1982-11-01
  1990-07-01, 1991-03-01
  2001-03-01, 2001-11-01
  2007-12-01, 2009-06-01"), sep=',',
colClasses=c('Date', 'Date'), header=TRUE)
# trim down
recessions.trim = subset(recessions.df, Peak >= "1990-01-01" )
# Plot employment series for Ohio:
g1<-
  ggplot(data=emp.dt[state_name=="Ohio" & type=="Metro"& year>1989])+
  geom_rect(data=recessions.trim, aes(xmin=Peak, xmax=Trough, ymin=-Inf, ymax=+Inf), fill='gray', alpha=0.5)+
  geom_line(aes(x=date,y=emp.pc,group=area_name))+
  theme_ipsum()+
  facet_wrap(~area_name,ncol=3)+scale_y_continuous(labels=percent)+
  geom_hline(color="black",yintercept=0)+
  labs(x="",y="",title="Annual percent change in total nonfarm employment",
       subtitle="Ohio Metro Areas (NSA)",
       caption="@lenkiefer Source: U.S. Bureau of Labor Statistics")+
  theme(panel.grid.major=element_line(color="lightgray"),
        panel.grid.minor=element_line(color="lightgray"))
g1

plot of chunk feb-20-2017-ohio-plot-1

Okay, we’ve got the data we want. Now let’s put it together. Things are about to get slightly interesting.

Combining data

We want to merge our house price data in dm2 with the employment data in emp.dt. Fortunately, we have a common key in metro names and for the most part names line up. However, it turns out that the metros in New England are recorded as NECTAs rather than MSAs.

There are some options, but we’ll choose the hacksaw rather than the scalpel. We’re really only going to be interested in the largest metro areas, and the only large metro area in New England is Boston, so we’ll take care of that and ignore the smaller metro areas for today. For our purposes the Boston NECTA and Boston MSA are close enough. For details see: https://www.census.gov/population/metro/data/def.html.

# replace Boston-Cambridge-Nashua, MA-NH NECTA with Boston-Cambridge-Newton, MA-NH
# FIX for Boston (approximate)
emp.dt[,area_name:=ifelse(area_name=="Boston-Cambridge-Newton, MA NECTA Division",
                "Boston-Cambridge-Newton, MA-NH",
                area_name)]

# Create common key mergind date and metro
dm2[,md:=paste(as.character(date),metro)]
emp.dt[,md:=paste(as.character(date),area_name)]
dt<-merge(dm2,emp.dt2[,c("md","emp","emp.yoy","emp.pc","type"),with=F],by="md")

Now that w ehave our merged data, let’s create a plot showing the relationship of year-over-year house price percentage changes to employment percentage changes by metro area in December of 2016.

ggplot(data=dt[date=="2016-12-01",],
       aes(x=emp.pc,y=hpa12,color=region,size=metro.pop))+
  geom_point(alpha=0.82)+
  theme_ipsum()+
  scale_color_ipsum() + # special theme colors
  scale_y_percent()+
  scale_x_percent()+
      geom_hline(yintercept=0,color="black")+
    guides(size=F)+
    geom_vline(xintercept=0,color="black")+
  labs(x="Annual percentage change\nin metro employment",
       y="Annual percentage change\nin metro house prices",
       title="Metros with stronger labor markets \nexperience stronger house price growth",
       subtitle="House prices and employment in December 2016",
       caption="@lenkiefer Source: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics\nEach dot represents one metro area color coded by region and size based on metro area population (2015).")+
    theme(panel.grid.major=element_line(color="lightgray"),
          panel.grid.minor=element_line(color="lightgray"),
          legend.position="bottom")

plot of chunk feb-20-2017-graph-3

This chart tells a host of stories. In general house prices rise with stronger employment growth (we’ll soon see that’s true across most time periods). Also, there are regional differences. While I’ve color coded the regions, it might not jump out, so let’s use facet_wrap to create a small multiple by region and add a regression line.

ggplot(data=dt[date=="2016-12-01",],
       aes(x=emp.pc,y=hpa12,color=region,size=metro.pop))+
  geom_point(alpha=0.82)+
  theme_ipsum()+
  scale_color_ipsum() + # special theme colors
  scale_y_percent()+
  scale_x_percent()+
      geom_hline(yintercept=0,color="black")+
    guides(size=F)+
    geom_vline(xintercept=0,color="black")+
  labs(x="Annual percentage change\nin metro employment",
       y="Annual percentage change\nin metro house prices",
       title="Metros with stronger labor markets \nexperience stronger house price growth",
       subtitle="House prices and employment in December 2016",
       caption="@lenkiefer Source: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics\nEach dot represents one metro area color coded by region and size based on metro area population (2015).")+
    theme(panel.grid.major=element_line(color="lightgray"),
          panel.grid.minor=element_line(color="lightgray"),
          legend.position="bottom")+
  facet_wrap(~region)+stat_smooth(method="lm",size=0.75,formula=y~x,se=F)

plot of chunk feb-20-2017-graph-4

Here, we can see that the positive relationship between house price growth and employment appears to be strongest in the South and West Regions.

Animate it

Of course, this scatterplot is practically dying to be animated, so let’s do it!

If you’ve been about, you know the drill. For the newcomers (welcome!, consider following me), see this simple example.

For smoother animations we’ll use tweenr. See my earlier post about tweenr for an introduction to tweenr, and more examples here and here.

library(tweenr)
library(animation)

# Function for data prepartion
# Subset on December
myf<-function(in.y){
  dt2<-subset(dt, year==in.y & month==12)
  dt2 %>% map_if(is.character, as.factor) %>% as.data.frame -> dt.out
  dt.out$year<-factor(dt.out$year)
  return(dt.out)
}

# cycle through years
my.list<-lapply(c(2016,seq(1991,2016,1)),myf)

# Tweenr functions
tf <- tween_states(my.list, tweenlength= 2, statelength=3, ease=rep('cubic-in-out',3),
                   nframes=300)

tf<-data.table(tf)

# Loop to animate:
oopt = ani.options(interval = 0.15)
saveGIF({for (i in 1:max(tf$.frame)) {
  g<-
    ggplot(data=tf[.frame==i],
           aes(x=emp.pc,y=hpa12,group=metro,color=region,size=metro.pop))+
    geom_hline(yintercept=0,color="black")+
    guides(size=F)+
    geom_vline(xintercept=0,color="black")+
    geom_point(alpha=0.82)+theme_ipsum()+  scale_color_ipsum(name="Region") +
    scale_y_continuous(label=percent,limits=c(-.41,.4),breaks=seq(-.4,.4,.1))+
    scale_x_continuous(label=percent,limits=c(-.3,.2),breaks=seq(-.3,.2,.1))+
    theme(panel.grid.major=element_line(color="lightgray"),
          panel.grid.minor=element_line(color="lightgray"),
          legend.position="bottom")+
    labs(x="Annual percentage change in metro employment",
         y="Annual percentage change in metro house prices",
         title="Metros with stronger labor markets \nexperience stronger house price growth",
         subtitle=paste("House prices and employment in December",tf[.frame==i]$year),
         caption="@lenkiefer Source: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics\nEach dot represents one metro area color coded by region and size based on metro area population (2015).")
  print(g)
  print(paste(i,"out of",max(tf$.frame)))
  ani.pause()}
},movie.name="emp hpi tween 02 20 2017.gif",ani.width = 650, ani.height = 550)

Running this generates:

scatter gif weekly

And if we simply add facet_wrap(~region) to the g function we get:

scatter gif weekly

Let’s take a tour

Now that we’ve got our wobbly data tables let’s go ahead and build up a tour.

The idea is to generate a composite plot combining a map (to tell us where we are), a scatterplot and two line plots. Then we’ll animate it.

The composite static plot

Let’s first build our static plot for a particular metro area, Washington D.C.

We’ll call the multiplot function to combine ggplot graphs and wrap it in a function.

myplot<-function(df){

#make a map (sorry no AlbersUSA today)
  g.map<-
    ggplot(df, aes(x = long, y = lat)) +
    borders("state",  colour = "grey70",fill="lightgray",alpha=0.5)+
    theme_void()+
    theme(legend.position="none",
          plot.title=element_text(face="bold",size=18))+
    geom_point(alpha=0.82,color="black",size=3)+
    labs(title="House price & employment trends",
         subtitle=head(df,1)$metro,
         caption="@lenkiefer Source: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics")+
    theme(plot.caption=element_text(hjust=0))

  #house price bar  
  g.bar<-
    ggplot(data=df,aes(x=date,y=hpa12,fill=hpa12))+geom_col()+
    theme_ipsum()+
    scale_y_continuous(label=percent,limits=c(-.45,.45),breaks=seq(-.45,.45,.15))+
    scale_fill_viridis(option="B",limits=c(-.45,.45))+
    labs(x="",y="",
         title="House Price Appreciation",
         subtitle="year-over-year percent change in index")+
    theme(plot.caption=element_text(hjust=0),
          panel.grid.major=element_line(color="lightgray"),
          panel.grid.minor=element_line(color="lightgray"),
          legend.position="none")
  
  #employment bar
  g.bar2<-
    ggplot(data=df,aes(x=date,y=emp.pc,fill=emp.pc))+geom_col()+
    theme_ipsum()+
    scale_y_continuous(label=percent,limits=c(-.11,.11),breaks=seq(-.1,.1,.05))+
    scale_fill_viridis(option="B",limits=c(-.11,.11))+
    labs(x="",y="",
         title="Employment growth",
         subtitle="year-over-year percent change metro employment")+
    theme(plot.caption=element_text(hjust=0),
          panel.grid.major=element_line(color="lightgray"),
          panel.grid.minor=element_line(color="lightgray"),
          legend.position="none")
  # scatter
  g.scatter<-
    ggplot(data=df,aes(x=emp.pc,y=hpa12))+
    geom_point()+theme_ipsum()+
    geom_hline(yintercept=0,color="black")+
    guides(size=F)+
    geom_vline(xintercept=0,color="black")+
    geom_point(alpha=0.82)+theme_ipsum()+  scale_color_ipsum(name="Region") +
    scale_y_continuous(label=percent,limits=c(-.41,.4),breaks=seq(-.4,.4,.1))+
    scale_x_continuous(label=percent,limits=c(-.11,.11),breaks=seq(-.1,.1,.05))+
    theme(panel.grid.major=element_line(color="lightgray"),
          panel.grid.minor=element_line(color="lightgray"),
          legend.position="bottom")+
    labs(x="Annual percentage change\n in metro employment",
         y="Annual percentage change\n in metro house prices",
         title="",
         subtitle="",
         caption="")
  
    #combine
  g<-multiplot(g.map,g.scatter,g.bar,g.bar2,
               layout=matrix(c(1,2,3,4), nrow=2, byrow=TRUE))
  
  return(g)
}

# test for Washington DC
df<-subset(dt,metro=="Washington-Arlington-Alexandria, DC-VA-MD-WV")
myplot(df)

plot of chunk feb-20-2017-graph-5

## NULL

This plot shows what you would expect, that periods of strong employment growth are generally periods of strong house price growth and vice versa.

Now, using our tweenr tricks, we can creat an animated tour:

# sort metros by metro.pop
cbsa.list3<-d[order(-metro.pop)]$cbsa.name


myf<-function(i){
  dt2<-subset(dt,metro==d[order(-metro.pop)]$cbsa.name[i] & year>1989)
  dt2 %>% map_if(is.character, as.factor) %>% as.data.frame -> dt.out
  return(dt.out)
}

# take top 20 metros
my.list<-lapply(c(seq(1,20),1),myf)

tf <- tween_states(my.list, tweenlength= 2, statelength=3, ease=rep('cubic-in-out',3),
                   nframes=200)

tf<-data.table(tf)

oopt = ani.options(interval = 0.15)
saveGIF({for (i in 1:max(tf$.frame)) {
  g<-myplot(tf[.frame==i])
  print(g)
  print(paste(i,"out of",max(tf$.frame)))
  ani.pause()}
},movie.name="geo tour emp hpi tween 02 20 2017.gif",ani.width = 650, ani.height = 600)

Running this generates:

tour gif

Other options

We could use plotly and crosstalk to make an interactive version of this. Indeed, that’s exactly what I did in this post A guide to building an interactive flexdashboard where you’ll find detailed instructions.

Perhaps you’ll find this post useful, and be inspired to construct your own wobbly table with data carpentry.