04 December 2016

Introduction

I HAVE BEEN WATCHING SOME VIDEOS of Plotcon 2016. All of the videos I’ve watched are worth watching (check out the playlist), but I was particularly interested in this one from Hadley Wickham:

Among other things Hadley talks about the idea of nesting data frames, models and model results within a data frame. That idea struck me as something that could be quite useful and not at all something that could lead to explosive increase in the size of data frames or unending loops.

We’ll try these ideas out in a simple, stylized forecasting exercise. We’ll use R to explore and model.

The data

I happen to have some data handy that’s perfect for this exercise. It’s the Freddie Mac House Price Index I’ve been using for my series of Visual Meditations on House Prices.

We’re going to use data house prices for the United States, each of the 50 states and the District of Columbia collected in the following file:

  1. state and national called fmhpi2016q3.txt

Important note: Though I’m going to use house prices as an illustrative example, this shouldn’t be interpreted as my recommendation for a reasonable way to model house prices in any way. This is just for fun, and trying out some coding things.

The strategy

Here’s what we’re going to do. We’re going to take our house price data, a dataset with monthly observations on the house price index from January 1975 to September 2016, and construct a simple forecast for house price growth (year-over-year percent change) for each state rolling forward from 1985.

To make things simple we’ll subset the data to just include observations in September of each (the last month available for 2016). That will reduce the number of observations and get rid of overlapping time series observations.

We can apply the techniques outlined in Hadley’s Plotcon talk to organize our results in a single data frame.

The old way

What I would usually do in this situation, is construct a series of loops to iterate over each state and each time period. Something like this (where forecast.function, and stack.data.function are some functions that compute forecasts and organize the output data respectively):

for (i: 1:N.states){                                                    # iterate over states
  for (t: 1:T.months){                                                  # iterate over time periods
    newdata<-filter(data, date<=T & state==i)                           # filter data, state=i, date <= T
    forecast.data<-forecast.function(newdata)                           # forecast function
    output.data<-stack.data.function(output.data,forecast.data)         # organize data function
    }
  }

Nest with map

Now we can avoid the loops by using the map or map2 functions from the purrr package. Instead of loops, we can take a dataframe with a list of states and dates and then use the map2 function to store our model results in the same dataframe. Like so:

output.data<- data %>% mutate(mod=map2(state,date,forecast.function))

Besides being more efficient to write, this will result in a nice structure and we don’t have to worry about index numbers and aligning things if we want to add or subtract rows.

Example using house prices

Let’s load packages, import the data from the text file above, and do some data manipulations:

#load libraries
library(data.table, warn.conflicts = FALSE, quietly=TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly=TRUE)
library(dplyr, warn.conflicts = FALSE, quietly=TRUE)
library(zoo, warn.conflicts = FALSE, quietly=TRUE)
library(ggrepel, warn.conflicts = FALSE, quietly=TRUE)
library(ggthemes, warn.conflicts = FALSE, quietly=TRUE)
library(scales, warn.conflicts = FALSE, quietly=TRUE)
library(tidyr, warn.conflicts = FALSE, quietly=TRUE)
library(zoo,warn.conflicts=F,quietly=T)
library(purrr,warn.conflicts=F,quietly=T)
library(xts,warn.conflicts=F,quietly=T)
library(lubridate,warn.conflicts=F,quietly=T)
library(viridis,warn.conflicts = F,quietly = F) #for the colorz

#load data from text file
d.state <- fread("data/fmhpi2016q3.txt")
#set up date variable
d.state$date<-as.Date(d.state$date, format="%m/%d/%Y")
#get list of states
state.list<-unique(d.state$state)

# construct variable hpa12 which is 12-month percentage change in house price index
# using data.table for convenience of rolling operations across groups
d.state[,hpa12:=c(rep(NA,12),((1+diff(hpi,12)/hpi))^1)-1,by=state]

#subset so that year >1975 and month ==9
d.state<-d.state[month==9 & year>1975]

Now that we have our data loaded let’s take a peek at the structure:

library("htmlTable")
# make tables for viewing formatting dates with purr %>% operations
htmlTable(head(d.state %>% map_if(is.Date, as.character,format="%b-%Y") %>% map_if(is.numeric, round,3) %>%as.data.frame() ,10), col.rgroup = c("none", "#F7F7F7"),caption="State House Price Data",
          tfoot="Source: Freddie Mac House Price Index")
State House Price Data
date state hpi year month type hpa12
1 Sep-1976 AK 41.953 1976 9 State 0.077
2 Sep-1976 AL 38.775 1976 9 State 0.073
3 Sep-1976 AR 41.717 1976 9 State 0.084
4 Sep-1976 AZ 30.269 1976 9 State 0.039
5 Sep-1976 CA 19.974 1976 9 State 0.162
6 Sep-1976 CO 22.027 1976 9 State 0.066
7 Sep-1976 CT 27.238 1976 9 State 0.06
8 Sep-1976 DC 22.244 1976 9 State 0.101
9 Sep-1976 DE 28.837 1976 9 State 0.011
10 Sep-1976 FL 34.051 1976 9 State 0.036
Source: Freddie Mac House Price Index

What we want to do is forecast the house price appreciation (“hpa12”: year-over-year percent change in index) at different points in time.

Working on a single state

Before we get into the nesting business, let’s just do it for a single state, for a single month. Let’s extract the history of house price growth (hereafter HPA) in Virginia from 1976 through 2016 in September of each year:

# just plot data for VA

ggplot(data=d.state[state=="VA",],aes(x=date,y=hpa12,label=paste(percent(hpa12))))+
  geom_line(color=viridis(10)[1],size=1.1)+
  geom_point(data=tail(d.state[state=="VA",],1),color=viridis(10)[8],alpha=0.82,size=3)+
  geom_text(data=tail(d.state[state=="VA",],1),color=viridis(10)[8],alpha=0.82,size=3,hjust=0,vjust=-1)+
  scale_y_continuous(label=percent)+
  coord_cartesian(xlim=c(as.Date("1976-01-01"),as.Date("2017-12-31")))+
  theme_minimal()+labs(x="",y="",title="Annual percentage change in house prices",
                       subtitle="Virginia in September",
                       caption="@lenkiefer Source: Freddie Mac House Price Index")+
  theme(plot.caption=element_text(hjust=0))

plot of chunk va-plot-recursion-dec4-2016,

House prices in Virginia are growing by just under 3 percent year-over-year in September 2016, up from the lows of 2009, but below the middle part of last decade and also a bit below the long-run average.

Let’s construct a simple times series forecasting model for house prices (see my note above, this is just for illustration) based onthe history of HPA. There are many models we could use, but one of the simplest is an Autoregressive Model.We can code this pretty easily in R, but working with dates can be tricky. I’ve tried a couple things, but xts works for me.

We use a simple autoregressive model fit to up to two lags of HPA.

  df<-d.state[state=="VA",]                       #just get data for VA
  va.ts<-xts(df$hpa12,df$date)                    #create xts time series of hpa12 with data
  va.out<-ar.ols(va.ts,order.max=2)               #construct ar model
  va.out
## 
## Call:
## ar.ols(x = va.ts, order.max = 2)
## 
## Coefficients:
##       1        2  
##  0.9019  -0.2672  
## 
## Intercept: -0.0009794 (0.00611) 
## 
## Order selected 2  sigma^2 estimated as  0.001456

There’s quite a bit of persistence in the HPA series, reflected in the coefficients. We might be concerned about stationarity with this series. We can use some nice functions from Rob Hyndman to check:

source("code/roots.R")  #load AR roots
# see http://robjhyndman.com/hyndsight/arma-roots/ 
# for code
plot.armaroots(arroots(va.out))

plot of chunk va-ar-recursion-2-dec4-2016,

Now we can use this simple AR model to forecast house prices. The code below will stack the predictions to the input data. Then we’ll make a simple forecast plot:

  va.p<-predict(va.out, n.ahead = 4)              #forecasts from ar model
  va.p.ts<-xts(va.p$pred,
                 seq(max(df$date)+years(1),
                     max(df$date)+years(4),"1 year"))
  va.ts<-rbind(va.ts,va.p.ts)
  f.data<-data.frame(date=index(va.ts), 
                     hpa12=coredata(va.ts))
    f.data$f<-ifelse(f.data$date>max(df$date),"forecast","actual")

    #plot forecast:
    ggplot(data=f.data,aes(x=date,y=hpa12,label=paste(percent(hpa12)),color=f,linetype=f))+
      geom_line(size=1.1)+
      scale_color_viridis(name="",discrete=T,end=0.75,direction=1)+
      geom_point(data=tail(f.data,1),color=viridis(10)[8],alpha=0.82,size=3)+
      geom_text(data=tail(f.data,1),color=viridis(10)[8],alpha=0.82,size=3,hjust=0,vjust=-1)+
      scale_y_continuous(label=percent)+
      coord_cartesian(xlim=c(as.Date("1976-01-01"),as.Date("2021-12-31")))+
      theme_minimal()+labs(x="",y="",title="Annual percentage change in house prices",
                           subtitle="Virginia in September, dotted line forecast from AR(2) model",
                           caption="@lenkiefer Source: Freddie Mac House Price Index")+
      guides(linetype=F)+
      theme(plot.caption=element_text(hjust=0),
            legend.position="top")

plot of chunk va-ar-recursion-3-dec4-2016,

Okay things look reasonable. These forecasts are sort of dumb, they don’t account for inflation or other factors, but based on the history of house prices they aren’t totally outlandish. What’s a little more interesting is to consider if we rolled back the clock, what these simple forecasts would have looked like.

Again, just using Virginia as an example, we can go back for each year since 1985, fit a forecasting model on history up to that point in time, and then project forward a few years. We could do it with a loop, but instead we’ll use Hadley’s approach described in Plotcon and store the results of each estimate as a dataframe nested inside a data frame. This structure will be more compact, and also make plotting with ggplot quite simple, no explicit loops involved.

Build a function

To get this to work we’re going to require a function with two inputs. First, it will take the name of state and second it will take a maximum date. Then it will construct the forecast for that state up to that time (as we did for Virginia above) and stack the forecasts.

fcst.d3<-function(s,dmax) { 
  #subset data based on state s and up to date dmax:
  df<-d.state[state==s & date<=dmax,c("state","date","hpa12"),with=F]
  test.ts<-xts(df$hpa12,df$date)
  test.out<-ar(test.ts,order.max=2)
  test.p<-predict(test.out, n.ahead = 4)
  test.p.ts<-xts(test.p$pred,seq(max(df$date)+years(1),max(df$date)+years(4),"1 year"))
  test.ts<-rbind(tail(test.ts,1),test.p.ts)
  f.data<-data.frame(state=s,date=index(test.ts), hpa12=coredata(test.ts))
  f.data$f<-ifelse(f.data$date>max(df$date),"forecast","actual")
  return(f.data)
}

Let’s use this function by adding a forecast based on data up to September 2005 to our original plot for Virginia:

# Examine forecast output
htmlTable(fcst.d3("VA","2005-09-01") %>% map_if(is.Date, as.character,format="%b-%Y") %>% map_if(is.numeric, round,3) %>%as.data.frame() , col.rgroup = c("none", "#F7F7F7"),caption="Virginia house price forecasts\nfrom simple AR(2) model",
          tfoot="Source: Freddie Mac House Price Index\nAR(2) forecast fit on data through Sep 2005")
Virginia house price forecasts from simple AR(2) model
date hpa12 state
1 Sep-2005 0.179 VA
2 Sep-2006 0.14 VA
3 Sep-2007 0.114 VA
4 Sep-2008 0.097 VA
5 Sep-2009 0.086 VA
Source: Freddie Mac House Price Index
AR(2) forecast fit on data through Sep 2005
# Create plot

ggplot(data=d.state[state=="VA",],aes(x=date,y=hpa12,label=paste(percent(hpa12))))+
  geom_line(color=viridis(10)[1],size=1.1)+
  geom_line(data=fcst.d3("VA","2005-09-01"), #use forecast function data
            linetype=2,color=viridis(10)[4],size=1.1)+
  geom_point(data=tail(d.state[state=="VA",],1),color=viridis(10)[1],alpha=0.82,size=3)+
  geom_text(data=tail(d.state[state=="VA",],1),color=viridis(10)[1],alpha=0.82,size=3,hjust=0,vjust=-1)+
  scale_y_continuous(label=percent)+
  coord_cartesian(xlim=c(as.Date("1976-01-01"),as.Date("2017-12-31")))+
  theme_minimal()+labs(x="",y="",title="Annual percentage change in house prices",
                       subtitle="Virginia in September, dotted line forecast from AR(2) model fit on data through 2005",
                       caption="@lenkiefer Source: Freddie Mac House Price Index")+
  theme(plot.caption=element_text(hjust=0))

plot of chunk va-plot-recursion-5-dec4-2016,

Here we can see that based on the history of hpa, a simple model would have expected some mean reversion back towards a long-run average, as depicted by the tentacle extending from the plot. What would it look like if we added a tentacle for each year? We could do that through a loop, or use the nesting described in Hadley’s talk. Let’s try that:

Nesting

The function below will allow us to next each prediction in our dataframe. For now, we’ll restrict the data just to Virginia, but soon we’ll add in all 50 states plus D.C. Turns out it just takes a couple lines of code!

fcsts.va<- d.state[year(date)>1984 & state=="VA", c("date","state","hpa12"),with=F] %>% 
           mutate(fcst=map2(state,date,fcst.d3))
head(fcsts.va,5)
##          date state      hpa12         fcst
## 1: 1985-09-01    VA 0.05019297 <data.frame>
## 2: 1986-09-01    VA 0.06988367 <data.frame>
## 3: 1987-09-01    VA 0.10303571 <data.frame>
## 4: 1988-09-01    VA 0.09598973 <data.frame>
## 5: 1989-09-01    VA 0.07379588 <data.frame>

Now we have a data frame with a set of data frames (one for each date) corresponding to forecasts beginning at the date and extending 4 years.

Now we can easily construct our tentacle plot. We use the unnest() function to expand the data.

p.data<-unnest(fcsts.va,fcst,.sep=".") #unnest the data for plotting
ggplot()+  
  geom_path(aes(x=fcst.date,y=fcst.hpa12,color=factor(year(date))),
                     data=unnest(fcsts.va,fcst,.sep=".")
            ,linetype=2 )    +
  
      geom_path(aes(x=date,y=hpa12),
                data=d.state[year(date)>1984 & state=="VA", c("date","state","hpa12"),with=F]
                ,size=1.05)+
  theme_minimal()+theme(legend.position="none")+  scale_y_continuous(label=percent)+
  labs(x="",y="",title="Annual percentage change in house prices",
                       subtitle="Virginia in September, dotted lines forecast from AR(2) model fit",
                       caption="@lenkiefer Source: Freddie Mac House Price Index\nAR(2) forecast fit on data up to point where solid line and dotted lines diverge")+
  theme(plot.caption=element_text(hjust=0))+

  coord_cartesian(xlim=c(as.Date("1985-01-01"),max(p.data$fcst.date)),ylim=c(-.1,.2))

plot of chunk va-plot-recursion-7-dec4-2016,

We can also roll over each state:

fcsts.st<- d.state[year(date)>1984, c("date","state","hpa12"),with=F] %>% 
           mutate(fcst=map2(state,date,fcst.d3))
           
p.data2<-unnest(fcsts.va,fcst,.sep=".")

ggplot()+  
  geom_path(aes(x=fcst.date,y=fcst.hpa12,color=factor(year(date))),
                     data=unnest(filter(fcsts.st,! state %in% c("DC","US.NSA","US.SA")),fcst,.sep=".")
            ,linetype=2,size=.75 )    +
  
      geom_path(aes(x=date,y=hpa12),
                data=d.state[year(date)>1984 & ! state %in% c("DC","US.NSA","US.SA") , c("date","state","hpa12"),with=F]
                ,size=.85)+
  theme_minimal()+theme(legend.position="none")+  scale_y_continuous(label=percent)+
  labs(x="",y="",title="Annual percentage change in house prices",
                       subtitle="Dotted lines forecast from AR(2) model fit",
                       caption="@lenkiefer Source: Freddie Mac House Price Index\nAR(2) forecast fit on data up to point where solid line and dotted lines diverge")+
  theme(plot.caption=element_text(hjust=0))+
  facet_wrap(~state,ncol=5)

plot of chunk va-plot-recursion-8-dec4-2016,

And we can make a gif looping through a few key states:

tentacle gifs

Going forward

This type of approach could be quite useful for a number of other applications. The map functions from purrr have a lot of potential uses I’m looking forward to trying out in the future. So far, I haven’t blown up my computer with an endless loop. If I can keep it that way, I’ll follow up in this space with some other applications.