04 December 2016

Introduction

I HAVE BEEN WATCHING SOME VIDEOS of Plotcon 2016. All of the videos I’ve watched are worth watching (check out the playlist), but I was particularly interested in this one from Hadley Wickham:

Among other things, Hadley talks about the idea of nesting data frames, models, and model results within a data frame. That idea struck me as something that could be quite useful and not at all something that could lead to an explosive increase in the size of data frames or unending loops.

We’ll try these ideas out in a simple, stylized forecasting exercise. We’ll use R to explore and model.

The data

I happen to have some data handy that’s perfect for this exercise. It’s the Freddie Mac House Price Index I’ve been using for my series of Visual Meditations on House Prices.

We’re going to use house price data for the United States, each of the 50 states, and the District of Columbia, collected in the following file:

Important note: Though I’m going to use house prices as an illustrative example, this shouldn’t be interpreted as my recommendation for a reasonable way to model house prices in any way. This is just for fun, and trying out some coding things.

The strategy

Here’s what we’re going to do. We’re going to take our house price data, a dataset with monthly observations on the house price index from January 1975 to September 2016, and construct a simple forecast for house price growth (year-over-year percent change) for each state rolling forward from 1985.

To make things simple, we’ll subset the data to include only the September observation of each year (September is the last month available for 2016). That reduces the number of observations and gets rid of overlapping time series observations.

We can apply the techniques outlined in Hadley’s Plotcon talk to organize our results in a single data frame.

The old way

What I would usually do in this situation is construct a series of loops to iterate over each state and each time period. Something like this (where forecast.function and stack.data.function are functions that compute forecasts and organize the output data, respectively):
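A minimal sketch of that loop-based pattern follows. The bodies of forecast.function and stack.data.function here are toy stand-ins (assumptions, not the original helpers), and the state and year vectors are made up for illustration.

```r
# Toy stand-ins (assumptions): a constant "forecast" and a simple stacker.
forecast.function <- function(state, year) {
  rep(0.03, 4)  # pretend 4-year-ahead growth forecast
}
stack.data.function <- function(state, year, fc) {
  data.frame(state = state, origin = year,
             year = year + seq_along(fc), hpa12 = fc,
             stringsAsFactors = FALSE)
}

states <- c("VA", "CA")
years  <- 1985:1987

# Pre-allocate a list and keep track of the index by hand.
results <- vector("list", length(states) * length(years))
k <- 1
for (s in states) {
  for (y in years) {
    fc <- forecast.function(s, y)
    results[[k]] <- stack.data.function(s, y, fc)
    k <- k + 1
  }
}
out <- do.call(rbind, results)
```

The manual index k is exactly the kind of bookkeeping that becomes fragile when rows are added or removed.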

Nest with map

Now we can avoid the loops by using the map or map2 functions from the purrr package. Instead of loops, we can take a dataframe with a list of states and dates and then use the map2 function to store our model results in the same dataframe. Like so:
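A small sketch of the map2 pattern, assuming the purrr and tibble packages are installed. toy.forecast is a hypothetical placeholder for a real forecasting function, and the grid of states and years is made up.

```r
library(purrr)
library(tibble)

# Hypothetical placeholder: returns a small data frame of "forecasts".
toy.forecast <- function(state, year) {
  tibble(year = year + 1:4, hpa12 = rep(0.03, 4))
}

# One row per (state, year) combination we want a forecast for.
grid <- tibble(
  state = rep(c("VA", "CA"), each = 3),
  year  = rep(1985:1987, times = 2)
)

# map2 walks the two columns in parallel and returns a list,
# which is stored as a list-column next to its inputs.
grid$fc <- map2(grid$state, grid$year, toy.forecast)
```

Each row of grid now carries its own forecast data frame, so inputs and outputs stay aligned without an index variable.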

Besides being more efficient to write, this will result in a nice structure and we don’t have to worry about index numbers and aligning things if we want to add or subtract rows.

Example using house prices

Let’s load packages, import the data from the text file above, and do some data manipulations:

Now that we have our data loaded let’s take a peek at the structure:

State House Price Data

      date      state  hpi     year  month  type   hpa12
  1   Sep-1976  AK     41.953  1976  9      State  0.077
  2   Sep-1976  AL     38.775  1976  9      State  0.073
  3   Sep-1976  AR     41.717  1976  9      State  0.084
  4   Sep-1976  AZ     30.269  1976  9      State  0.039
  5   Sep-1976  CA     19.974  1976  9      State  0.162
  6   Sep-1976  CO     22.027  1976  9      State  0.066
  7   Sep-1976  CT     27.238  1976  9      State  0.06
  8   Sep-1976  DC     22.244  1976  9      State  0.101
  9   Sep-1976  DE     28.837  1976  9      State  0.011
  10  Sep-1976  FL     34.051  1976  9      State  0.036

Source: Freddie Mac House Price Index

What we want to do is forecast the house price appreciation (“hpa12”: year-over-year percent change in index) at different points in time.
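As a quick illustration of what a year-over-year change means for a monthly index, here is a base R sketch on a synthetic series (the numbers are made up, not Freddie Mac data, and this is not necessarily how the original column was built).

```r
# Synthetic monthly index (assumption): 36 months growing 0.5% per month.
hpi <- 100 * 1.005^(0:35)

# Year-over-year change: this month's index divided by the index
# 12 months earlier, minus one. The first 12 months have no
# prior-year value, so they are NA.
hpa12 <- c(rep(NA_real_, 12), hpi[13:36] / hpi[1:24] - 1)
```

Expressed this way, hpa12 is a fraction rather than a percentage, which matches the scale of the values in the table above.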

Working on a single state

Before we get into the nesting business, let’s just do it for a single state, for a single month. Let’s extract the history of house price growth (hereafter HPA) in Virginia from 1976 through 2016 in September of each year:

House prices in Virginia are growing by just under 3 percent year-over-year in September 2016, up from the lows of 2009, but below the middle part of last decade and also a bit below the long-run average.

Let’s construct a simple time series forecasting model for house prices (see my note above, this is just for illustration) based on the history of HPA. There are many models we could use, but one of the simplest is an autoregressive model. We can code this pretty easily in R, but working with dates can be tricky. I’ve tried a couple of things, but xts works for me.

We fit a simple autoregressive model with up to two lags of HPA.
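One way such a fit might look, sketched with base R's ar() on a synthetic annual series. This is an assumption-laden stand-in: the series is simulated rather than the Virginia data, and a plain ts object is used instead of xts.

```r
set.seed(1)

# Synthetic annual HPA series (assumption): persistent AR(1) around 4%.
hpa <- 0.04 + as.numeric(arima.sim(list(ar = 0.7), n = 41, sd = 0.02))
hpa.ts <- ts(hpa, start = 1976)

# Fit an autoregression with two lags. aic = FALSE forces both lags;
# dropping it lets AIC choose any order up to order.max.
fit <- ar(hpa.ts, order.max = 2, aic = FALSE)

fit$order   # number of lags in the fitted model
fit$ar      # estimated AR coefficients
fit$x.mean  # estimated mean the series reverts to
```

A first coefficient well above zero is what persistence looks like in this kind of model.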

There’s quite a bit of persistence in the HPA series, reflected in the coefficients. We might be concerned about stationarity with this series. We can use some nice functions from Rob Hyndman to check:
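Which functions were used originally is not shown here. One candidate from Rob Hyndman's forecast package is ndiffs(), which estimates how many differences a series needs to become stationary. The sketch below assumes the forecast package is installed and runs on a simulated series, not the Virginia data.

```r
library(forecast)

set.seed(1)

# Synthetic annual HPA series (assumption): persistent AR(1) around 4%.
hpa <- ts(0.04 + as.numeric(arima.sim(list(ar = 0.7), n = 41, sd = 0.02)),
          start = 1976)

# Estimated number of differences needed for stationarity, using a
# KPSS test. A value of 0 suggests the level can be modeled directly.
d <- ndiffs(hpa, test = "kpss")
```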

Now we can use this simple AR model to forecast house prices. The code below will stack the predictions to the input data. Then we’ll make a simple forecast plot:
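A sketch of the forecast-and-stack step in base R, under the same assumptions as before: a simulated series stands in for the Virginia data, and the column names are illustrative.

```r
set.seed(1)

# Synthetic history (assumption): annual HPA for 1976-2016.
hpa <- 0.04 + as.numeric(arima.sim(list(ar = 0.7), n = 41, sd = 0.02))
hpa.ts <- ts(hpa, start = 1976)
hist.df <- data.frame(year = 1976:2016, hpa12 = hpa, type = "history")

# Fit the AR(2) and project four years ahead; $pred holds point forecasts.
fit <- ar(hpa.ts, order.max = 2, aic = FALSE)
fc  <- predict(fit, newdata = hpa.ts, n.ahead = 4)

# Stack the forecasts under the history, tagged so a plot can
# draw the two pieces differently.
fc.df <- data.frame(year = 2016 + 1:4,
                    hpa12 = as.numeric(fc$pred),
                    type = "forecast")
stacked <- rbind(hist.df, fc.df)
```

With a type column in place, a single ggplot call can map it to color or linetype to separate history from forecast.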

Okay, things look reasonable. These forecasts are sort of dumb: they don’t account for inflation or other factors. But based on the history of house prices, they aren’t totally outlandish. What’s a little more interesting is to consider what these simple forecasts would have looked like if we rolled back the clock.

Again, just using Virginia as an example, we can go back for each year since 1985, fit a forecasting model on history up to that point in time, and then project forward a few years. We could do it with a loop, but instead we’ll use Hadley’s approach described in Plotcon and store the results of each estimate as a dataframe nested inside a data frame. This structure will be more compact, and also make plotting with ggplot quite simple, no explicit loops involved.

Build a function

To get this to work, we’re going to need a function with two inputs: first, the name of a state; second, a maximum date. The function will construct the forecast for that state using data up to that date (as we did for Virginia above) and stack the forecasts.
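Here is a rough sketch of what such a two-input function could look like. Everything in it is an assumption for illustration: the name forecast.state, the simulated panel, and the use of integer years instead of dates.

```r
set.seed(1)

# Synthetic panel (assumption): annual HPA for two states, 1976-2016.
make.state <- function(s) {
  data.frame(state = s, year = 1976:2016,
             hpa12 = 0.04 + as.numeric(
               arima.sim(list(ar = 0.7), n = 41, sd = 0.02)),
             stringsAsFactors = FALSE)
}
panel <- rbind(make.state("VA"), make.state("CA"))

# Hypothetical helper: fit an AR(2) on one state's history through
# max.year, then return the last actual value plus h forecasts.
forecast.state <- function(s, max.year, h = 4) {
  d   <- panel[panel$state == s & panel$year <= max.year, ]
  x   <- ts(d$hpa12, start = min(d$year))
  fit <- ar(x, order.max = 2, aic = FALSE)
  fc  <- predict(fit, newdata = x, n.ahead = h)
  data.frame(year  = max.year + 0:h,
             hpa12 = c(tail(d$hpa12, 1), as.numeric(fc$pred)),
             state = s,
             stringsAsFactors = FALSE)
}

va.2005 <- forecast.state("VA", 2005)
```

Including the last actual observation as the first row means each forecast path connects to the history line when plotted.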

Let’s use this function by adding a forecast based on data up to September 2005 to our original plot for Virginia:

Virginia house price forecasts from simple AR(2) model

     date      hpa12  state
  1  Sep-2005  0.179  VA
  2  Sep-2006  0.14   VA
  3  Sep-2007  0.114  VA
  4  Sep-2008  0.097  VA
  5  Sep-2009  0.086  VA

Source: Freddie Mac House Price Index
AR(2) forecast fit on data through Sep 2005

Here we can see that, based on the history of HPA, a simple model would have expected some mean reversion back towards a long-run average, as depicted by the tentacle extending from the plot. What would it look like if we added a tentacle for each year? We could do that through a loop, or use the nesting described in Hadley’s talk. Let’s try that:

Nesting

The function below will allow us to nest each prediction in our dataframe. For now, we’ll restrict the data just to Virginia, but soon we’ll add in all 50 states plus D.C. Turns out it just takes a couple lines of code!
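A sketch of the nesting step, assuming purrr and tibble are installed. The series is simulated and forecast.to is a hypothetical helper, so treat this as an outline of the idea rather than the original code.

```r
library(purrr)
library(tibble)

set.seed(1)

# Synthetic Virginia history (assumption): annual HPA for 1976-2016.
va <- tibble(year  = 1976:2016,
             hpa12 = 0.04 + as.numeric(
               arima.sim(list(ar = 0.7), n = 41, sd = 0.02)))

# Hypothetical helper: AR(2) fit on data through max.year, returning
# the last actual value plus four forecast years.
forecast.to <- function(s, max.year) {
  d   <- va[va$year <= max.year, ]
  x   <- ts(d$hpa12, start = 1976)
  fit <- ar(x, order.max = 2, aic = FALSE)
  fc  <- predict(fit, newdata = x, n.ahead = 4)
  tibble(fyear = max.year + 0:4,
         hpa12 = c(tail(d$hpa12, 1), as.numeric(fc$pred)))
}

# One row per forecast origin; the forecasts live in a list-column.
nested <- tibble(state = "VA", year = 1985:2016)
nested$fc <- map2(nested$state, nested$year, forecast.to)
```

The state argument is unused in this single-state sketch, but keeping it makes the same call shape work once more states are added.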

Now we have a data frame with a set of data frames (one for each date) corresponding to forecasts beginning at the date and extending 4 years.

Now we can easily construct our tentacle plot. We use the unnest() function to expand the data.
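To show what unnest() does to a list-column, here is a tiny example with made-up toy values (not model output), assuming tidyr 1.0.0 or later for the cols argument.

```r
library(tibble)
library(tidyr)

# Two forecast origins, each holding its own small data frame.
nested <- tibble(
  origin = c(2005, 2006),
  fc = list(
    tibble(year = 2005:2007, hpa12 = c(0.10, 0.08, 0.06)),
    tibble(year = 2006:2008, hpa12 = c(0.05, 0.04, 0.04))
  )
)

# unnest() expands the list-column: one output row per inner row,
# with the outer columns (here, origin) repeated alongside.
flat <- unnest(nested, cols = fc)
```

Because origin is repeated on every row, it can serve as the grouping variable in a plot, so each forecast path is drawn as its own line.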

We can also roll over each state:

And we can make a gif looping through a few key states:

Going forward

This type of approach could be quite useful for a number of other applications. The map functions from purrr have a lot of potential uses I’m looking forward to trying out in the future. So far, I haven’t blown up my computer with an endless loop. If I can keep it that way, I’ll follow up in this space with some other applications.