Data swarms: Your firearms are useless against them!
18 August 2016
AUGUST IS ALMOST OVER, and it’s nearly back to school season. And that means one thing. No, not that we’re about to get a chance to watch the #1 NCAA football program of all time dominate the gridiron (though that’s awesome too). No, it’s data release season! A data swarm is on its way.
In preparation for all these data releases I’ve been firing up the viz terminal and getting my statistical software all stretched out. Good news is, there are all kinds of great innovations coming out to help me. Bad news is, there’s so much to keep up with. In this post I’m going to try out some new techniques and work on some fundamentals.
Data wrangling with R
Out of expediency I’ve often used Excel for mundane data wrangling tasks (see here for an example), but I’ve been meaning to use R more.
I’m going to try to build from Jonathan’s examples and work out my data wrangling skills with an example using negative equity estimates from Zillow.
Wrangling some data
Zillow publishes a negative equity report that tracks what proportion of homeowners currently are “underwater” or owe more on their mortgage than what their home is currently worth. See a write-up of their latest release from Zillow Chief Economist @SvenjaGudell here. Note that the firm CoreLogic also has a report. The two reports are slightly different due to different underlying data and different, but similar, methodologies. Because the Zillow data is available in a handy .csv format downloadable from their website I’m going to use the Zillow data.
The Zillow data is not tidy, but it’s not too far from it. It’s a good version to practice on. Downloading the data you get a .csv file that looks like this when viewed in Excel. Note that Zillow released the 2016Q2 report today, but the latest available data at time of writing is the 2016Q1 report.
In the data file there are several descriptive columns, and then the actual data showing the proportion of homeowners underwater according to Zillow’s calculations, starting in the 13th column (M in the spreadsheet). We’ll want to tidy these data by gathering all the data columns (columns 13 to 32, or M to AF in the spreadsheet) into a single pair of key and value columns.
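As a rough sketch of what that gathering step looks like, here is a toy version using gather() from the tidyr package. The column names and values below are made up for illustration and are not Zillow's; in the real file the data columns would be 13:32 rather than 3:4.

```r
library(tidyr)

# Toy stand-in for the wide file: two descriptive columns, then one column per quarter.
# Names and values are invented for illustration only.
wide <- data.frame(
  RegionName = c("State A", "State B"),
  RegionType = c("State", "State"),
  `2015Q4`   = c(0.15, 0.10),
  `2016Q1`   = c(0.14, 0.09),
  check.names = FALSE
)

# Gather the data columns (3 and 4 here) into one key column and one value column
tidy <- gather(wide, key = "quarter", value = "neg_equity", 3:4)
```

The result has one row per region and quarter, which is the shape most plotting and summarizing functions expect.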
But after reading Jonathan’s code examples I see that I can try a slightly different approach to get to the same place.
Okay let’s use these data for something
Let’s examine trends in negative equity by state using these data. Once again we’ll use the R packages tweenr and animation to make an animated gif. See my earlier post about tweenr for an introduction to tweenr, and more examples here and here.
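For a feel of what tweenr contributes, here is a minimal sketch using tween_states(), which interpolates between a list of data frames that share the same columns. The two "states" below are invented numbers, not Zillow estimates, and the frame count and easing are arbitrary choices.

```r
library(tweenr)

# Two made-up snapshots of the same regions at two dates
q1 <- data.frame(region = factor(c("A", "B")), share = c(0.30, 0.20))
q2 <- data.frame(region = factor(c("A", "B")), share = c(0.15, 0.10))

# Interpolate between the snapshots; each output row carries a .frame index
frames <- tween_states(list(q1, q2), tweenlength = 2, statelength = 1,
                       ease = "cubic-in-out", nframes = 20)
```

Each value of .frame can then be drawn as its own plot, and the sequence of plots stitched into a gif, for example with saveGIF() from the animation package.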
Swarms and swarms
I spend a lot of time looking at the annual Home Mortgage Disclosure Act (HMDA) data. The publicly available data are a great source of information for what’s happened in the mortgage market in the past year. The data are only available with a lag. In September/October we’ll get the 2015 data.
The data are housed over at the Consumer Financial Protection Bureau webpage. They have a public API, though I haven’t seen anything written on using R with it. They also have a nice summary file generator, which can be quite handy. Here, though, we’ll work with the loan-level data.
For this exercise we’re going to work with mortgage loan origination records from the 2014 HMDA data.
Every year there are a lot of mortgages originated. Some have even said that this year (2016) the total market will top $2 trillion in originated mortgages. In 2014 there were over 5 million mortgage loans originated for 1-4 family dwellings and manufactured housing. For the following examples I downloaded a loan-level file including all 5 million-plus observations from the CFPB website. To make things marginally faster I restricted myself to conventional loans, bringing the raw count down to about 4 million loans.
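A sketch of that read-and-restrict step, using fread() from the data.table package on a tiny stand-in file. The column names are my assumption about the CFPB download and should be checked against the actual file header; the rows are invented. In the HMDA coding, loan type 1 is a conventional loan.

```r
library(data.table)

# Write a tiny stand-in file; the real download is far larger.
# Column names are assumed, and the rows are made up.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "state_abbr,loan_type,loan_purpose,loan_amount_000s",
  "CA,1,1,400",
  "TX,2,1,180",
  "VA,1,3,250"
), tmp)

# Read only the columns we need, which helps a lot with a multi-gigabyte file
hmda <- fread(tmp, select = c("state_abbr", "loan_type", "loan_purpose", "loan_amount_000s"))

# Keep conventional loans only (loan_type code 1)
conv <- hmda[loan_type == 1]
```

Selecting columns at read time is the main design choice here: it keeps memory use down before any filtering happens.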
I like to look at distributions in different ways, so I thought I would try out beeswarm plots. I was not disappointed, especially when I combined them with tweenr.
For this example, I’m going to use the beeswarm package to generate beeswarm plots that show the distribution of loan amount across states, broken out by purchase and refinance loans. And we’ll use tweenr to animate the transitions.
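To show the basic idea on something small, here is a sketch with the beeswarm() function on simulated loan amounts. The data are random draws, not HMDA records, and the distribution parameters are arbitrary.

```r
library(beeswarm)

set.seed(1)
# Simulated loan amounts in $000s, for illustration only
loans <- data.frame(
  purpose = rep(c("Purchase", "Refinance"), each = 50),
  amount  = c(rlnorm(50, meanlog = 5.3, sdlog = 0.5),
              rlnorm(50, meanlog = 5.1, sdlog = 0.5))
)

# do.plot = FALSE returns the computed point positions without drawing anything
swarm <- beeswarm(amount ~ purpose, data = loans, do.plot = FALSE)

# Draw the swarm itself
beeswarm(amount ~ purpose, data = loans, pch = 16,
         ylab = "Loan amount ($000s)")
```

Because the first call hands back the x and y position of every point, those coordinates could in principle be passed to tweenr so the points glide from one swarm to the next.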
Making the plot
I’ve pulled down a .csv file from the CFPB website that contains all the conventional purchase and refinance 1-4 family dwelling and manufactured housing mortgage originations in 2014. This link will take you to the CFPB webpage where you can download the file. It is about 2.6 GB, so I wouldn’t recommend opening it in Excel. Though I have done such a thing in the past.
But I’m not here to talk about the past (that’s a story for another post). I’m here to talk about beeswarms.