IN THIS POST I WANT TO DOCUMENT some R code I’ve recently been working on that combines maps and distribution plots. As I discussed earlier, lots of interesting data will be released in the fall, and I want to be ready for it.
Some of these snippets can be recycled when the new data is available.
One area of data visualization with R I haven’t explored much is mapping. Part of the reason is that I’ve had other tools to use, but usually it’s because I’m in a hurry. Maybe because of prior experience with SPlus and the way ggplot is structured, I’ve found making other statistical graphs (line charts, scatters, etc.) relatively easy in R. But maps have been a different story.
But there is a lot of cool stuff being done in R with maps. I have found recent posts by Kyle Walker and Julia Silge very helpful in getting me started. For example, this post by Kyle and this one by Julia have been my launching points for this analysis. I’m sure there are lots of other people doing interesting stuff (please tell me about it), but those two articles were useful for relative beginners such as myself.
For this post I’m going to return to the 2014 Home Mortgage Disclosure Act (HMDA) data, which you can get from the Consumer Financial Protection Bureau (CFPB) webpage and which I discussed earlier.
The goal will be to create a choropleth map showing some summary statistics of the HMDA data and to integrate it with some other statistical analysis. One example we’ll try to build is this graph I posted on Twitter last night:
If you haven’t already, go ahead and click on the follow button.
The map in the upper left corner is a choropleth map showing the median loan size by county for mortgage loans originated in California in 2014. The other charts are beeswarm plots showing the distribution of loan size for California as a whole (upper right corner) and for each of the metro areas (and a non-metro residual) located in California.
Building the maps
I’m going to be adapting the code from Julia’s post to make this map.
First let’s get the data we need. I’m going to be using the same HMDA data from before that you can download from the CFPB via this link.
The HMDA data in the .csv file includes state and county names, but not Federal Information Processing Series (FIPS) codes. FIPS codes are convenient for working with geographic data. Names are sometimes formatted differently (do they include the word “county,” for example?), so having FIPS codes makes our life easier. Fortunately, the Census has a convenient lookup file we can use. This or a related file might well be included somewhere in some R package, even one I use, but I didn’t see it.
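Here is a minimal sketch of the lookup-and-summarize step in base R. It is not the code from the original analysis: the tiny `fips_lookup` table stands in for the Census file, and the column names (`state_name`, `county_name`, `loan_amount_000s`) and loan amounts are made up for illustration. Only the two California FIPS codes are real.

```r
# Hypothetical stand-in for the Census lookup file (the real one lists every county)
fips_lookup <- data.frame(
  state_name  = c("California", "California"),
  county_name = c("Alameda County", "Los Angeles County"),
  state_fips  = c(6L, 6L),
  county_fips = c(1L, 37L),
  stringsAsFactors = FALSE
)

# 5-digit FIPS code: 2-digit state code + 3-digit county code, zero padded
fips_lookup$fips <- sprintf("%02d%03d", fips_lookup$state_fips, fips_lookup$county_fips)

# Made-up loan records keyed by name, as in the HMDA .csv (amounts in $ thousands)
loans <- data.frame(
  state_name  = "California",
  county_name = c("Alameda County", "Alameda County", "Alameda County",
                  "Los Angeles County", "Los Angeles County"),
  loan_amount_000s = c(300, 450, 600, 250, 400),
  stringsAsFactors = FALSE
)

# Attach FIPS codes by matching on state and county name
loans <- merge(loans,
               fips_lookup[, c("state_name", "county_name", "fips")],
               by = c("state_name", "county_name"))

# Median loan size per county, keyed by FIPS
county_median <- aggregate(loan_amount_000s ~ fips, data = loans, FUN = median)
names(county_median)[2] <- "median_loan_000s"
print(county_median)
```

Storing the FIPS code as zero-padded text matters: read as a number, California’s leading zero would be dropped and the codes would no longer line up with most map data.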
Now that we’ve loaded our data and summarized it, let’s load some maps.
Let’s test to see if our map is working. We’re going to use geom_map() to make our maps. Let’s try to make a state map first:
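A bare-bones state map with geom_map() might look like the sketch below. It assumes the ggplot2 and maps packages are installed (map_data() pulls its outlines from maps); the object names are my own for illustration.

```r
library(ggplot2)  # map_data() also needs the 'maps' package installed

# State outlines as a data frame with long, lat, group, order, region, subregion
states_map <- map_data("state")

# One row per state to draw; map_id links each row to a 'region' in states_map
state_names <- data.frame(region = unique(states_map$region),
                          stringsAsFactors = FALSE)

p_states <- ggplot(state_names, aes(map_id = region)) +
  geom_map(map = states_map, fill = "grey85", color = "white") +
  # geom_map() does not set axis limits by itself, so expand them to the outlines
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_quickmap() +
  theme_void()

print(p_states)
```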
All right! It’s alive. Now let’s try to make a more complicated map with some styling. Let’s make a choropleth map of the United States with each county colored according to the median loan size of mortgage loans originated in 2014.
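As a sketch of the county choropleth, the block below fills each county with a random placeholder value rather than the actual HMDA medians, so the colors mean nothing; it only shows the mechanics. The "state,county" key and the object names are assumptions for illustration.

```r
library(ggplot2)  # map_data() also needs the 'maps' package installed

counties_map <- map_data("county")

# geom_map() matches on the 'region' column, so overwrite it with a
# "state,county" key that is unique across the country
counties_map$region <- paste(counties_map$region, counties_map$subregion, sep = ",")

# Placeholder values: random numbers standing in for median loan size ($ thousands)
set.seed(2014)
county_stats <- data.frame(region = unique(counties_map$region),
                           stringsAsFactors = FALSE)
county_stats$median_loan_000s <- round(runif(nrow(county_stats), 50, 600))

p_counties <- ggplot(county_stats, aes(map_id = region)) +
  geom_map(aes(fill = median_loan_000s), map = counties_map, color = NA) +
  expand_limits(x = counties_map$long, y = counties_map$lat) +
  coord_quickmap() +
  scale_fill_viridis_c(name = "Median loan size\n($ thousands)") +
  theme_void() +
  labs(title = "Median loan size by county (simulated values)")

print(p_counties)
```

With real data you would join the summary statistics to the map by FIPS code instead of by name. I believe the maps package ships a `county.fips` table that links these "state,county" names to FIPS codes, which may save some work.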
Excellent! Now all we have to do is combine this map with our beeswarm plots and we’ll be good to go.
Subsetting map and adding distribution plot
The composite map is interesting, but it’s hard to see much beyond some general patterns (the coasts have high median loan sizes, while loan sizes in the Midwest and non-coastal Southeast tend to be smaller). Let’s zoom in on a particular state and add a plot showing the distribution of loan sizes.
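One way to sketch the zoomed-in map next to a distribution plot is shown below. A few assumptions to flag: the values are random placeholders, the beeswarm here shows one dot per county rather than one per loan, and I use gridExtra::arrangeGrob() for the side-by-side layout, which is just one of several ways to combine plots. It needs ggplot2, maps, ggbeeswarm and gridExtra installed.

```r
library(ggplot2)
library(ggbeeswarm)  # geom_quasirandom() for the beeswarm-style plot
library(gridExtra)   # arrangeGrob(), assumed here for the layout

# County outlines for one state; use the county name as the map key
ca_map <- map_data("county", region = "california")
ca_map$region <- ca_map$subregion

# Placeholder county values standing in for median loan size ($ thousands)
set.seed(1)
ca_stats <- data.frame(region = unique(ca_map$region), stringsAsFactors = FALSE)
ca_stats$median_loan_000s <- round(runif(nrow(ca_stats), 100, 700))

p_map <- ggplot(ca_stats, aes(map_id = region)) +
  geom_map(aes(fill = median_loan_000s), map = ca_map, color = "white") +
  expand_limits(x = ca_map$long, y = ca_map$lat) +
  coord_quickmap() +
  scale_fill_viridis_c(name = "$ thousands") +
  theme_void()

# Same values as a distribution, colored on the same scale as the map
p_swarm <- ggplot(ca_stats,
                  aes(x = "California", y = median_loan_000s,
                      color = median_loan_000s)) +
  geom_quasirandom() +
  scale_color_viridis_c(guide = "none") +
  labs(x = NULL, y = "Median loan size ($ thousands, simulated)") +
  theme_minimal()

# Map on the left, distribution on the right
combined <- arrangeGrob(p_map, p_swarm, ncol = 2)
grid::grid.newpage()
grid::grid.draw(combined)
```

Sharing the viridis scale between the two panels lets a reader match a dot in the distribution to the color of a county on the map.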
We can loop through all states and make a gif out of it:
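A rough sketch of the loop is below. It draws plain county outlines for three states (chosen arbitrarily) rather than the full map-plus-distribution panels, and the helper `plot_state()` and the output file name are hypothetical. The GIF step uses animation::saveGIF(), which relies on ImageMagick, so I only run it when both are available.

```r
library(ggplot2)  # map_data() also needs the 'maps' package installed

# Hypothetical helper: one county-outline map per state
plot_state <- function(state) {
  m <- map_data("county", region = state)
  m$region <- m$subregion  # county name becomes the map key
  ggplot(data.frame(region = unique(m$region), stringsAsFactors = FALSE),
         aes(map_id = region)) +
    geom_map(map = m, fill = "grey85", color = "white") +
    expand_limits(x = m$long, y = m$lat) +
    coord_quickmap() +
    theme_void() +
    labs(title = state)
}

states_to_plot <- c("california", "texas", "ohio")
frames <- lapply(states_to_plot, plot_state)

# Each print() inside saveGIF() becomes one frame of the animation
if (requireNamespace("animation", quietly = TRUE) && nzchar(Sys.which("magick"))) {
  animation::saveGIF(
    for (p in frames) print(p),
    movie.name = "state_maps.gif",
    interval = 1  # seconds per frame
  )
}
```

One thing to watch if you extend this to all states: map_data() matches region names loosely by default, so a name like "virginia" can also pick up "west virginia".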
County Population Example
I began figuring out how to make maps in R by replicating Julia Silge’s excellent example. Then I combined tweenr and beeswarm plots to show how county population is distributed across states:
Code for county population example
Once again we’ll use the R packages tweenr and animation to make an animated gif. See my earlier post about tweenr for an introduction to tweenr, and more examples here and here.
In order to get the tweenr animation to work, I need to ensure that each data frame I feed to the tween_states() function has the same number of observations. But states have different numbers of counties. Texas has the most with 254 counties.
We’ll set up a blank data frame based on Texas and index the county number from 1 to 254. We’ll merge each state to this data frame, which will give us missing values at every county index beyond that state’s own number of counties.
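The padding step can be sketched in base R as follows. The two toy "states" and their populations are made up, and `pad_state()` is a hypothetical helper name; the point is only that every padded data frame ends up with the same 254 rows.

```r
# Made-up county populations for two small "states"
county_pop <- data.frame(
  state      = c("A", "A", "A", "B", "B"),
  population = c(1200, 5400, 800, 30000, 2500),
  stringsAsFactors = FALSE
)

max_counties <- 254  # Texas, the state with the most counties
template <- data.frame(county_index = seq_len(max_counties))

# Hypothetical helper: number a state's counties, then left-join onto the template
pad_state <- function(df, state_name) {
  one <- df[df$state == state_name, , drop = FALSE]
  one$county_index <- seq_len(nrow(one))
  padded <- merge(template, one, by = "county_index", all.x = TRUE)
  padded$state <- state_name  # fill the state label on the padded rows too
  padded
}

padded_states <- lapply(unique(county_pop$state), pad_state, df = county_pop)

# Every data frame now has 254 rows; population is NA past the state's last county
sapply(padded_states, nrow)
```

The missing rows still need some handling before plotting, for example dropping them or parking them off-screen, and how best to do that depends on the animation.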
What hasn’t worked out so well (yet)
I’ve got a few other ideas that haven’t quite worked out. You can learn a lot from failures (especially your own), so I’ll include an example.
I wanted to compare the distribution of county population across states to the U.S. as a whole. So I thought I’d use a beeswarm plot (with geom_quasirandom). I wanted the state distribution to literally fall out of the national distribution using an animation. I got this far:
Technically I had some trouble getting the labels on the plot (which state is dropping), but I think there’s more wrong with it than that. Maybe we can fix it up for a later post.