2016/08/23

dataviz / R / maps / demographics

SOMETIMES YOU ACTUALLY LEARN SOMETHING from social media. Today on Twitter I happened across this Tweet via @kyle_e_walker:

Seems somebody posted estimates of the U.S. population by county (defined by 2010 county definitions) going back to 1790. This is a perfect dataset to practice my mapping with R.

The data are conveniently available via the University of Minnesota. The data come in a nice spreadsheet that we can easily import into R and manipulate.

We’re going to build this:

Getting started

Let’s begin my loading libraries. I don’t recall which of these libraries are essential for what follows, but here’s what I was working with.

{% highlight r #load libraries library(readxl) library(data.table) library(dplyr) library(tidyr) library(ggbeeswarm) library(viridis) library(ggplot2) library(scales) library(ggthemes) library(tweenr) library(purrr) library(animation) library(acs) library(reshape2) library(stringr) library(ggalt) library(rgeos) library(maptools) library(albersusa) library(broom) library(ineq) #used for concentration indices {% endhighlight

Now we can import the data. I’ve saved the spreadsheet to my data directory.

There are two excel worksheets we are interested in using. The c2010_hist_pops worksheet contains population estimates from 1790 to 2010 arranged with years as columns and each county going down. The national_shares worksheet contains the share of total U.S. population for each county.

Let’s load the data and take a look:

{% highlight r library(readxl) library(“htmlTable”) popDF<-read_excel(“data/county2010_hist_pops.xlsx”,sheet=“c2010_hist_pops”) shareDF<-read_excel(“data/county2010_hist_pops.xlsx”,sheet=“national_shares”)

popDF.print<-popDF shareDF.print<-shareDF

#round the decimals for printing is.num <- sapply(popDF.print, is.numeric) popDF.print[is.num] <- lapply(popDF.print[is.num], round, 0) is.num <- sapply(shareDF.print, is.numeric) shareDF.print[is.num] <- lapply(shareDF.print[is.num], round, 2)

make tables for viewing

htmlTable(head(popDF.print), col.rgroup = c(“none”, “#F7F7F7”),caption=“Population table”, tfoot="†c2010_hist_pops worksheet") {% endhighlight

{% highlight r htmlTable(head(shareDF.print), col.rgroup = c(“none”, “#F7F7F7”),caption=“Population share table”, tfoot="† national_shares_worksheet") {% endhighlight

In order to get our code to work, we need to gather the data columns and created some year variables. If you scroll all the way to the right you’ll also see that the last column has a slightly different name (pop2010 instead of epop2000), which will have to account for in our code.

{% highlight r #set up the population data frame: popDF<-popDF %>% gather(year.of,“pop”,5:27) #create years, adjusting for last column: popDF<-mutate(popDF,year=ifelse(substr(year.of,1,1)==“e”,substr(year.of,5,9),substr(year.of,4,8))) popDF<-rename(popDF,fips=GEOID10 )

and shares data frame:

shareDF<-shareDF %>% gather(year.of,“share”,5:27) shareDF<-mutate(shareDF,year=substr(year.of,6,10)) shareDF<-rename(shareDF,fips=GEOID10 )

and take a look:

htmlTable(head(popDF), col.rgroup = c(“none”, “#F7F7F7”),caption=“Population table-tidy version”, tfoot="†c2010_hist_pops worksheet") {% endhighlight

All right! Now we’re on easy street. We just have to plug these data into the maps like we did last time. We’ll add some flourishes to get the animation to work.

{% highlight r #set up our map stuff: states<-usa_composite() smap<-fortify(states,region=“fips_state”) smap.all<-smap counties <- counties_composite() #get data: counties@data <- left_join(counties@data, popDF.i, by = “fips”) counties@data <- left_join(counties@data, shareDF.i, by = “fips”)

cmap <- fortify(counties_composite(), region=“fips”) cmap$state<-substr(cmap$id,1,2) cmap$county<-substr(cmap$id,3,5) cmap$fips<-paste0(cmap$state,cmap$county) cmap.all<-cmap

#we want to cite these data appropriately. The caption gets long so we’ll follow @hrbrmstr #https://www.r-bloggers.com/supreme-annotations/ This post has a bunch of tricks, like the caption one I use here: mycaption<-"@lenkiefer Source: Schroeder, Jonathan P. (2016). Historical Population Estimates for 2010 U.S. States, Counties and Metro/Micro Areas, 1790-2010. Retrieved from the Data Repository for the University of Minnesota, http://doi.org/10.13020/D6XW2H." mycaption <- paste0(strwrap(mycaption, 130), sep="", collapse="\n") {% endhighlight

We could just loop through the data subsetting on years. That would be easy, but the animation would be choppy. We want it to be smoother so we’ll need to use tweenr. See my earlier post about tweenr for an introduction to tweenr, and more examples here and here.

To make this work, we’ll build two functions. One to create the maps and distribution plot and a second function to create the data the map/plot function will use.

{% highlight r source(‘code/multiplot.R’)

function for combining graphs see: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

myplotf<-function(popDF.i){ popDF.i<-data.table(popDF.i) #convert to data.table for later counties <- counties_composite() #get data: counties@data <- left_join(counties@data, popDF.i, by = “fips”)

map1<- ggplot() + geom_map(data = cmap.all, map = cmap.all, aes(x = long, y = lat, map_id = id), #color = “#2b2b2b”, size = 0.05, fill = NA) + color = “lightgray”, size = 0.05, fill = NA) + geom_map(data = counties@data, map = cmap.all, aes(fill =log(pop), map_id = fips), color = NA) + geom_map(data = smap.all, map = smap.all, aes(x = long, y = lat, map_id = id), color = “lightgray”, size = .25, fill = NA) + theme_map( base_size = 12) + theme(plot.title=element_text( size = 16, margin=margin(b=10))) + theme(plot.subtitle=element_text(size = 14, margin=margin(b=-20))) + theme(plot.caption=element_text(size = 9, margin=margin(t=-15))) + coord_proj(us_laea_proj) + labs(title="",subtitle="" ) + scale_fill_viridis(name=“Population\nlog scale\n”, discrete=F,option=“D”,end=0.95,direction=-1,limits=c(log(100),log(1e7)), breaks=c(log(100),log(10000),log(100000),log(1e7)), labels=c(“100”,“10,000”,“100,000”,“1,000,000”)) + theme(legend.position = “right”) +theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+ labs(x=“Population (log scale)",y="”, #subtitle=“Each dot represents 1 county”, title=paste0(“County Population Distributions in “,popDF.i[1]$year))

#remove counties with 0 pop as they mess up the log scale popDF.i[,pop:=ifelse(pop>0,pop,NA)]

graph1<- ggplot(popDF.i[STATE !=“District Of Columbia” ],
#we need 50 states for the columns to look pretty so D.C. has to go aes(y="",x=pop,color=log(pop)))+ theme_minimal()+
theme(plot.title=element_text(size=14))+theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+ theme(plot.margin=unit(c(0.25,0.25,0.25,0.25),“cm”))+ theme(legend.position = “none”)+ geom_quasirandom(alpha=0.65,size=1) + #I like the quasirandom beeswarm option scale_color_viridis(name=“Pop\nlog scale\n”, discrete=F,option=“D”,end=0.95,direction=-1,limits=c(log(100),log(1e7)), breaks=c(log(1000),log(100000),log(1e7)), labels=c(“1000”,“100,000”,“10M”)) + theme(axis.text.x = element_text(size=4))+ #shrink axis text size scale_x_log10( labels=c(“10K”,“1M”),limits=c(1000,1e7),breaks=c(10000,100000))+ labs(x=“Population”,y="”, subtitle=“Each dot represents 1 county”, #title=“County Population Distributions”, caption=mycaption )+ facet_wrap(~STATE,ncol=10)

m<-multiplot(map1,graph1,layout=matrix(c(1,1,2,2,2), nrow=5, byrow=TRUE)) }

y.list<-unique(popDF$year) #get a list of years yy<-y.list[12] #pick a year 1900 in this case popDF.i<-subset(popDF,year==yy) myplotf(popDF.i) #try it out! {% endhighlight

plot of chunk fig-graph-1

Okay, now that we have the graph, we just have to set up the tween_states() call and loop through the animation.

{% highlight r myf <- function(yy){ my.out<-popDF[year==yy,] #subset data based on year my.out %>% map_if(is.character, as.factor) %>% as.data.frame() ->my.out #need to convert characters to factors for tweenr return(data.frame(my.out))}

my.list<-lapply(y.list,myf)

tf <- tween_states(my.list,tweenlength= 3, statelength=1, ease=rep(‘cubic-in-out’,2),nframes=23*4) tf<-data.table(tf) #data.table useful for subsetting N<-max(tf$.frame) #number of frames

#create the animation oopt = ani.options(interval = 0.15) saveGIF({for (i in 1:N) { myplotf(tf[.frame==i]) ani.pause() } },movie.name=“pop gif2.gif”,ani.width = 900, ani.height = 700) {% endhighlight

And that’s it. Run through this code and you’ll get the animated gif I posted above.

But wait! There’s more

Let’s make a few other static graphs to explore some other features of this fascintating dataset.

Small multiple map

Let’s construct a small multiple map, showing how the share of population has varied by county across time.

{% highlight r p2<-subset(popDF,year %in% c(“1810”,“1860”,“1910”,“1960”,“1980”,“2010”)) counties <- counties_composite() #get data: counties@data <- left_join(counties@data, p2, by = “fips”)

map3<- ggplot() + geom_map(data = cmap.all, map = cmap.all, aes(x = long, y = lat, map_id = id), color = “lightgray”, size = 0.05, fill = NA) + geom_map(data = counties@data, map = cmap.all, aes(fill =log(pop), map_id = fips), color = NA) + geom_map(data = smap.all, map = smap.all, aes(x = long, y = lat, map_id = id), color = “lightgray”, size = .25, fill = NA) + theme_map( base_size = 12) + theme(plot.title=element_text( size = 16, margin=margin(b=10))) + theme(plot.subtitle=element_text(size = 14, margin=margin(b=-20))) + theme(plot.caption=element_text(size = 9, margin=margin(t=-15))) + coord_proj(us_laea_proj) + labs(title="",subtitle=”" ) + scale_fill_viridis(name=“Population\nlog scale\n”, discrete=F,option=“D”,end=0.95,direction=-1,limits=c(log(100),log(1e7)), breaks=c(log(100),log(10000),log(100000),log(1e7)), labels=c(“100”,“10,000”,“100,000”,“1,000,000”)) + theme(legend.position = “right”) +theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+ labs(x=“Population (log scale)”, caption=mycaption, title=“U.S. County Population by Year\n”)+facet_wrap(~year)

map3 {% endhighlight

plot of chunk fig-graph-3

Aren’t ggplot2() and geom_map() just great?

How concentrated is the US population?

Several maps have been circulating showing how 1/2 of some country’s population lives in a tiny part of the physical area of a country. Let’s use the share data to investigate how the concentration of population might have changed from 1790 to 2010.

{% highlight r #load shares data shareDF<-read_excel(“data/county2010_hist_pops.xlsx”,sheet=“national_shares”) shareDF<-shareDF %>% gather(year.of,“share”,5:27) shareDF<-mutate(shareDF,year=substr(year.of,6,10)) shareDF<-rename(shareDF,fips=GEOID10 ) {% endhighlight

{% highlight text

Error in rename(shareDF, fips = GEOID10): unused argument (fips = GEOID10)

{% endhighlight

{% highlight r s2<-subset(shareDF,year %in% c(“1810”,“1860”,“1910”,“1960”,“1980”,“2010”)) s2<-data.table(s2)[order(year,-share)] #order by share of total population s2<-s2[, id := 1:.N, by = year][order(year,id)] #create an index library(zoo) s2<-s2[, “:="(share.run=rollapply(share,id,sum,align=“right”)), by=year] #compute running total by share

ggplot(s2,
#we need 50 states for the columns to look pretty so D.C. has to go aes(y=share.run,x=id,color=year,label=paste0(” “,COUNTY,” County, “,STATE,"\n “,round(share.run,1),"% of total”)))+ geom_hline(yintercept=50,linetype=2,color=“black”,size=0.5)+ theme_minimal()+ geom_line() +facet_wrap(~year)+ geom_text(data=s2[id==1],hjust=0,nudge_x=20,size=3,nudge_y=5)+ theme(legend.position = “none”)+ theme(plot.title=element_text(size=14))+theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+ theme(plot.margin=unit(c(0.25,0.25,0.25,0.25),“cm”))+ theme(legend.position = “none”)+ labs(y=“Cumulative share of population (percent)",x=“County population rank”, title=“County Population Cumulative Share”, caption=mycaption ) {% endhighlight

plot of chunk fig-graph-4

This plot shows how the cumulative population share by county (sorted by largest to smallest county) has varied. In 1810, Philadelphia county accounted for 1.5% of the total U.S. population. In 2010, Los Angeles County was the largest and accounted for 3.2% of the total U.S. population.

Let’s find the county where the cumulative line crosses at 50 where it ranked.

{% highlight r print.table<-merge(s2[ share.run<=50, list(crossing.county.number=max(id)), by=year], s2[ share>0,list(total.county.number=.N), by=year], by=“year”) print.table[,":="(percent.for.50.percent=percent(crossing.county.number/total.county.number))]

htmlTable(print.table, header=c(“Year”,“Rank at 50%”,“Total Counties”,"% for 50%"), col.rgroup = c(“none”, “#F7F7F7”),caption=“How many counties comprise 50% of U.S. population?”, tfoot="†Counties ranked (descending) in terms of share of total population\nTable revised 8/25/2016 to correct # of counties\n(some omitted due to rounding in earlier version)") {% endhighlight

U.S. county population: 1790-2010