23 August 2016

SOMETIMES YOU ACTUALLY LEARN SOMETHING from social media. Today on Twitter I happened across this Tweet via @kyle_e_walker:

Seems somebody posted estimates of the U.S. population by county (defined by 2010 county definitions) going back to 1790. This is a perfect dataset to practice my mapping with R.

The data are conveniently available via the University of Minnesota. The data come in a nice spreadsheet that we can easily import into R and manipulate.

We’re going to build this:

population map

Getting started

Let’s begin my loading libraries. I don’t recall which of these libraries are essential for what follows, but here’s what I was working with.

#load libraries
library(readxl)
library(data.table)
library(dplyr)
library(tidyr)
library(ggbeeswarm)
library(viridis)
library(ggplot2)
library(scales)
library(ggthemes)
library(tweenr)
library(purrr)
library(animation)
library(acs)
library(reshape2)
library(stringr)
library(ggalt)
library(rgeos)
library(maptools)
library(albersusa)
library(broom)
library(ineq) #used for concentration indices

Now we can import the data. I’ve saved the spreadsheet to my data directory.

There are two excel worksheets we are interested in using. The c2010_hist_pops worksheet contains population estimates from 1790 to 2010 arranged with years as columns and each county going down. The national_shares worksheet contains the share of total U.S. population for each county.

Let’s load the data and take a look:

library(readxl)
library("htmlTable")
popDF<-read_excel("data/county2010_hist_pops.xlsx",sheet="c2010_hist_pops")
shareDF<-read_excel("data/county2010_hist_pops.xlsx",sheet="national_shares")

popDF.print<-popDF
shareDF.print<-shareDF

#round the decimals for printing
is.num <- sapply(popDF.print, is.numeric)
popDF.print[is.num] <- lapply(popDF.print[is.num], round, 0)
is.num <- sapply(shareDF.print, is.numeric)
shareDF.print[is.num] <- lapply(shareDF.print[is.num], round, 2)

# make tables for viewing
htmlTable(head(popDF.print), col.rgroup = c("none", "#F7F7F7"),caption="Population table",
          tfoot="&dagger;c2010_hist_pops worksheet")
Population table
GISJOIN GEOID10 STATE COUNTY epop1790 epop1800 epop1810 epop1820 epop1830 epop1840 epop1850 epop1860 epop1870 epop1880 epop1890 epop1900 epop1910 epop1920 epop1930 epop1940 epop1950 epop1960 epop1970 epop1980 epop1990 epop2000 pop2010
1 G0100010 01001 Alabama Autauga 0 0 0 2313 6332 7684 8049 8968 11623 13108 13330 17915 20038 18908 19694 20977 18186 18739 24460 32259 34222 43671 54571
2 G0100030 01003 Alabama Baldwin 0 22 26 646 1792 2263 3385 5774 6004 8603 8941 13194 18178 20730 28289 32324 40997 49088 59382 78556 98280 140415 182265
3 G0100050 01005 Alabama Barbour 0 0 0 428 2173 11163 21938 28902 29298 33979 34898 35152 32728 32067 32425 32722 28892 24700 22543 24756 25417 29038 27457
4 G0100070 01007 Alabama Bibb 0 0 0 2908 6144 7088 8526 10188 7512 9549 13948 18650 22884 23236 20780 20155 17987 14357 13812 15723 16576 19889 22915
5 G0100090 01009 Alabama Blount 0 0 0 1153 2418 4346 5637 8252 7377 13357 19362 20445 21456 25538 28020 29490 28975 25449 26853 36459 39248 51022 57322
6 G0100110 01011 Alabama Bullock 0 0 0 611 2113 8126 14665 18727 24500 29097 27077 31944 30196 25333 20016 19810 16054 13462 11824 10596 11042 11626 10914
†c2010_hist_pops worksheet
htmlTable(head(shareDF.print), col.rgroup = c("none", "#F7F7F7"),caption="Population share table",
          tfoot="&dagger; national_shares_worksheet")
Population share table
GISJOIN GEOID10 STATE COUNTY share1790 share1800 share1810 share1820 share1830 share1840 share1850 share1860 share1870 share1880 share1890 share1900 share1910 share1920 share1930 share1940 share1950 share1960 share1970 share1980 share1990 share2000 share2010
1 G0100010 01001 Alabama Autauga 0 0 0 0.02 0.05 0.05 0.03 0.03 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.02 0.02
2 G0100030 01003 Alabama Baldwin 0 0 0 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.03 0.03 0.04 0.05 0.06
3 G0100050 01005 Alabama Barbour 0 0 0 0 0.02 0.07 0.09 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.03 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01
4 G0100070 01007 Alabama Bibb 0 0 0 0.03 0.05 0.04 0.04 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01
5 G0100090 01009 Alabama Blount 0 0 0 0.01 0.02 0.03 0.02 0.03 0.02 0.03 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.02 0.02 0.02 0.02
6 G0100110 01011 Alabama Bullock 0 0 0 0.01 0.02 0.05 0.06 0.06 0.06 0.06 0.04 0.04 0.03 0.02 0.02 0.01 0.01 0.01 0.01 0 0 0 0
† national_shares_worksheet

In order to get our code to work, we need to gather the data columns and created some year variables. If you scroll all the way to the right you’ll also see that the last column has a slightly different name (pop2010 instead of epop2000), which will have to account for in our code.

#set up the population data frame:
popDF<-popDF %>% gather(year.of,"pop",5:27) 
#create years, adjusting for last column:
popDF<-mutate(popDF,year=ifelse(substr(year.of,1,1)=="e",substr(year.of,5,9),substr(year.of,4,8)))
popDF<-rename(popDF,fips=GEOID10 )

# and shares data frame:
shareDF<-shareDF %>% gather(year.of,"share",5:27) 
shareDF<-mutate(shareDF,year=substr(year.of,6,10))
shareDF<-rename(shareDF,fips=GEOID10 )

# and take a look:
htmlTable(head(popDF), col.rgroup = c("none", "#F7F7F7"),caption="Population table-tidy version",
          tfoot="&dagger;c2010_hist_pops worksheet")
Population table-tidy version
GISJOIN fips STATE COUNTY year.of pop year
1 G0100010 01001 Alabama Autauga epop1790 0 1790
2 G0100030 01003 Alabama Baldwin epop1790 0 1790
3 G0100050 01005 Alabama Barbour epop1790 0 1790
4 G0100070 01007 Alabama Bibb epop1790 0 1790
5 G0100090 01009 Alabama Blount epop1790 0 1790
6 G0100110 01011 Alabama Bullock epop1790 0 1790
†c2010_hist_pops worksheet

All right! Now we’re on easy street. We just have to plug these data into the maps like we did last time. We’ll add some flourishes to get the animation to work.

#set up our map stuff:
states<-usa_composite()
smap<-fortify(states,region="fips_state")
smap.all<-smap
counties <- counties_composite()
#get data:
counties@data <- left_join(counties@data, popDF.i, by = "fips")
counties@data <- left_join(counties@data, shareDF.i, by = "fips")

cmap <- fortify(counties_composite(), region="fips")
cmap$state<-substr(cmap$id,1,2)
cmap$county<-substr(cmap$id,3,5)
cmap$fips<-paste0(cmap$state,cmap$county)
cmap.all<-cmap

#we want to cite these data appropriately.  The caption gets long so we'll follow @hrbrmstr 
#https://www.r-bloggers.com/supreme-annotations/ This post has a bunch of tricks, like the caption one I use here:
mycaption<-"@lenkiefer Source: Schroeder, Jonathan P. (2016). Historical Population Estimates for 2010 U.S. States, Counties and Metro/Micro Areas, 1790-2010. Retrieved from the Data Repository for the University of Minnesota, http://doi.org/10.13020/D6XW2H."
mycaption <- paste0(strwrap(mycaption, 130), sep="", collapse="\n")

We could just loop through the data subsetting on years. That would be easy, but the animation would be choppy. We want it to be smoother so we’ll need to use tweenr. See my earlier post about tweenr for an introduction to tweenr, and more examples here and here.

To make this work, we’ll build two functions. One to create the maps and distribution plot and a second function to create the data the map/plot function will use.

source('code/multiplot.R')
# function for combining graphs see: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

myplotf<-function(popDF.i){
  popDF.i<-data.table(popDF.i)  #convert to data.table for later
  counties <- counties_composite()
  #get data:
  counties@data <- left_join(counties@data, popDF.i, by = "fips")
  
map1<-
  ggplot() +
  geom_map(data = cmap.all, map = cmap.all,
           aes(x = long, y = lat, map_id = id),
           #color = "#2b2b2b", size = 0.05, fill = NA) +
  color = "lightgray", size = 0.05, fill = NA) +
  geom_map(data = counties@data, map = cmap.all,
           aes(fill =log(pop), map_id = fips),
           color = NA) +
  geom_map(data = smap.all, map = smap.all,
           aes(x = long, y = lat, map_id = id),
           color = "lightgray", size = .25, fill = NA) +
  theme_map( base_size = 12) +
  theme(plot.title=element_text( size = 16, margin=margin(b=10))) +
  theme(plot.subtitle=element_text(size = 14, margin=margin(b=-20))) +
  theme(plot.caption=element_text(size = 9, margin=margin(t=-15))) +
  coord_proj(us_laea_proj) +   labs(title="",subtitle="" ) +
  scale_fill_viridis(name="Population\nlog scale\n",
                      discrete=F,option="D",end=0.95,direction=-1,limits=c(log(100),log(1e7)),
                      breaks=c(log(100),log(10000),log(100000),log(1e7)),
                      labels=c("100","10,000","100,000","1,000,000")) +
  theme(legend.position = "right") +theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  labs(x="Population (log scale)",y="",
       #subtitle="Each dot represents 1 county",
       title=paste0("County Population Distributions in ",popDF.i[1]$year))

#remove counties with 0 pop as they mess up the log scale
popDF.i[,pop:=ifelse(pop>0,pop,NA)]

graph1<-
  ggplot(popDF.i[STATE !="District Of Columbia" ],  
         #we need 50 states for the columns to look pretty so D.C. has to go
         aes(y="",x=pop,color=log(pop)))+
  theme_minimal()+    
  theme(plot.title=element_text(size=14))+theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  theme(plot.margin=unit(c(0.25,0.25,0.25,0.25),"cm"))+  theme(legend.position = "none")+
  geom_quasirandom(alpha=0.65,size=1)  +  #I like the quasirandom beeswarm option
  scale_color_viridis(name="Pop\nlog scale\n",
                      discrete=F,option="D",end=0.95,direction=-1,limits=c(log(100),log(1e7)),
                      breaks=c(log(1000),log(100000),log(1e7)),
                      labels=c("1000","100,000","10M")) +
    theme(axis.text.x = element_text(size=4))+  #shrink axis text size
  scale_x_log10( labels=c("10K","1M"),limits=c(1000,1e7),breaks=c(10000,100000))+
  labs(x="Population",y="",
       subtitle="Each dot represents 1 county",
       #title="County Population Distributions",
       caption=mycaption  )+
  facet_wrap(~STATE,ncol=10)

m<-multiplot(map1,graph1,layout=matrix(c(1,1,2,2,2), nrow=5, byrow=TRUE))
}

y.list<-unique(popDF$year) #get a list of years
yy<-y.list[12]  #pick a year 1900 in this case
popDF.i<-subset(popDF,year==yy)
myplotf(popDF.i)  #try it out!

plot of chunk fig-graph-1

Okay, now that we have the graph, we just have to set up the tween_states() call and loop through the animation.

myf <- function(yy){
  my.out<-popDF[year==yy,]  #subset data based on year
  my.out %>% map_if(is.character, as.factor) %>% as.data.frame() ->my.out  #need to convert characters to factors for tweenr
  return(data.frame(my.out))}

my.list<-lapply(y.list,myf)

tf <- tween_states(my.list,tweenlength= 3, statelength=1, ease=rep('cubic-in-out',2),nframes=23*4)
tf<-data.table(tf) #data.table useful for subsetting
N<-max(tf$.frame)  #number of frames

#create the animation
oopt = ani.options(interval = 0.15)
saveGIF({for (i in 1:N) {
  myplotf(tf[.frame==i])
  ani.pause()  }
  },movie.name="pop gif2.gif",ani.width = 900, ani.height = 700)

And that’s it. Run through this code and you’ll get the animated gif I posted above.

But wait! There’s more

Let’s make a few other static graphs to explore some other features of this fascintating dataset.

Small multiple map

Let’s construct a small multiple map, showing how the share of population has varied by county across time.

p2<-subset(popDF,year %in% c("1810","1860","1910","1960","1980","2010"))
counties <- counties_composite()
  #get data:
  counties@data <- left_join(counties@data, p2, by = "fips")
  
map3<-
  ggplot() +
  geom_map(data = cmap.all, map = cmap.all,
           aes(x = long, y = lat, map_id = id),
           color = "lightgray", size = 0.05, fill = NA) +
  geom_map(data = counties@data, map = cmap.all,
           aes(fill =log(pop), map_id = fips),
           color = NA) +
  geom_map(data = smap.all, map = smap.all,
           aes(x = long, y = lat, map_id = id),
           color = "lightgray", size = .25, fill = NA) +
  theme_map( base_size = 12) +
  theme(plot.title=element_text( size = 16, margin=margin(b=10))) +
  theme(plot.subtitle=element_text(size = 14, margin=margin(b=-20))) +
  theme(plot.caption=element_text(size = 9, margin=margin(t=-15))) +
  coord_proj(us_laea_proj) +   labs(title="",subtitle="" ) +
  scale_fill_viridis(name="Population\nlog scale\n",
                      discrete=F,option="D",end=0.95,direction=-1,limits=c(log(100),log(1e7)),
                      breaks=c(log(100),log(10000),log(100000),log(1e7)),
                      labels=c("100","10,000","100,000","1,000,000")) +
  theme(legend.position = "right") +theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  labs(x="Population (log scale)", caption=mycaption,
       title="U.S. County Population by Year\n")+facet_wrap(~year)

map3

plot of chunk fig-graph-3

Aren’t ggplot2() and geom_map() just great?

How concentrated is the US population?

Several maps have been circulating showing how 1/2 of some country’s population lives in a tiny part of the physical area of a country. Let’s use the share data to investigate how the concentration of population might have changed from 1790 to 2010.

#load shares data
shareDF<-read_excel("data/county2010_hist_pops.xlsx",sheet="national_shares")
shareDF<-shareDF %>% gather(year.of,"share",5:27) 
shareDF<-mutate(shareDF,year=substr(year.of,6,10))
shareDF<-rename(shareDF,fips=GEOID10 )
## Error in rename(shareDF, fips = GEOID10): unused argument (fips = GEOID10)
s2<-subset(shareDF,year %in% c("1810","1860","1910","1960","1980","2010"))
s2<-data.table(s2)[order(year,-share)]  #order by share of total population
s2<-s2[, id := 1:.N, by = year][order(year,id)]  #create an index
library(zoo)
s2<-s2[, ":="(share.run=rollapply(share,id,sum,align="right")), by=year] #compute running total by share


  ggplot(s2,  
         #we need 50 states for the columns to look pretty so D.C. has to go
         aes(y=share.run,x=id,color=year,label=paste0(" ",COUNTY," County, ",STATE,"\n ",round(share.run,1),"% of total")))+
    geom_hline(yintercept=50,linetype=2,color="black",size=0.5)+
  theme_minimal()+   geom_line() +facet_wrap(~year)+
  geom_text(data=s2[id==1],hjust=0,nudge_x=20,size=3,nudge_y=5)+
    theme(legend.position = "none")+
  theme(plot.title=element_text(size=14))+theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  theme(plot.margin=unit(c(0.25,0.25,0.25,0.25),"cm"))+  theme(legend.position = "none")+
      labs(y="Cumulative share of population (percent)",x="County population rank",
       title="County Population Cumulative Share",
       caption=mycaption  )

plot of chunk fig-graph-4

This plot shows how the cumulative population share by county (sorted by largest to smallest county) has varied. In 1810, Philadelphia county accounted for 1.5% of the total U.S. population. In 2010, Los Angeles County was the largest and accounted for 3.2% of the total U.S. population.

Let’s find the county where the cumulative line crosses at 50 where it ranked.

print.table<-merge(s2[ share.run<=50, list(crossing.county.number=max(id)), by=year],
      s2[ share>0,list(total.county.number=.N), by=year],
      by="year")
print.table[,":="(percent.for.50.percent=percent(crossing.county.number/total.county.number))]

htmlTable(print.table, header=c("Year","Rank at 50%","Total Counties","% for 50%"),
          col.rgroup = c("none", "#F7F7F7"),caption="How many counties comprise 50% of U.S. population?",
          tfoot="&dagger;Counties ranked (descending) in terms of share of total population\nTable revised 8/25/2016 to correct # of counties\n(some omitted due to rounding in earlier version)")
How many counties comprise 50% of U.S. population?
Year Rank at 50% Total Counties % for 50%
1 1810 142 1385 10.25%
2 1860 327 2653 12.33%
3 1910 316 3143 10.05%
4 1960 135 3143 4.30%
5 1980 147 3143 4.68%
6 2010 147 3143 4.68%
†Counties ranked (descending) in terms of share of total population
Table revised 8/25/2016 to correct # of counties
(some omitted due to rounding in earlier version)