2017/04/24

R / dataviz / maps

NOTE: After I posted this (like within 5 minutes) I found this post which also constructs bivariate chropleths in R.

IN THIS POST I WANT TO REVISIT SOME MAPS I MADE LAST YEAR. At that time, I was using Tableau to create choropleth maps, but in this post I want to reimagine the maps and make them in R.

Last year in this post we looked at the relationship between population growth and the growth in housing units from 2010 to 2015. While the Census has released national level estimates for population, the county level estimates are not updated yet. However, you can get estimates for 2010 through 2015, which is what I used last year and will use again this year.

Bivariate choropleth

In last year’s post I wanted to compare population growth to expansion of the housing supply. I constructed a ratio of the growth in population from 2010 to 2015 to the growth in housing units from 2010 to 2015 by county. Then I plotted this ratio on a diverging color scheme.

Instead, we could use a bivariate color scale. If you are into reading, see an extended discussion here.

Or you could do like me. Just jump in we’ll and figure it out as we go. Here goes!

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.1

## -- Attaching packages -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.0     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'purrr' was built under R version 3.5.1

## Warning: package 'stringr' was built under R version 3.5.1

## -- Conflicts ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(viridis)

## Loading required package: viridisLite

library(readxl)
library(stringr)
d<-expand.grid(x=1:100,y=1:100)

ggplot(d, aes(x,y,fill=atan(y/x),alpha=x+y))+
  geom_tile()+
  scale_fill_viridis()+
  theme(legend.position="none",
        panel.background=element_blank())+
  labs(title="A bivariate color scheme (Viridis)")

We could allow this scheme to vary smoothly, or it might be easier to see what’s going on, but using a discrete scale. A 3x3 grid seems to work pretty well.

# Add some labels
d<-expand.grid(x=1:3,y=1:3)
#dlabel<-data.frame(x=1:3,xlabel=c("X low", "X middle","X High"))
d<-merge(d,data.frame(x=1:3,xlabel=c("X low", "X middle","X high")),by="x")
d<-merge(d,data.frame(y=1:3,ylabel=c("Y low", "Y middle","Y high")),by="y")

g.legend<-
  ggplot(d, aes(x,y,fill=atan(y/x),alpha=x+y,label=paste0(xlabel,"\n",ylabel)))+
  geom_tile()+
  geom_text(alpha=1)+
  scale_fill_viridis()+
  theme_void()+
  theme(legend.position="none",
        panel.background=element_blank(),
        plot.margin=margin(t=10,b=10,l=10))+
  labs(title="A bivariate color scheme (Viridis)",x="X",y="Y")+
    theme(axis.title=element_text(color="black"))+
  # Draw some arrows:
  geom_segment(aes(x=1, xend = 3 , y=0, yend = 0), size=1.5,
               arrow = arrow(length = unit(0.6,"cm"))) +
  geom_segment(aes(x=0, xend = 0 , y=1, yend = 3), size=1.5,
               arrow = arrow(length = unit(0.6,"cm"))) 
g.legend

Get some data

I happened to have the data in a spreadsheet called pophous.xlsx. You can get estimates direct from the U.S. Cesnsus Bureau (here for housing units and here for population).

I rearranged my data in a spreadsheet like this:

data img

This was of course, before I wised up about readxl. But since I had these data around, we’ll just start here.

# Load data and check

dfh<-read_excel("data/pophous.xlsx", sheet = "housing1")
dfp<-read_excel("data/pophous.xlsx", sheet = "pop1")
df<-left_join(dfh,dfp,by="fips")
df <- df %>% select(-h2010apr,-h2010base,-p2010apr,-p2010base)


htmlTable::htmlTable(rbind(head(df %>% 
                                  map_if(is_numeric,round,0) %>% 
                                  data.frame() %>% as.tbl())))

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

	fips	h2010	h2011	h2012	h2013	h2014	h2015	p2010	p2011	p2012	p2013	p2014	p2015
1	1001	22152	22284	22352	22670	22751	22847	54660	55253	55175	55038	55290	55347
2	1003	104248	104701	105264	106227	107368	108564	183193	186659	190396	195126	199713	203709
3	1005	11827	11808	11841	11816	11799	11789	27341	27226	27159	26973	26815	26489
4	1007	8982	8972	8972	8972	8977	8986	22861	22733	22642	22512	22549	22583
5	1009	23887	23875	23862	23841	23826	23817	57373	57711	57776	57734	57658	57673
6	1011	4492	4483	4477	4468	4461	4456	10887	10629	10606	10628	10829	10696

After clearning our data are arranged with state/county FIPS numbers and then variables p2010-p2015 and h2010-h2015 corresponding to population estimates for each year 2010-2015 and housing units for each year 2010-2015 respectively.

The FIPS numbers are stored as numbers (thanks Excel!) and we’d like to rearrange the data, so let’s do some tidying.

# gather up columns by fips
df2<-df %>% gather(var,value,-fips)

# Create a type varialbe (h for housing, p for population) & year variable
df2<-df2 %>% mutate( type=substr(var,1,1), year=as.numeric(substr(var,2,5)))

# spread back out housing units and population
df3 <- df2 %>% select(fips,type,year,value) %>% spread(type,value)

# turn fips into string by padding with a 0
df3<-mutate(df3, fips=str_pad(fips, 5, pad = "0"))

# comput housing units and population in 2010
df3.sums<-df3 %>% group_by(fips) %>%  summarize(h2010= h[sum(year==2010)],
                                           p2010= p[sum(year==2010)])
df3<-left_join(df3,df3.sums,by="fips")

# create variables for housing unites (hr) and population (pr) relative to 2010
# also create ratio 
df3 <- df3 %>% mutate(hr=h/h2010, pr=p/p2010, ratio=(h/h2010)/(p/p2010))

# Check it out
htmlTable::htmlTable(rbind(head(df3 %>% 
                                  map_if(is_numeric,round,0) %>% 
                                  data.frame() %>% as.tbl())))

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

	fips	year	h	p	h2010	p2010	hr	pr	ratio
1	01001	2010	22152	54660	22152	54660	1	1	1
2	01001	2011	22284	55253	22152	54660	1	1	1
3	01001	2012	22352	55175	22152	54660	1	1	1
4	01001	2013	22670	55038	22152	54660	1	1	1
5	01001	2014	22751	55290	22152	54660	1	1	1
6	01001	2015	22847	55347	22152	54660	1	1	1

Now things are looking a lot better.

Let’s filter to just 2015 and get ready to make our map.

First, for housing units and population (both relative to 2010) we’ll compute terciles (divide the range into 3 parts each with 1/3 of observations). We’ll then categorize each observation as falling within one of these terciles. Then, we’ll let these terciles be ordered 1 through 3 and construct a scatterplot.

d2015<-filter(df3,year==2015)

# compute quantiles (dividing into 3rds) 
h.v<-quantile(d2015$hr,c(0.33,0.66,1))
p.v<-quantile(d2015$pr,c(0.33,0.66,1))

d2015<- d2015 %>% mutate(y= ifelse(pr<p.v[1],1,ifelse(pr<p.v[2],2,3)) ,
                     x= ifelse(hr<h.v[1],1,ifelse(hr<h.v[2],2,3))  )  

ggplot(data=d2015,aes(x=hr,y=pr,color=atan(y/x),alpha=x+y))+
  geom_point(size=1)+  guides(alpha=F,color=F)+
  geom_hline(yintercept=p.v,color="gray20",linetype=2)+
  geom_vline(xintercept=h.v,color="gray20",linetype=2)+
  scale_color_viridis(name="Color scale")+theme_minimal()+
  theme(plot.caption=element_text(size = 9, hjust=0),
        panel.grid=element_blank()) +
  
  labs(x="Housing units in 2015 relative to 2010 (log scale)",
       y="Population in 2015 relative to 2010 (log scale)",
       caption="@lenkiefer Source: U.S. Census Bureau\nEach dot one county, lines divide (univariate) terciles")+
  # limit the rang e
  scale_x_log10(limits=c(0.95,1.05), breaks=c(h.v),
                labels=round(c(h.v),2)) +
  scale_y_log10(limits=c(0.95,1.05),breaks=c(p.v),
                labels=round(c(p.v),2))

## Warning: Removed 646 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_hline).

## Warning: Removed 1 rows containing missing values (geom_vline).

Now that we’ve divided up the observations, let’s map them!

#################################################################################
### Map Libraries
library(ggthemes)
library(ggalt)
library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:purrr':
## 
##     map

library(rgeos)

## rgeos version: 0.3-28, (SVN revision 572)
##  GEOS runtime version: 3.6.1-CAPI-1.10.1 r0 
##  Linking to sp version: 1.3-1 
##  Polygon checking: TRUE

library(maptools)

## Loading required package: sp

## Checking rgeos availability: TRUE

library(albersusa)
library(grid)
#################################################################################

#################################################################################
# Let's load some maps:
states<-usa_composite()  #create a state map thing
smap<-fortify(states,region="fips_state")
counties <- counties_composite()   #create a county map thing
#################################################################################
#add on summary stats by county using FIPS code
counties@data <- left_join(counties@data, d2015, by = "fips")   
#################################################################################
cmap <- fortify(counties_composite(), region="fips")
#create state and county FIPS codes 
cmap$state<-substr(cmap$id,1,2)  
cmap$county<-substr(cmap$id,3,5)
cmap$fips<-paste0(cmap$state,cmap$county)

#### Make a legend
g.legend<-
  ggplot(d, aes(x,y,fill=atan(y/x),alpha=x+y,label=paste0(xlabel,"\n",ylabel)))+
  geom_tile()+
  scale_fill_viridis()+
  theme_void()+
  theme(legend.position="none",
        axis.title=element_text(size=5),
        panel.background=element_blank(),
        plot.margin=margin(t=10,b=10,l=10))+
  theme(axis.title=element_text(color="black"))+ 
  labs(x="Housing unit growth",
       y="Population growth")


gmap<-
ggplot() +
  geom_map(data =cmap, map = cmap,
           aes(x = long, y = lat, map_id = id),
           color = "#2b2b2b", size = 0.05, fill = NA) +
  geom_map(data = filter(counties@data,year==2015), map = cmap,
           aes(fill =atan(y/x),alpha=x+y, map_id = fips),
           color = "gray50") +
  #add black state borders (just to see if things are working)
  geom_map(data = smap, map = smap,
           aes(x = long, y = lat, map_id = id),
           color = "black", size = .5, fill = NA) +
  theme_map(base_size = 12) +
  theme(plot.title=element_text(size = 16, face="bold",margin=margin(b=10))) +
  theme(plot.subtitle=element_text(size = 14, margin=margin(b=-20))) +
  theme(plot.caption=element_text(size = 9, margin=margin(t=-15),hjust=0)) +
 # scale_fill_gradient(low="red",high="blue")
  scale_fill_viridis()+guides(alpha=F,fill=F)+
  labs(caption="@lenkiefer Source: U.S. Census Bureau",
       title="Housing units and Population growth 2010-2015",
       subtitle="Bivariate choropleth")
  
vp<-viewport(width=0.24,height=0.24,x=0.58,y=0.14)

print(gmap)
print(g.legend+labs(title=""),vp=vp)

So what?

Well, that was pretty neat.

I think you probably should be careful about interpreting these maps, particularly when you have some counties that might not be estimated as well as others. Still, the ability to display two variables on one map might be handy sometime.

In the past I compared employment and house prices and a bivariate choropleth might be a good option for comparing how these variables evolve. I’ve also go some research going looking at some other aspects of the housing market where such a plot might at least be useful at the data exploration stage.

How could it work for you?

Bivariate choropleth maps with R

Bivariate choropleth

Get some data

So what?

Share!