IN THIS POST WE SHALL EXPLORE VALUE-SUPPRESSING UNCERTAINTY PALETTES.
One of my favorite new sites is xenographics, which collects examples of, and links to, “weird, but (sometimes) useful charts”. The examples xenographics gives are undoubtedly interesting and might help inspire you if you’re looking for something new.
One new (to me) graphic was something called Value-Suppressing Uncertainty Palettes (VSUP). See this research paper (pdf). VSUPs “allocate larger ranges of a visual channel when uncertainty is low, and smaller ranges when uncertainty is high”. As we mentioned in an earlier post, much of the economic and housing data I track is highly uncertain. Perhaps VSUPs can be useful.
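Before we build the full graphic, here’s a minimal sketch of the idea (my own illustrative function, not the exact scheme from the paper): as uncertainty rises, the number of distinguishable value bins collapses.
# Minimal sketch of the VSUP idea (illustrative only):
# fewer distinguishable value bins as uncertainty grows
vsup_bin <- function(value, uncertainty, lo = 0, hi = 1){
  n_bins <- c(8, 4, 2, 1)[uncertainty]  # 1 = most certain ... 4 = least certain
  breaks <- seq(lo, hi, length.out = n_bins + 1)
  findInterval(value, breaks, rightmost.closed = TRUE)
}
vsup_bin(0.9, uncertainty = 1)  # lands in one of 8 bins
vsup_bin(0.9, uncertainty = 4)  # collapses into a single bin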
Per usual, let’s make a graph with R.
Value-Suppressing Uncertainty Palette
Let’s build a VSUP and take a look. We can construct a visual describing the VSUP using ggplot2 and ggforce. The ggforce package is nice because it helps us avoid polar coordinates.
library(tidyverse)
library(viridis)
library(ggforce)   # for geom_arc_bar
library(ggrepel)
library(scales)
library(cowplot)   # for plot_grid, ggdraw, theme_map
library(acs)       # for geo.make and acs.fetch below
library(tigris)    # for states() and counties() shapefiles
#____________________________________________________________________________________
## Draw Legend ----
#____________________________________________________________________________________
# Function to draw legend
d.legend <- function(){
  nlevels <- 4
  X <- seq(-0.5/pi, 0.5/pi, length.out = 8)
  Y <- 1:4
  df <- expand.grid(x = X, y = Y)
  # Breakpoints for the value channel; each level up keeps every other break
  X0 <- seq(-1.75, 1.75, 0.5)/pi
  X1 <- X0[c(TRUE, FALSE)]
  X2 <- X1[c(TRUE, FALSE)]
  X3 <- X2[c(TRUE, FALSE)]
  # Quantize x against coarser sets of breakpoints as the level drops
  quant.X <- function(x, level = 5, X = X0){
    X1 <- X[c(TRUE, FALSE)]
    X2 <- X1[c(TRUE, FALSE)]
    X3 <- X2[c(TRUE, FALSE)]
    case_when(
      level == 5 ~ X[findInterval(x, X)],
      level == 4 ~ X1[findInterval(x, X1)],
      level == 3 ~ X2[findInterval(x, X2)],
      level == 2 ~ X3[findInterval(x, X3)],
      TRUE       ~ median(X0)
    )
  }
  d <- expand.grid(x = X0, y = 1:5)
  ggplot(data = dplyr::filter(d, y > 1)) +
    geom_arc_bar(aes(x0 = 0,
                     y0 = 0,
                     r0 = y - 2,
                     r = y - 1,
                     start = x - 0.25/pi,
                     end = x + 0.25/pi,
                     fill = quant.X(x, y),
                     alpha = y), color = NA) +
    scale_fill_viridis(limits = c(min(X0), max(X0)), option = "C", direction = -1) +
    theme_void() +
    theme(legend.position = "none",
          plot.subtitle = element_text(hjust = 0.5, face = "italic")) +
    guides(alpha = FALSE) +
    geom_segment(x = 0.1, xend = 2.5, yend = 3.1, y = 0,
                 arrow = arrow(length = unit(0.03, "npc"), ends = "first")) +
    geom_segment(xend = -2.5, x = 2.5, yend = 4.05, y = 4.05,
                 arrow = arrow(length = unit(0.03, "npc"), ends = "first")) +
    annotate(geom = "text", x = 0.1, y = 4.3, label = " Value increases") +
    annotate(geom = "text", x = 1.7, y = 1.5, label = " Uncertainty increases", angle = 45) +
    labs(title = "Value-Suppressing Uncertainty Scale",
         subtitle = "Darker colors indicate higher values, higher transparency indicates higher uncertainty")
}
# Call Function
d.legend()
Using VSUP
The VSUP can be useful if we want to plot some data that has a measure of uncertainty attached to it. Let’s use the VSUP to plot median house values estimated in the U.S. Census Bureau’s American Community Survey (ACS), first by state and then by county. We can easily access the ACS data with the acs package. Note that for this to work you’ll need to get an API key from the U.S. Census Bureau.
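If you haven’t used the acs package before, setup is a one-time affair. A sketch (assuming the package is on CRAN and that you’ve signed up for a free key on the Census Bureau’s site):
# one-time setup for the acs package
install.packages("acs")
library(acs)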
I want to look at the median value of owner-occupied housing units. Estimates are reported in table B25077. The ACS also gives us the margin of error for each estimate, from which we can derive a standard error.
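One note on that margin of error: the ACS publishes margins of error at the 90 percent confidence level, so a standard error can be recovered by dividing by 1.645 (the 90 percent normal quantile). The acs package handles this for us in the standard.error slot we use below; for reference, the conversion is simply:
# ACS margins of error are published at the 90% confidence level;
# dividing by qnorm(0.95) ~ 1.645 converts an MOE to a standard error
moe_to_se <- function(moe) moe / qnorm(0.95)
moe_to_se(1645)  # approximately 1000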
State data
Let’s start by looking at state data.
#install your key
api.key.install(key="YOUR_API_KEY")
# get state data
geo2 <- geo.make(state='*')
# get state estimates
hv <- acs.fetch(endyear=2016, # get data estimates ending in 2016
span=1, # for states we can use 1-year ACS estimates
geography=geo2, # state geography
table.number="B25077" # housing values
)
# convert to data frame
df.hv <- data.frame(NAME=hv@geography$NAME,
value=unname(hv@estimate),
sd=unname(hv@standard.error)) %>%
filter(! NAME %in% c("Puerto Rico","Alaska","Hawaii" )) # keep only lower 48 states
# Print the data
knitr::kable(df.hv %>% select(NAME, value, sd) %>%
mutate(value=dollar(round(value,0)),
sd=dollar(round(sd,0))
),
col.names=c("State", "Median Value ($)", "Standard Error ($)")
)
State | Median Value ($) | Standard Error ($) |
---|---|---|
Alabama | $136,200 | $1,000 |
Arizona | $205,900 | $1,187 |
Arkansas | $123,300 | $1,058 |
California | $477,500 | $1,011 |
Colorado | $314,200 | $1,433 |
Connecticut | $274,600 | $1,412 |
Delaware | $243,400 | $2,274 |
District of Columbia | $576,100 | $10,779 |
Florida | $197,700 | $801 |
Georgia | $166,800 | $677 |
Idaho | $189,400 | $1,858 |
Illinois | $186,500 | $861 |
Indiana | $134,800 | $602 |
Iowa | $142,300 | $925 |
Kansas | $144,900 | $1,134 |
Kentucky | $135,600 | $869 |
Louisiana | $158,000 | $876 |
Maine | $184,700 | $1,827 |
Maryland | $306,900 | $1,623 |
Massachusetts | $366,900 | $1,367 |
Michigan | $147,100 | $567 |
Minnesota | $211,800 | $813 |
Mississippi | $113,900 | $1,286 |
Missouri | $151,400 | $722 |
Montana | $217,200 | $2,348 |
Nebraska | $148,100 | $1,018 |
Nevada | $239,500 | $1,341 |
New Hampshire | $251,100 | $2,110 |
New Jersey | $328,200 | $1,332 |
New Mexico | $167,500 | $1,303 |
New York | $302,400 | $1,495 |
North Carolina | $165,400 | $633 |
North Dakota | $184,100 | $2,387 |
Ohio | $140,100 | $459 |
Oklahoma | $132,200 | $884 |
Oregon | $287,100 | $1,414 |
Pennsylvania | $174,100 | $447 |
Rhode Island | $247,700 | $2,194 |
South Carolina | $153,900 | $872 |
South Dakota | $160,700 | $1,588 |
Tennessee | $157,700 | $726 |
Texas | $161,500 | $484 |
Utah | $250,300 | $1,654 |
Vermont | $223,700 | $2,437 |
Virginia | $264,000 | $1,245 |
Washington | $306,400 | $1,562 |
West Virginia | $117,900 | $1,424 |
Wisconsin | $173,200 | $640 |
Wyoming | $209,500 | $3,716 |
How does the standard error of estimates vary with the estimate itself?
s.lin<-
ggplot(data=df.hv, aes(x=value, y=sd) )+
geom_point()+
geom_text_repel( aes(label=NAME))+
scale_x_log10(limits=c(1e5,6e5), breaks=c(1e5, 2e5, 4e5, 6e5), labels=dollar)+
scale_y_log10(limits=c(100,12000), labels=dollar)+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)",
y="Standard Error of estimates (log scale)",
caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")+
theme(plot.caption=element_text(hjust=0))
s.lin
How about if we scale the value estimates by their standard errors? This ratio is a signal-to-noise measure: from the table above, Alabama’s is roughly 136200/1000 ≈ 136 while the District of Columbia’s is roughly 576100/10779 ≈ 53, so despite its higher median value the DC estimate is relatively noisier.
s.lin<-
ggplot(data=df.hv, aes(x=value, y=value/sd) )+
geom_point()+
geom_text_repel( aes(label=NAME))+
# scale_color_viridis(option="C",discrete=F,direction=-1)+
scale_x_log10(limits=c(1e5,6e5), breaks=c(1e5, 2e5, 4e5, 6e5), labels=dollar)+
scale_y_log10(limits=c(100,1000))+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)",
y="Ratio: Housing Value/Standard Error of estimates\n(log scale)",
caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")+
theme(plot.caption=element_text(hjust=0))
s.lin
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing missing values (geom_text_repel).
Let’s apply the VSUP scale to this plot. To do so we need to assign each state to a value based on the uncertainty associated with the estimate. (I may come back and explain in more detail what’s going on with this code, but for now let me just put it down.)
#____________________________________________________________________________________
## Assign Uncertainty category ----
#____________________________________________________________________________________
# Bin each state into 5 uncertainty levels (y2) from quantiles of value/sd;
# higher y2 means a more precisely estimated value
df.hv <- mutate(df.hv, y2=findInterval(value/sd,quantile(value/sd, c(0,0.25,0.5,0.75,.9))))
# Breakpoints for the value channel: 8 quantiles of the value distribution
X0b <- quantile(df.hv$value,probs=seq(0,0.9,length.out=8))
# Function to quantize the value
quant.X2 <- function (x=0, level=5, Z=X0b){
  X1 <- Z[c(TRUE,FALSE)]
  X2 <- X1[c(TRUE,FALSE)]
  X3 <- X2[c(TRUE,FALSE)]
  case_when(
    level == 5 ~ Z[findInterval(x,Z)],
    level == 4 ~ X1[findInterval(x,X1)],
    level == 3 ~ X2[findInterval(x,X2)],
    level == 2 ~ X3[findInterval(x,X3)],
    TRUE ~ median(Z)
  )
}
df.hv <- mutate(df.hv, z=quant.X2(value,y2))
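To see what quant.X2 is doing, here’s a quick check with hypothetical breakpoints (Zdemo is made up, not the ACS data): at level 5 a value snaps to the nearest of eight anchors below it, at lower levels to progressively fewer anchors, and at level 1 everything collapses to the median.
# illustrative check with hypothetical breakpoints
Zdemo <- c(100, 150, 200, 250, 300, 350, 400, 450)
quant.X2(320, level = 5, Z = Zdemo)  # 300: snapped to one of 8 anchors
quant.X2(320, level = 3, Z = Zdemo)  # 300: snapped to one of 2 anchors (100, 300)
quant.X2(320, level = 1, Z = Zdemo)  # 275: the median of Zdemo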
s.vsup <-
ggplot(data=df.hv, aes(x=value, y=value/sd, color=z, alpha=y2))+
geom_point()+
ggrepel::geom_text_repel(data=. %>% filter(y2==5), aes(label=NAME))+
scale_color_viridis(option="C",discrete=F,direction=-1)+
#scale_x_log10()+scale_y_log10()+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)", title="Value Suppressing Uncertainty Palette",
y="Ratio: Housing Value/Standard Error of estimates\n(log scale)",
caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")+
theme(plot.caption=element_text(hjust=0))
s.lin<-
ggplot(data=df.hv, aes(x=value, y=value/sd,
color=quant.X2(value, level=5,
Z=quantile(df.hv$value,probs=seq(0,0.9,length.out=8))
)) )+
geom_point()+
ggrepel::geom_text_repel(data=. %>% filter(y2==5), aes(label=NAME))+
scale_color_viridis(option="C",discrete=F,direction=-1)+
#scale_x_log10()+scale_y_log10()+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)", title="Standard Scale",
y="Ratio: Housing Value/Standard Error of estimates\n(log scale)",
caption=" ")+
theme(plot.caption=element_text(hjust=0))
state.scatters <- plot_grid(s.lin,s.vsup,ncol=1)
print(state.scatters)
What we’ve done here is suppress the color for observations with low ratios of estimated value to standard error. The utility of this approach might be easier to see with a map.
Making maps
In order to make the maps, we can use the tigris package.
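One caveat: this post uses the older sp-based workflow (fortify() on SpatialPolygonsDataFrame objects). If your version of tigris returns sf objects by default, I believe you can switch back before fetching the shapes:
# ask tigris for sp-class objects so the fortify() workflow below works as written
options(tigris_class = "sp")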
#____________________________________________________________________________________
## Load Map and Filter ----
#____________________________________________________________________________________
us_geo2 <- states(cb=TRUE)  # cartographic boundary shapefiles
# keep only the lower 48 states (plus DC)
us_geo48 <- us_geo2[! us_geo2@data$STUSPS %in% c("AK","PR","VI","HI","GU","MP","AS"), ]
us_geo48@data$id <- rownames(us_geo48@data)  # id for joining attributes back after fortify
us_geo48f <- fortify(us_geo48)               # convert polygons to a plottable data frame
us_geo48f <- left_join(us_geo48f, us_geo48@data, by="id")
df.map <- left_join(us_geo48f,df.hv, by="NAME")
gmap.slin<-
ggplot(data=df.map,
aes(x=long,y=lat, map_id=id, group=group,
fill=factor(quant.X2(value, level=5,
Z=quantile(value,probs=seq(0,0.9,length.out=8))
))) #use level==5 to force 8 categories
)+
geom_polygon(color="white")+theme_map()+
theme(legend.position="none",
plot.caption=element_text(hjust=1),
plot.title=element_text(hjust=0))+
scale_fill_viridis(option="C",discrete=T,direction=-1, name= "Housing value")+
labs(x="Median House Value (log scale, owner-occupied housing units)",
title="Standard Choropleth",
subtitle="Housing Values (Binned 8 values)",
y="Ratio: Housing Value/Standard Error of estimates\n(log scale)",
caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")
gmap.slin
Compare that to our VSUP plot:
gmap.svsup<-
ggplot(data=df.map ,
aes(x=long,y=lat, map_id=id, group=group,
fill =z, alpha=factor(y2)
))+
geom_polygon(color="white")+theme_map()+
theme(legend.position="none",
plot.caption=element_text(hjust=1),
plot.title=element_text(hjust=0))+
scale_fill_viridis(option="C",discrete=F,direction=-1)+
labs(x="Median House Value (log scale, owner-occupied housing units)", title="Value Suppressing Uncertainty Palette",
y="Ratio: Housing Value/Standard Error of estimates\n(log scale)",
caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")
ggdraw()+
draw_plot(gmap.svsup+theme(legend.position="none",panel.border=element_rect(color="black",fill=NA))+
labs(title="Median Home Values by State",
subtitle="Value Suppressing Uncertainty Palette with R",
caption="@lenkiefer Source 2016 American Community Survey \n")+
theme(plot.title=element_text(hjust=0)), 0, 0, 1, 1) +
draw_plot(d.legend()+labs(title="",subtitle="")+theme_void()+theme(legend.position="none",panel.border=element_rect(color="black",fill=NA)) , 0.01, 0.01, 0.34, 0.35)
## Warning: Using alpha for a discrete variable is not advised.
Now let’s drill down and look at county-level estimates. For these we’ll need the 5-year ACS estimates, since the 1-year estimates are only published for areas with populations of 65,000 or more.
#____________________________________________________________________________________
## Get County Data ----
#____________________________________________________________________________________
geoc <- geo.make(state='*', county="*")
hvc <- acs.fetch(endyear=2016,
span=5, # 5-year estimates
geography=geoc, table.number="B25077")
df.hvc <- data.frame(NAME=hvc@geography$NAME,
                     state=hvc@geography$state,
                     county=hvc@geography$county,
                     value=unname(hvc@estimate),
                     sd=unname(hvc@standard.error)) %>%
  # build a 5-digit GEOID (2-digit state + 3-digit county, e.g. "01001")
  # to match the tigris GEOID; note both parts need zero-padding
  mutate(GEOID= paste0(str_pad(state, width=2, side="left", pad="0"),
                       str_pad(county, width=3, side="left", pad="0"))) %>%
  filter(value > 0 & sd > 0)
us_geoc <- counties(cb=TRUE)
us_geoc$id <- rownames(us_geoc@data)
us_geocf <- fortify(us_geoc)
us_geocf <- left_join(us_geocf,us_geoc@data, by="id")
# keep only points inside a lower-48 bounding box (drops AK, HI, territories)
us_geocf48 <- filter(us_geocf, long > -125 & long < 67 & lat > 25 & lat < 50)
dfc.map <- left_join(us_geocf48,df.hvc, by="GEOID")
#____________________________________________________________________________________
## Make data ----
#____________________________________________________________________________________
df.hvc <- mutate(df.hvc, y2=findInterval(value/sd,quantile(value/sd, c(0,0.25,0.5,0.75,.9))))
X0b<- quantile(df.hvc$value,probs=seq(0,0.9,length.out=8))
# same quantizer as before; the default breaks Z=X0b are now the county-level quantiles
quant.X2 <- function (x=0, level=5, Z=X0b){
  X1 <- Z[c(TRUE,FALSE)]
  X2 <- X1[c(TRUE,FALSE)]
  X3 <- X2[c(TRUE,FALSE)]
  case_when(
    level == 5 ~ Z[findInterval(x,Z)],
    level == 4 ~ X1[findInterval(x,X1)],
    level == 3 ~ X2[findInterval(x,X2)],
    level == 2 ~ X3[findInterval(x,X3)],
    TRUE ~ median(Z)
  )
}
df.hvc <- mutate(df.hvc, z=quant.X2(value,y2))
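As a quick sanity check (illustrative; your counts will depend on the data vintage), we can tabulate how many counties land in each uncertainty level:
# how many counties fall in each uncertainty bin?
df.hvc %>% count(y2)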
#____________________________________________________________________________________
## Create Scatters ----
#____________________________________________________________________________________
s.vsupc<-
ggplot(data=df.hvc, aes(x=value, y=value/sd, color=z, alpha=y2))+
geom_point()+
scale_color_viridis(option="C",discrete=F,direction=-1)+
scale_x_log10(limits=c(1e4,6e5), breaks=c(1e4, 1e5, 2e5, 4e5, 6e5), labels=dollar)+
scale_y_log10(limits=c(10,1000))+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)",
y="Ratio: Housing Value/Standard Error of estimates\n(log scale)")
s.linc<-
ggplot(data=df.hvc, aes(x=value, y=value/sd,
color=quant.X2(value, level=5, Z=quantile(df.hvc$value,probs=seq(0,0.9,length.out=8)))))+
geom_point()+
scale_color_viridis(option="C",discrete=F,direction=-1)+
scale_x_log10(limits=c(1e4,6e5), breaks=c(1e4, 1e5, 2e5, 4e5, 6e5), labels=dollar)+
scale_y_log10(limits=c(10,1000))+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)",
y="Ratio: Housing Value/Standard Error of estimates\n(log scale)")
plot_grid(s.linc+labs(title="Standard Scale"),
s.vsupc +labs(title="Value Suppressing Uncertainty Palette",
caption="@lenkiefer Source 2016 American Community Survey \nMedian house value by county (5 year estimates) "),
ncol=1, align="hv")
## Warning: Removed 135 rows containing missing values (geom_point).
## Warning: Removed 135 rows containing missing values (geom_point).
#____________________________________________________________________________________
## Create Maps ----
#____________________________________________________________________________________
dfc.map <- left_join(us_geocf48,df.hvc, by="GEOID")
gmap.cvsup <-
ggplot(data=dfc.map ,
aes(x=long,y=lat, map_id=id, group=group,
fill =z, alpha=y2
))+
geom_polygon(color="white")+theme_map()+
scale_fill_viridis(option="C",discrete=F,direction=-1)+
theme(legend.position="none")
gmap.clin<-
ggplot(data=dfc.map ,
aes(x=long,y=lat, map_id=id, group=group,
fill =z#, alpha=factor(y2)
))+
geom_polygon(color="white")+theme_map()+
scale_fill_viridis(option="C",discrete=F,direction=-1)+
theme(legend.position="none")
#____________________________________________________________________________________
## Put it all together ----
#____________________________________________________________________________________
plot_grid(plot_grid(s.linc+labs(title="Standard Choropleth"),
s.vsupc+labs(title="Value Suppressing Uncertainty Palette"),
ncol=2),
plot_grid(gmap.clin+ theme(legend.position = "none")+ labs(caption=""),
ggdraw()+draw_plot(gmap.cvsup+ theme(legend.position = "none")+ labs(caption="@lenkiefer Source 2016 American Community Survey \nMedian house value by county (5 year estimates) "))+
draw_plot(d.legend()+labs(title="",subtitle="")+theme_void()+theme(legend.position="none",panel.border=element_rect(color="black",fill=NA)) , 0.01, 0.01, 0.35, 0.375),
ncol=2),
ncol=1)
## Warning: Removed 135 rows containing missing values (geom_point).
## Warning: Removed 135 rows containing missing values (geom_point).
There’s a lot more we could do with this, but that’s enough for one post.