# What's up? VSUP, that's what's up.

## Value-Suppressing Uncertainty Palettes allocate a larger range of a visual channel when uncertainty is low, and smaller ranges when uncertainty is high. Let us make one with R.

IN THIS POST WE SHALL EXPLORE VALUE-SUPRESSING UNCERTAINTY PALETTES.

One of my favorite new sites is xenographics that gives examples of and links to “weird, but (sometimes) useful charts”. The examples xenographics gives are undoubtedly interesting and might help inspire you if you’re looking for something new.

One new (to me) graphic was something called Value-Suppressing Uncertainty Palettes (VSUP). See this research paper (pdf). VSUPs “allocate larger ranges of a visual channel when uncertainty is low, and smaller ranges when uncertainty is high”. As we mentioned in an earlier post, much of the economic and housing data I track is highly uncertain. Perhaps VSUPs can be useful.

Per usual, let’s make a graph with R.

# Value-Suppressing Uncertainty Palette

Let’s build a VSUP and take a look. We can construct a visual describing the VSUP using ggplot2 and ggforce. The ggforce package is nice because it helps us avoid polar coordinates.

library(tidyverse)
library(viridis)
library(ggforce)
library(ggrepel)
library(scales)
library(cowplot)

#____________________________________________________________________________________
## Draw Legend ----
#____________________________________________________________________________________

# Function to draw legend
d.legend <- function(){

nlevels=4
X=seq(-0.5/pi,0.5/pi,length.out=8)
Y=1:4
df<-expand.grid(x=X,y=Y)

X0 <- seq(-1.75,1.75,.5)/pi
X1 <- X0[c(TRUE,FALSE)]
X2 <- X1[c(TRUE,FALSE)]
X3 <- X2[c(TRUE,FALSE)]

quant.X<- function (x, level=5, X=X0){
X1 <- X[c(TRUE,FALSE)]
X2 <- X1[c(TRUE,FALSE)]
X3 <- X2[c(TRUE,FALSE)]
case_when(
level== 5 ~ X[findInterval(x,X)],
level== 4 ~ X1[findInterval(x,X1)],
level== 3 ~ X2[findInterval(x,X2)],
level== 2 ~ X3[findInterval(x,X3)],
TRUE ~ median(X0)

)
}

d<-expand.grid(x=X0, y=1:5)

ggplot(data=dplyr::filter(d,y>1))+
geom_arc_bar(aes(x0=0,
y0=0,
r0=y-2,
r=y-1,
start=x-0.25/pi,
end=x+0.25/pi,
fill=quant.X(x,y),
alpha=y),color=NA)+
scale_fill_viridis(limits=c(min(X0),max(X0)), option="C", direction=-1)+
theme_void()+
theme(legend.position="none", plot.subtitle=element_text(hjust=0.5, face="italic"))+guides(alpha=F)+
geom_segment(x= 0.1, xend= 2.5, yend= 3.1, y=0,
arrow = arrow(length = unit(0.03, "npc"), ends="first"))+
geom_segment(xend= -2.5, x= 2.5, yend= 4.05, y=4.05,
arrow = arrow(length = unit(0.03, "npc"), ends="first"))+
annotate(geom="text", x= 0.1, y=4.3, label=" Value increases" )+
annotate(geom="text", x= 1.7, y=1.5, label=" Uncertainty increases" ,angle=45 )+
labs( title="Value-Supressing Uncertainty Scale",
subtitle="Darker values indicate higher values, higher transparency indicates higher uncertainty")

}

# Call Function
d.legend()

# Using VSUP

The VSUP can be useful if we want to plot some data that has some measure of uncertainty attached to it. Let’s use the VSUP to plot median house values by county estimated in the U.S. Census Bureau’s American Community Survey (ACS). We can easily access the ACS data with the acs package. Note for this to work for you you’ll need to get an API key from the U.S. Census Bureau.

I want to look at the median value of owner-occupied housing units. Estimates are reported in table B25077. The ACS also gives us an estimate of the standard error/ margin of error for the estimates.

## State data

Let’s start by looking at state data.

#install your key
api.key.install(key="YOUR_API_KEY")

# get state data
geo2 <- geo.make(state='*')
# get state estimates
hv <- acs.fetch(endyear=2016,         # get data estimates ending in 2016
span=1,               # for states we can use 1-year ACS estimates
geography=geo2,       # state geography
table.number="B25077" # housing values
)
# convert to data frame

df.hv <- data.frame(NAME=hv@geography$NAME, value=unname(hv@estimate), sd=unname(hv@standard.error)) %>% filter(! NAME %in% c("Puerto Rico","Alaska","Hawaii" )) # keep only lower 48 states # Print the data knitr::kable(df.hv %>% select(NAME, value, sd) %>% mutate(value=dollar(round(value,0)), sd=dollar(round(sd,0)) ), col.names=c("State", "Median Value ($)", "Standard Error")
)
State Median Value ($) Standard Error Alabama$136,200 $1,000 Arizona$205,900 $1,187 Arkansas$123,300 $1,058 California$477,500 $1,011 Colorado$314,200 $1,433 Connecticut$274,600 $1,412 Delaware$243,400 $2,274 District of Columbia$576,100 $10,779 Florida$197,700 $801 Georgia$166,800 $677 Idaho$189,400 $1,858 Illinois$186,500 $861 Indiana$134,800 $602 Iowa$142,300 $925 Kansas$144,900 $1,134 Kentucky$135,600 $869 Louisiana$158,000 $876 Maine$184,700 $1,827 Maryland$306,900 $1,623 Massachusetts$366,900 $1,367 Michigan$147,100 $567 Minnesota$211,800 $813 Mississippi$113,900 $1,286 Missouri$151,400 $722 Montana$217,200 $2,348 Nebraska$148,100 $1,018 Nevada$239,500 $1,341 New Hampshire$251,100 $2,110 New Jersey$328,200 $1,332 New Mexico$167,500 $1,303 New York$302,400 $1,495 North Carolina$165,400 $633 North Dakota$184,100 $2,387 Ohio$140,100 $459 Oklahoma$132,200 $884 Oregon$287,100 $1,414 Pennsylvania$174,100 $447 Rhode Island$247,700 $2,194 South Carolina$153,900 $872 South Dakota$160,700 $1,588 Tennessee$157,700 $726 Texas$161,500 $484 Utah$250,300 $1,654 Vermont$223,700 $2,437 Virginia$264,000 $1,245 Washington$306,400 $1,562 West Virginia$117,900 $1,424 Wisconsin$173,200 $640 Wyoming$209,500 $3,716 How does the standard error of estimates vary with the estimate itself? s.lin<- ggplot(data=df.hv, aes(x=value, y=sd) )+ geom_point()+ geom_text_repel( aes(label=NAME))+ scale_x_log10(limits=c(1e5,6e5),breaks=c(1e5, 2e5, 4e5, 6e5), labels=dollar)+scale_y_log10(limits=c(100,12000),labels=dollar)+ guides(alpha=F,color=F)+ labs(x="Median House Value (log scale, owner-occupied housing units)", y="Standard Errror of estimates (log scale)", caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")+ theme(plot.caption=element_text(hjust=0)) s.lin How about if we scale the value estimates by the estimated standard deviation? s.lin<- ggplot(data=df.hv, aes(x=value, y=value/sd) )+ geom_point()+ geom_text_repel( aes(label=NAME))+ # scale_color_viridis(option="C",discrete=F,direction=-1)+ scale_x_log10(limits=c(1e5,6e5),breaks=c(1e5, 2e5, 4e5, 6e5), labels=dollar)+scale_y_log10(limits=c(100,1000))+ guides(alpha=F,color=F)+ labs(x="Median House Value (log scale, owner-occupied housing units)", y="Ratio: Housing Value/Standard Errror of estimates\n(log scale)", caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")+ theme(plot.caption=element_text(hjust=0)) s.lin ## Warning: Removed 7 rows containing missing values (geom_point). ## Warning: Removed 7 rows containing missing values (geom_text_repel). Let’s apply the VSUP scale to this plot. To do so we need to assign each state to a value based on the uncertaninty associated with the estimate. (I may come back and explain in more detail what’s going on with this code, but for now let me just put it down.) #____________________________________________________________________________________ ## Assign Uncertainty category ---- #____________________________________________________________________________________ df.hv <- mutate(df.hv, y2=findInterval(value/sd,quantile(value/sd, c(0,0.25,0.5,0.75,.9)))) X0b<- quantile(df.hv$value,probs=seq(0,0.9,length.out=8))
# Function to quantize the value
quant.X2<- function (x=0, level=5, Z=X0b){
X1 <- Z[c(TRUE,FALSE)]
X2 <- X1[c(TRUE,FALSE)]
X3 <- X2[c(TRUE,FALSE)]
case_when(
level== 5 ~ Z[findInterval(x,Z)],
level== 4 ~ X1[findInterval(x,X1)],
level== 3 ~ X2[findInterval(x,X2)],
level== 2 ~ X3[findInterval(x,X3)],
TRUE ~ median(Z)
)
}

df.hv <- mutate(df.hv, z=quant.X2(value,y2))

s.vsup <-
ggplot(data=df.hv, aes(x=value, y=value/sd, color=z, alpha=y2))+
geom_point()+
ggrepel::geom_text_repel(data=. %>% filter(y2==5), aes(label=NAME))+
scale_color_viridis(option="C",discrete=F,direction=-1)+
#scale_x_log10()+scale_y_log10()+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)", title="Value Suppressing Uncertainty Palette",
y="Ratio: Housing Value/Standard Errror of estimates\n(log scale)",
caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)")+
theme(plot.caption=element_text(hjust=0))

s.lin<-
ggplot(data=df.hv, aes(x=value, y=value/sd,
color=quant.X2(value, level=5,
Z=quantile(df.hv$value,probs=seq(0,0.9,length.out=8)) )) )+ geom_point()+ ggrepel::geom_text_repel(data=. %>% filter(y2==5), aes(label=NAME))+ scale_color_viridis(option="C",discrete=F,direction=-1)+ #scale_x_log10()+scale_y_log10()+ guides(alpha=F,color=F)+ labs(x="Median House Value (log scale, owner-occupied housing units)", title="Standard Scale", y="Ratio: Housing Value/Standard Errror of estimates\n(log scale)", caption=" ")+ theme(plot.caption=element_text(hjust=0)) state.scatters <- plot_grid(s.lin,s.vsup,ncol=1) print(state.scatters) What we’ve done here is supressed the color for the observations with low ratios of estimated value to standard error. The utility of this approach might be easier to see with a map. ## Making maps In order to make the maps, we can use the tigris package. #____________________________________________________________________________________ ## Load Map and Filter ---- #____________________________________________________________________________________ us_geo2 <- states(cb=TRUE) us_geo48 <- us_geo2[! us_geo2@data$STUSPS %in% c("AK","PR","VI","HI","GU","MP","AS"), ]
us_geo48@data$id <- rownames(us_geo48@data) us_geo48f <- fortify(us_geo48) us_geo48f <- left_join(us_geo48f,us_geo48@data, by="id") df.map <- left_join(us_geo48f,df.hv, by="NAME") gmap.slin<- ggplot(data=df.map, aes(x=long,y=lat, map_id=id, group=group, fill=factor(quant.X2(value, level=5, Z=quantile(value,probs=seq(0,0.9,length.out=8)) ))) #use level==5 to force 8 categories )+ geom_polygon(color="white")+theme_map()+ theme(legend.position="none", plot.caption=element_text(hjust=1), plot.title=element_text(hjust=0))+ scale_fill_viridis(option="C",discrete=T,direction=-1, name= "Housing value")+ labs(x="Median House Value (log scale, owner-occupied housing units)", title="Standard Choropleth", subtitle="Housing Values (Binned 8 values)", y="Ratio: Housing Value/Standard Errror of estimates\n(log scale)", caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)") gmap.slin COmpare that to our VSUP plot: gmap.svsup<- ggplot(data=df.map , aes(x=long,y=lat, map_id=id, group=group, fill =z, alpha=factor(y2) ))+ geom_polygon(color="white")+theme_map()+ theme(legend.position="none", plot.caption=element_text(hjust=1), plot.title=element_text(hjust=0))+ scale_fill_viridis(option="C",discrete=F,direction=-1)+ labs(x="Median House Value (log scale, owner-occupied housing units)", title="Value Suppressing Uncertainty Palette", y="Ratio: Housing Value/Standard Errror of estimates\n(log scale)", caption="@lenkiefer Source: U.S. Census Bureau American Community Survey (1-year estimates)") ggdraw()+ draw_plot(gmap.svsup+theme(legend.position="none",panel.border=element_rect(color="black",fill=NA))+ labs(title="Median Home Values by County", subtitle="Value Suppressing Uncertainty Palette with R", caption="@lenkiefer Source 2016 American Community Survey \n")+ theme(plot.title=element_text(hjust=0)), 0, 0, 1, 1) + draw_plot(d.legend()+labs(title="",subtitle="")+theme_void()+theme(legend.position="none",panel.border=element_rect(color="black",fill=NA)) , 0.01, 0.01, 0.34, 0.35) ## Warning: Using alpha for a discrete variable is not advised. Now let’s drill down and look at county-level estimates. For this, we’ll need the 5-year ACS estimates. ##################################################################################### ## Draw County Map ## ##################################################################################### geoc <- geo.make(state='*', county="*") hvc <- acs.fetch(endyear=2016, span=5, # 5-year estimates geography=geoc, table.number="B25077") df.hvc <- data.frame(NAME=hvc@geography$NAME,
state=hvc@geography$state, county=hvc@geography$county,
value=unname(hvc@estimate),
sd=unname(hvc@standard.error)) %>%

us_geoc <- counties(cb=TRUE)
us_geoc$id <- rownames(us_geoc@data) us_geocf <- fortify(us_geoc) us_geocf <- left_join(us_geocf,us_geoc@data, by="id") # drop outside long(-125,67) & lat(24,49) us_geocf48<- filter(us_geocf, long > -125 & long < 67 & lat > 25 & lat < 50) dfc.map <- left_join(us_geocf48,df.hvc, by="GEOID") #____________________________________________________________________________________ ## Make data ---- #____________________________________________________________________________________ df.hvc <- mutate(df.hvc, y2=findInterval(value/sd,quantile(value/sd, c(0,0.25,0.5,0.75,.9)))) X0b<- quantile(df.hvc$value,probs=seq(0,0.9,length.out=8))
quant.X2<- function (x=0, level=5, Z=X0b){
X1 <- Z[c(TRUE,FALSE)]
X2 <- X1[c(TRUE,FALSE)]
X3 <- X2[c(TRUE,FALSE)]
case_when(
level== 5 ~ Z[findInterval(x,Z)],
level== 4 ~ X1[findInterval(x,X1)],
level== 3 ~ X2[findInterval(x,X2)],
level== 2 ~ X3[findInterval(x,X3)],
TRUE ~ median(Z)
)
}
df.hvc <- mutate(df.hvc, z=quant.X2(value,y2))

#____________________________________________________________________________________
## Create Scatters ----
#____________________________________________________________________________________

s.vsupc<-
ggplot(data=df.hvc, aes(x=value, y=value/sd, color=z, alpha=y2))+
geom_point()+
scale_color_viridis(option="C",discrete=F,direction=-1)+
scale_x_log10(limits=c(1e4,6e5),breaks=c(1e4,1e5, 2e5, 4e5, 6e5), labels=dollar)+scale_y_log10(limits=c(10,1000))+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)",
y="Ratio: Housing Value/Standard Errror of estimates\n(log scale)")

s.linc<-
ggplot(data=df.hvc, aes(x=value, y=value/sd,
color=quant.X2(value, level=5, Z=quantile(df.hvc\$value,probs=seq(0,0.9,length.out=8)))))+
geom_point()+
scale_color_viridis(option="C",discrete=F,direction=-1)+
scale_x_log10(limits=c(1e4,6e5),breaks=c(1e4,1e5, 2e5, 4e5, 6e5), labels=dollar)+scale_y_log10(limits=c(10,1000))+
guides(alpha=F,color=F)+
labs(x="Median House Value (log scale, owner-occupied housing units)",
y="Ratio: Housing Value/Standard Errror of estimates\n(log scale)")

plot_grid(s.linc+labs(title="Standard Scale"),
s.vsupc +labs(title="Value Suppressing Uncertainty Palette",
caption="@lenkiefer Source 2016 American Community Survey  \nMedian house value by county (5 year estimates)  "),
ncol=1, align="hv")
## Warning: Removed 135 rows containing missing values (geom_point).

## Warning: Removed 135 rows containing missing values (geom_point).

#____________________________________________________________________________________
## Create Maps ----
#____________________________________________________________________________________
dfc.map <- left_join(us_geocf48,df.hvc, by="GEOID")
gmap.cvsup <-
ggplot(data=dfc.map ,
aes(x=long,y=lat, map_id=id, group=group,
fill =z, alpha=y2
))+
geom_polygon(color="white")+theme_map()+
scale_fill_viridis(option="C",discrete=F,direction=-1)+
theme(legend.position="none")

gmap.clin<-
ggplot(data=dfc.map ,
aes(x=long,y=lat, map_id=id, group=group,
fill =z#, alpha=factor(y2)
))+
geom_polygon(color="white")+theme_map()+
scale_fill_viridis(option="C",discrete=F,direction=-1)+
theme(legend.position="none")

#____________________________________________________________________________________
## Put it all together ----
#____________________________________________________________________________________

plot_grid(plot_grid(s.linc+labs(title="Standard Choropleth"),
s.vsupc+labs(title="Value Suppressing Uncertainty Palette"),
ncol=2),
plot_grid(gmap.clin+ theme(legend.position = "none")+ labs(caption=""),
ggdraw()+draw_plot(gmap.cvsup+ theme(legend.position = "none")+ labs(caption="@lenkiefer Source 2016 American Community Survey  \nMedian house value by county (5 year estimates)  "))+
draw_plot(d.legend()+labs(title="",subtitle="")+theme_void()+theme(legend.position="none",panel.border=element_rect(color="black",fill=NA)) , 0.01, 0.01, 0.35, 0.375),
ncol=2),
ncol=1)
## Warning: Removed 135 rows containing missing values (geom_point).

## Warning: Removed 135 rows containing missing values (geom_point).

There’s a lot more we could do with this, but that’s enough for one post.