HOUSING DATA ARE OFTEN MEASURED WITH CONSIDERABLE uncertainty. Estimates are usually based on small samples that are subject to sampling variability. The various government statistical agencies usually report estimates of uncertainty with their releases. For example, both the New Residential Construction and New Residential Sales reports include estimates of sampling uncertainty along with their point estimates.

In this post I want to explore ways to visualize sampling uncertainty with R. I am reminded of article from a the New York Times Upshot blog a few years ago.

Data

For data, let’s go ahead and use New Home Sales estimates from the U.S. Census Bureau and U.S. Department of Housing and Urban Development. The Census provides a nice .csv file you can download here. The spreadsheet includes estimates of sampling uncertainty.

If you go to this link you can get a zip file that contains the data we’ll use. If you open the .csv file in Excel, you will find the data actually begins on row 705 (as of April 26, 2017, it will move over time). Let’s proceed you’ve unzipped the .csv file and saved it somewhere as RESSALES-mf.csv.

Note that this file is laid out much the same as the housing starts data we used last week.

##################################################################################
# Load libraries
##################################################################################
library("animation")
library("ggplot2")
library("scales")
library('ggthemes')
library(viridis)
## Loading required package: viridisLite
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## col_factor(): readr, scales
## discard():    purrr, scales
## filter():     dplyr, stats
## lag():        dplyr, stats
library(readxl)
library(ggbeeswarm)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
##################################################################################
# Load Data
##################################################################################

df.sales<-read.csv("data/RESSALES-mf.csv",skip=704)

##################################################################################
# The following information comes straight from the .csv file
# and describes the keys in the data file
##################################################################################

##################################################################################
# CATEGORIES
# cat_idx   cat_code    cat_desc    cat_indent
# 1 SOLD    New Single-family Houses Sold   0
# 2 ASOLD   Annual Rate for New Single-family Houses Sold   0
# 3 FORSALE New Single-family Houses For Sale   0

##################################################################################

##################################################################################
# DATA TYPES            
# dt_idx    dt_code dt_desc dt_unit
# 1 TOTAL   All Houses  K
# 2 NOTSTD  Houses that are Not Started K
# 3 UNDERC  Houses that are Under Construction  K
# 4 COMPED  Houses that are Completed   K
# 5 MEDIAN  Median Sales Price  DOL
# 6 AVERAG  Average Sales Price DOL
# 7 MONSUP  Months' Supply at Current Sales Rate    MO
# 8 MMTHS   Median Number of Months For Sale Since Completion   MO
##################################################################################

##################################################################################
# ERROR TYPES           
# et_idx    et_code et_desc et_unit
# 1 E_TOTAL Relative Standard Error for All Houses  PCT
# 2 E_NOTSTD    Relative Standard Error for Houses that are Not Started PCT
# 3 E_UNDERC    Relative Standard Error for Houses that are Under Construction  PCT
# 4 E_COMPED    Relative Standard Error for Houses that are Completed   PCT
# 5 E_MEDIAN    Relative Standard Error for Median Sales Price  PCT
# 6 E_AVERAG    Relative Standard Error for Average Sales Price PCT
# 7 E_MONSUP    Relative Standard Error for Months' Supply at Current Sales Rate    PCT
# 8 E_MMTHS Relative Standard Error for Median Number of Months For Sale Since Completion   PCT
##################################################################################

##################################################################################
# GEO LEVELS        
# geo_idx   geo_code    geo_desc
# 1 US  United States
# 2 NE  Northeast
# 3 MW  Midwest
# 4 SO  South
# 5 WE  West
##################################################################################

##################################################################################
# Dates are indexed one a month from 1963-01-01 to 2017-03-01
# e. g. 
# TIME PERIODS  
# per_idx   per_name
# 1 1/1/19563
# 2 2/1/1963
# ....
# 651 3/1/2017
##################################################################################


##################################################################################
# Construct a lookup table for dates
dt.lookup<- data.table(per_idx=seq(1,651),
                       date=seq.Date(as.Date("1963-01-01"),
                                     as.Date("2017-03-01"),by="month"))
##################################################################################

##################################################################################
# Append dataes
df.sales<-left_join(df.sales,dt.lookup,by="per_idx")
##################################################################################

##################################################################################
# print a table using the htmlTable library, round numeric to 0 digits for readability 
# Note we won't round in analysis)
##################################################################################

htmlTable::htmlTable(rbind(tail(df.sales %>% 
                                  map_if(is_numeric,round,0) %>% 
                                  data.frame() %>% as.tbl())))
## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated
per_idx cat_idx dt_idx et_idx geo_idx is_adj val date
1 651 3 1 0 3 0 34 2017-03-01
2 651 3 0 1 3 0 10 2017-03-01
3 651 3 1 0 4 0 144 2017-03-01
4 651 3 0 1 4 0 6 2017-03-01
5 651 3 1 0 5 0 62 2017-03-01
6 651 3 0 1 5 0 7 2017-03-01

Let’s organize the data a little bit more.

##################################################################################
# Filter to just the us, total sales at an annual rate
new.sales<-filter(df.sales, cat_idx==2 & (dt_idx==1 | et_idx==1) & geo_idx ==1 )
##################################################################################

##################################################################################
# Rearrange the data
new.sales<-new.sales %>% filter(year(date)>1999) %>% 
  select(date,val,et_idx) %>% spread(et_idx,val)
# Rename columns
colnames(new.sales)<-c("date","sales","e.sales")
##################################################################################

# Check it out:
htmlTable::htmlTable(rbind(tail(new.sales %>% 
                                  map_if(is_numeric,round,0) %>% 
                                  data.frame() %>% as.tbl())))
## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated
date sales e.sales
1 2016-10-01 568 8
2 2016-11-01 573 8
3 2016-12-01 551 7
4 2017-01-01 585 8
5 2017-02-01 587 8
6 2017-03-01 621 8

VIZ 1: Ribbon Chart

First, let’s remake a viz we’ve done before. We’ll plot a standard line chart and add a ribbon capturing uncertainty.

##################################################################################
# Compute ribbon size

new.sales <- new.sales %>% mutate( up=qnorm(0.95,mean=sales,sd=e.sales/100*sales),
                                   down=qnorm(0.05,mean=sales,sd=e.sales/100*sales))
##################################################################################


##################################################################################
# Make Plot
ggplot(data=new.sales, aes(x=date,y=sales, label = sales))+
    geom_line()+
    scale_y_continuous() + 
    scale_x_date(labels= date_format("%Y"), date_breaks="1 year" ) +
    geom_ribbon(aes(x=date,ymin=down,ymax=up),fill=plasma(5)[5],alpha=0.5)  +
    theme_minimal()+
    labs(x=NULL, y=NULL,
         title="New Home Sales (Ths. SAAR)",
         subtitle="shaded region denotes confidence interval",
           caption="@lenkiefer Source: U.S. Census Bureau and U.S. Department of Housing and Urban Development")+
    theme(plot.caption=element_text(hjust=0))

Viz 2: Gif

Instead of using a ribbon, let’s draw random samples and animate them to highlight uncertainty.

##################################################################################
# Function for sampling
myf<- function(sales,e.sales){
  rnorm(250,sales,e.sales/100*sales)
}
##################################################################################


##################################################################################
# draw samples using map2, then unnest to blow up data and group
output.data<-new.sales %>% 
  mutate(sales.samp =map2(sales,e.sales,myf)) %>%  # draw our samples
  unnest(sales.samp) %>%  # unpack the samples
  group_by(date) %>% 
  mutate(id=row_number()) %>% ungroup()  # this gives us an id for each sample
##################################################################################

Now we can animate it:

##################################################################################
# Animate plot!
##################################################################################
oopt = ani.options(interval = 0.25)
saveGIF({for (i in 1:100) {
  g<-
    ggplot(data=filter(output.data,year(date)>2015 & id<=i),aes(x=date,y=sales.samp,group=id))+
    geom_line(color="gray50",aes(alpha=ifelse(id==i,1,0.2)))+
    #geom_line(data=filter(output.data,id==i),color="red",alpha=1,size=1.05)+
    guides(alpha=F)+
    geom_point(size=3,color="black",aes(y=sales))+
    theme_minimal()+
    labs(x="",y="",
         title="New home sales (1000s, SAAR)",
         subtitle="Black dots estimates,each gray line a random sample from normal with survey standard error",
           caption="@lenkiefer Source: U.S. Census Bureau and U.S. Department of Housing and Urban Development")+
    coord_cartesian(xlim=as.Date(c("2016-01-01","2017-03-01")),ylim=c(400,700))+
    theme(plot.caption=element_text(hjust=0))
  print(g)
  ani.pause()
  print(paste(i,"out of 100"))
}

},movie.name="newsales_04_26_2017 samp ex.gif",ani.width = 600, ani.height = 450)

Viz 3: Beeswarm

We can also make a beeswarm plot (for more see here).

ggplot(data=filter(output.data,year(date)>2015),
       aes(x=date,y=sales.samp,color=sales.samp))+
  scale_color_viridis(name="")+ guides(color=F)+
  geom_quasirandom()+theme_minimal()+
  geom_point(data=filter(output.data,year(date)>2015 & id==1),
             aes(y=sales),color="black",size=3) +
  scale_x_date(date_labels="%b-%Y",date_breaks="2 months")+
    labs(x="",y="",
         title="New Home Sales (1000s SAAR)",
         subtitle="Estimates (black dots) and sampling uncertainty",
         caption="@lenkiefer Source: U.S. Census Bureau and U.S. Department of Housing and Urban Development\ncolored dots represent draws from a normal distribution centered at estimate with standard error of estimate.")+
      theme(plot.caption=element_text(hjust=0))

And we could animate it:

##################################################################################
# Animate plot!
##################################################################################
oopt = ani.options(interval = 0.2)
saveGIF({for (i in 1:200) {
  g<-
    ggplot(data=filter(output.data,date>="2016-03-01" & id<=i),
           aes(x=date,y=sales.samp,color=sales.samp,
               alpha=ifelse(id==i,1,0.2) ))+
    scale_color_viridis(name="")+ guides(color=F)+
    geom_quasirandom()+theme_minimal()+
    geom_point(data=filter(output.data,date>="2016-03-01" & id==1),
               aes(y=sales),color="black",size=3,alpha=1) +
    scale_x_date(date_labels="%b-%Y",date_breaks="2 months",
                 limits=as.Date(c("2016-02-15","2017-04-15")))+
    scale_y_continuous(limits=c(400,800))+
    guides(alpha=F)+
    labs(x="",y="",
         title="New Home Sales (1000s SAAR)",
         subtitle="Estimates (black dots) and sampling uncertainty",
         caption="@lenkiefer Source: U.S. Census Bureau and U.S. Department of Housing and Urban Development\ncolored dots represent draws from a normal distribution centered at estimate with standard error of estimate.")+
      theme(plot.caption=element_text(hjust=0))
  print(g)
  ani.pause()
  print(paste(i,"out of 250"))  #counter
}

},movie.name="new home sales swarm.gif",ani.width = 600, ani.height = 450)

Conclusion

Visualizing uncertainty can be challenging. Depending on the audience, uncertainty can be a difficult concept. I’m not sure the data visualization field has a consensus on the right way to visualize uncertainty.

But communicating uncertainty can be quite important. Maybe one of these ideas could work for you.