Joyful dataviz

2017/07/26

dataviz / R

I took some time off over the summer, away from data visualizations. It’s good to get away from time to time, but oh boy did I miss out.

I wasn’t gone long, but in that short time people came up with some wonderful things.

Let me dive back into it with some joyful dataviz.

Joy plots

Claus Wilke authored a new R package, ggjoy, for creating joy plots (see the ggjoy vignette). See also the post from Revolution Analytics with some other joy plot examples and some more background. Let’s try them out.
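
For anyone following along, here is a minimal setup sketch. The package list is an assumption inferred from the functions called in this post (geom_joy, scale_fill_viridis, percent, year, and so on), and the names pkgs, installed and missing_pkgs are just illustrative. Note that ggjoy has since been deprecated in favor of ggridges, where geom_joy() became geom_density_ridges().

```r
# Packages the code in this post appears to rely on (an assumption based on
# the functions it calls). Attaching data.table after lubridate means
# data.table's year() and month() are the ones in use.
pkgs <- c("tidyverse", "ggjoy", "viridis", "scales", "lubridate", "data.table")

# Check what is installed before attaching anything
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  for (p in pkgs) library(p, character.only = TRUE)
}
```

The table further down also calls htmlTable::htmlTable(), so that package needs to be installed too, though it never has to be attached.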

## Picking joint bandwidth of 0.0138

## Warning: Removed 38 rows containing non-finite values (stat_joy).

As David Smith over at Revolution Analytics points out, the plots can obscure some data. Nevertheless I find them evocative enough that some data obfuscation might be worth the interest they create. I dunno, let’s try it out.

Data

Let’s revisit the house price data we used before. The data have monthly observations for more than 300 metro areas tracked in the Freddie Mac House Price Index.

We’ll pick up with a data frame called df.metro that looks like so:

# Show the last six rows, rounding numeric columns for display
htmlTable::htmlTable(df.metro %>%
                       tail() %>%
                       mutate_if(is.numeric, round, 0) %>%
                       as_tibble())

	date	geo	hpi	type	state	id	year	month	mname
1	2016-10-01	Yuma, AZ	144	metro	AZ	7	2016	10	Oct
2	2016-11-01	Yuma, AZ	145	metro	AZ	7	2016	11	Nov
3	2016-12-01	Yuma, AZ	145	metro	AZ	7	2016	12	Dec
4	2017-01-01	Yuma, AZ	146	metro	AZ	7	2017	1	Jan
5	2017-02-01	Yuma, AZ	146	metro	AZ	7	2017	2	Feb
6	2017-03-01	Yuma, AZ	146	metro	AZ	7	2017	3	Mar

The variable hpi is the house price index (normalized so that January 2000 = 100). The variables hpa and hpa12, which are used below but do not appear in the excerpt above, are the one-month and 12-month percent changes in the house price index. The other variables tell us the date, the metro name (geo), the primary state for the metro area (state), the year and the month.
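
In case it helps to see how percent-change variables like hpa and hpa12 can be built from an index, here is a small base R sketch. It is not the code used to build df.metro: the helper pct_change and the toy data frame are hypothetical, and the index values are made up. Changes are stored as fractions (0.05 means 5 percent), which matches the percent axis labels used in the plots.

```r
# Percent change over k periods, as a fraction (0.05 = 5%).
# The first k values have no earlier observation and come back as NA.
pct_change <- function(x, k = 1) {
  n <- length(x)
  if (n <= k) return(rep(NA_real_, n))
  prev <- c(rep(NA_real_, k), x[seq_len(n - k)])
  x / prev - 1
}

# Toy index values (made up for illustration, not Freddie Mac data)
toy <- data.frame(
  geo = rep(c("Metro A", "Metro B"), each = 14),
  hpi = c(100 * 1.01^(0:13), 100 * 0.99^(0:13))
)

# Apply within each metro so a change never spans two different areas.
# This assumes rows are already sorted by date within geo.
toy$hpa   <- ave(toy$hpi, toy$geo, FUN = function(x) pct_change(x, 1))
toy$hpa12 <- ave(toy$hpi, toy$geo, FUN = function(x) pct_change(x, 12))
```

The grouping step is the part that is easy to forget: without it, the first observations of one metro would be compared against the last observations of the previous one.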

Distributions

Let’s construct a joyplot showing how the 12-month appreciation in house prices varies across metro areas by year.

ggplot(data=filter(df.metro,year(date)>1979 & month==3 ),
       aes(x=hpa12,y=reorder(factor(year),-year), fill= ..x.. )) +
  geom_joy_gradient(rel_min_height = 0.01,scale=3)+
  scale_fill_viridis(discrete=F)+
  labs(x="12-month percent change in house prices",y="year",
       title="Distribution of metro house price growth",
       subtitle="March of each year",
       caption="@lenkiefer Source: Freddie Mac House Price Index in March of each year, distribution across metro areas")+
  theme_minimal()+theme(legend.position="none")+
  scale_x_continuous(labels=percent)

## Picking joint bandwidth of 0.0102

This plot shows how the distribution of metro house price appreciation has shifted over time. We can see the wide dispersion during the housing bust, when some metros saw house prices decline by more than 20 percent annually.
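
The "Picking joint bandwidth" message printed with each plot says that every curve is smoothed with one shared bandwidth. To make that concrete, here is a rough base R sketch of what a joy plot does under the hood: estimate one kernel density per group with a common bandwidth, then stack the curves with a vertical offset. The data are simulated, and taking the joint bandwidth as the average of the per-group rule-of-thumb bandwidths is an assumption for illustration; check the package source for the exact rule it uses.

```r
set.seed(1)
# Simulated growth rates for three "years" (not real house price data)
sim <- data.frame(
  year = rep(c(2006, 2009, 2016), each = 200),
  growth = c(rnorm(200, 0.08, 0.04),
             rnorm(200, -0.06, 0.07),
             rnorm(200, 0.05, 0.02))
)
groups <- split(sim$growth, sim$year)

# One shared bandwidth so every curve gets the same amount of smoothing
# (assumed rule: mean of the per-group bw.nrd0 values)
bw <- mean(vapply(groups, bw.nrd0, numeric(1)))

# Evaluate all densities on the same x range
rng <- range(sim$growth)
dens <- lapply(groups, density, bw = bw,
               from = rng[1] - 3 * bw, to = rng[2] + 3 * bw, n = 256)

# Height scaling, loosely analogous to scale= in geom_joy:
# the tallest curve reaches about 3 rows above its own baseline
scale <- 3
peak <- max(vapply(dens, function(d) max(d$y), numeric(1)))

plot(0, 0, type = "n", yaxt = "n",
     xlim = range(dens[[1]]$x), ylim = c(1, length(dens) + scale),
     xlab = "simulated 12-month growth", ylab = "year")
axis(2, at = seq_along(dens), labels = names(dens))

# Draw from the top row down so lower rows sit in front of the rows above
for (i in rev(seq_along(dens))) {
  d <- dens[[i]]
  polygon(d$x, i + scale * d$y / peak,
          col = adjustcolor("steelblue", alpha.f = 0.75), border = "white")
}
```

The overlap is where the obscured data comes from: any part of a curve that sits behind the row below it is only partly visible, even with some transparency.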

Let’s compare the distributions across two large states: metros in California and metros in Texas.

ggplot(data=filter(df.metro,year(date)>1979 & month(date)==3 & 
                     state %in% c("TX","CA")),
       aes(x=hpa12,y=reorder(factor(year),-year))) +
  geom_joy(rel_min_height = 0.01,alpha=0.75,aes(fill=state))+
  scale_fill_viridis(discrete=T)+
  labs(x="12-month percent change in house prices",y="year",
       title="Distribution of metro house price growth: CA and TX",
       subtitle="March of each year",
       caption="@lenkiefer Source: Freddie Mac House Price Index in March of each year, distribution across metro areas")+
  theme_minimal()+theme(legend.position="top")+
  scale_x_continuous(labels=percent)

## Picking joint bandwidth of 0.0153

We can see that while Texas house prices held up pretty well during the Great Recession, many California markets saw big declines. Since then, California has rebounded and in recent years California metros have had faster house price growth than Texas metros.

How about that crazy plot?

In my first plot, I intentionally left off the labels. But it’s just the CA vs TX plot above with all 50 states + D.C. included. Let’s recreate it with a few labels.

ggplot(data=filter(df.metro,year(date)>1979 & month(date)==3 ),
       aes(x=hpa12,y=factor(year(date))))+
  geom_joy(rel_min_height = 0.01,alpha=0.75,aes(fill=state))+
  theme_minimal()+theme(legend.position="none")+
  labs(x="Annual % change in house prices",y="Year",
       title="Distribution of metro house price growth by state",
       subtitle="Each curve is the estimated distribution across metros in one state",
       caption="@lenkiefer Source: Freddie Mac House Price Index in March")+
  scale_x_continuous(labels=percent)

## Picking joint bandwidth of 0.0138

## Warning: Removed 38 rows containing non-finite values (stat_joy).

Home sales

I think joyplots work well if there are some important differences across groups. For example, I think they work to highlight seasonal patterns. The graph below shows monthly existing home sales, not seasonally adjusted.

## Picking joint bandwidth of 29.9

Is it useful?

Joyplots certainly are useful insofar as they make an impression. Other chart types are probably better for many applications. For example, if you really want to compare distributions, good old boxplots are hard to beat unless you have a very odd distribution.
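
To see what "a very odd distribution" can mean in practice, here is a small base R sketch with simulated data (nothing here comes from the house price data, and all the names are illustrative). One group is bell-shaped and the other has two humps. The boxplot gives no hint of the two humps, while a density curve, the building block of a joyplot, shows them right away.

```r
set.seed(42)
# Simulated data: one bell-shaped group and one two-humped group
sim <- data.frame(
  group = rep(c("unimodal", "bimodal"), each = 400),
  value = c(rnorm(400, mean = 0, sd = 2),
            rnorm(200, mean = -2, sd = 0.5),
            rnorm(200, mean = 2, sd = 0.5))
)

# Boxplots: each group is reduced to a box and whiskers,
# so the gap in the middle of the bimodal group is invisible
bp <- boxplot(value ~ group, data = sim, horizontal = TRUE,
              main = "Boxplots hide the two humps")

# Kernel density of the bimodal group: the dip around zero is obvious
d_bi <- density(sim$value[sim$group == "bimodal"])
plot(d_bi, main = "Density of the bimodal group")
```

For well-behaved, single-peaked data the boxplot is still the easier chart to read side by side.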

But without a doubt, making joyplots is a joyful exercise. And when is joy not useful?

Don’t discount the importance of being able to resonate with your intended audience. It might well be worth it to sacrifice some clarity if it buys us joy.