03 August 2017

I LOOK AT A LOT OF DATA and the thing about data is it’s not always what it seems to be. A lot of data are uncertain, and based on estimates.

We’ve talked about this before. See, for example, this post on visualizing uncertainty in housing data. In this post I’m going to combine one of my favorite new plot types, the joyplot (see this post), with another of my favorite plot types, the beeswarm plot (see this post for more on beeswarm plots). These plots can help us visualize uncertainty.

As usual we’ll use R to create our plots.

# Setup

In this post I’m going to generate some simple random variables. I’m also going to use some purrr tricks I picked up from Jenny Bryan’s excellent purrr tutorial to help manage our simulated data.

For this example, I’m just going to simulate random variables from a small number (`N=10`) of distributions. From each of the N distributions I’m going to take 100 pseudorandom draws. Each distribution will be a normal distribution with its own mean and standard deviation. We’ll draw the means from a standard normal distribution and the standard deviations from a uniform distribution. This will give us some “Normal” looking (see what I did there?) random variables with just enough variation to be visually interesting.

First, let’s load our libraries and draw some data.
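A minimal sketch of this step, assuming the tidyverse is installed. The seed and the object name `df_meta` are my own placeholders, and the standard deviations are assumed to come from a uniform distribution on (0, 1), so the values will not match the table that follows.

```r
library(tidyverse)  # tibble, dplyr, purrr, tidyr, ggplot2, forcats

set.seed(20170803)  # arbitrary seed (assumption), only for reproducibility

N <- 10   # number of distributions
S <- 100  # draws per distribution

# One row of metadata per distribution
df_meta <- tibble(
  id   = factor(1:N),
  mean = rnorm(N),  # means drawn from a standard normal
  sd   = runif(N)   # standard deviations drawn from Uniform(0, 1)
)

df_meta
```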

| id | mean | sd |
|---|---|---|
| 1 | -1.706 | 0.685 |
| 2 | -0.292 | 0.761 |
| 3 | -1.118 | 0.082 |
| 4 | 0.183 | 0.688 |
| 5 | -1.693 | 0.75 |
| 6 | 1.146 | 1 |
| 7 | -1.988 | 0.164 |
| 8 | -0.597 | 0.33 |
| 9 | -0.446 | 0.105 |
| 10 | -0.39 | 0.818 |

Now we have some metadata, but what we’d like to do is generate samples for each distribution described in the table above.

This is a purrrfect time to use purrr and its powerful `map()` functions. We can generate a whole bunch of data by running the bit below:
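Here is one way that bit could look. It is a sketch rather than the exact original: the names `df_meta`, `df_sim`, and `draws`, and the seed, are placeholders of mine. `pmap()` walks over the three inputs in parallel and recycles the length-one `S` across all rows.

```r
library(tidyverse)

set.seed(20170803)  # arbitrary seed (assumption)
N <- 10
S <- 100
df_meta <- tibble(id = factor(1:N), mean = rnorm(N), sd = runif(N))

df_sim <- df_meta %>%
  # For each row, draw S values from Normal(mean, sd) and store them
  # as a one-column tibble inside the list-column `draws`
  mutate(draws = pmap(
    list(S, mean, sd),
    function(n, mu, sigma) tibble(x = rnorm(n, mean = mu, sd = sigma))
  )) %>%
  # Unpack the list-column into ordinary rows
  unnest(draws)

dim(df_sim)
```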

In the code above I used the `pmap()` function with three arguments: `S` for the sample size, `mean` for the population mean, and `sd` for the population standard deviation. Before I called `unnest()`, my data frame had a data frame of draws stored in each row of a list-column. Using `unnest()` unpacked those into ordinary rows. I ended up with 1000 observations, corresponding to 100 draws from each of 10 different distributions.

Now we’re ready for some visualization.

# Joyplots
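A joyplot stacks one density curve per group along the y axis. The sketch below uses `ggridges`, the successor to the `ggjoy` package, so `geom_density_ridges()` stands in for the older `geom_joy()`. The simulated data, the object names, and the seed are placeholders of mine.

```r
library(tidyverse)
library(ggridges)  # successor to ggjoy

set.seed(20170803)  # arbitrary seed (assumption)
N <- 10
S <- 100
df_sim <- tibble(id = factor(1:N), mean = rnorm(N), sd = runif(N)) %>%
  mutate(draws = pmap(list(S, mean, sd),
                      function(n, mu, sigma) tibble(x = rnorm(n, mu, sigma)))) %>%
  unnest(draws)

# One ridge (density curve) per distribution id
p_joy <- ggplot(df_sim, aes(x = x, y = id)) +
  geom_density_ridges(scale = 2, alpha = 0.6) +
  theme_ridges() +
  labs(x = "x", y = "distribution id", title = "Joyplot of simulated draws")

p_joy
```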

### A note on forcats

It might be desirable to order the distributions by something other than id. For example, we have the true mean saved in our data frame. We can reorder the levels of the id factor using the `forcats::fct_reorder()` function (see forcats tidyverse page). I’ve found it useful, so I’m posting this bit here (you’re welcome, future Len).
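A small sketch of the reordering, with placeholder object names and seed. `fct_reorder()` sorts the factor levels by a summary of a second variable (the median by default). Since the true mean is constant within each id, the levels end up in order of the true mean.

```r
library(tidyverse)
library(ggridges)

set.seed(20170803)  # arbitrary seed (assumption)
N <- 10
S <- 100
df_meta <- tibble(id = factor(1:N), mean = rnorm(N), sd = runif(N))
df_sim <- df_meta %>%
  mutate(draws = pmap(list(S, mean, sd),
                      function(n, mu, sigma) tibble(x = rnorm(n, mu, sigma)))) %>%
  unnest(draws)

# Reorder the levels of id by the true mean of each distribution
df_sorted <- df_sim %>%
  mutate(id = fct_reorder(id, mean))

levels(df_sorted$id)

# Ridges now run from the lowest true mean (bottom) to the highest (top)
p_sorted <- ggplot(df_sorted, aes(x = x, y = id)) +
  geom_density_ridges(alpha = 0.6)
```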

# Beeswarms

The joyplots are really cool, but there are other plots to show distributions. Plots like [beeswarm](https://github.com/eclarke/ggbeeswarm) plots. Let’s make one using our data.
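A sketch of a beeswarm-style plot with `ggbeeswarm`. I use `geom_quasirandom()` here, which spreads points by local density; `geom_beeswarm()` is the other option in the package. The data, names, and seed are placeholders, and `coord_flip()` is just my choice to keep the same horizontal layout as the joyplot.

```r
library(tidyverse)
library(ggbeeswarm)

set.seed(20170803)  # arbitrary seed (assumption)
N <- 10
S <- 100
df_sim <- tibble(id = factor(1:N), mean = rnorm(N), sd = runif(N)) %>%
  mutate(draws = pmap(list(S, mean, sd),
                      function(n, mu, sigma) tibble(x = rnorm(n, mu, sigma)))) %>%
  unnest(draws)

# One swarm per distribution, colored by the true mean
p_bee <- ggplot(df_sim, aes(x = id, y = x, color = mean)) +
  geom_quasirandom(alpha = 0.6) +
  coord_flip() +
  labs(x = "distribution id", y = "x", title = "Beeswarm of simulated draws")

p_bee
```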

# Joyswarms

But wait. If joyplots are cool, and beeswarm plots are hot, what do we get if we put them together? Something pretty awesome I think. And it’s super easy.
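Putting them together is a matter of adding both layers to one plot. This sketch assumes `ggridges` (in place of `ggjoy`) and `ggbeeswarm` 0.7 or later, where the `orientation` argument replaces the older `groupOnX = FALSE`. Data, names, and seed are placeholders of mine.

```r
library(tidyverse)
library(ggridges)
library(ggbeeswarm)

set.seed(20170803)  # arbitrary seed (assumption)
N <- 10
S <- 100
df_sim <- tibble(id = factor(1:N), mean = rnorm(N), sd = runif(N)) %>%
  mutate(draws = pmap(list(S, mean, sd),
                      function(n, mu, sigma) tibble(x = rnorm(n, mu, sigma)))) %>%
  unnest(draws)

p_joyswarm <- ggplot(df_sim, aes(x = x, y = id)) +
  # Layer 1: density ridges
  geom_density_ridges(scale = 1.5, alpha = 0.4) +
  # Layer 2: the raw draws, spread vertically around each ridge's baseline
  geom_quasirandom(orientation = "y", width = 0.2, alpha = 0.5, size = 0.8) +
  theme_ridges() +
  labs(x = "x", y = "distribution id", title = "Joyswarm of simulated draws")

p_joyswarm
```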

Oh yeah!

And of course, they’re kind of fun to animate. Like so:

## Joyswarms in the wild

These could be useful out in the wild. I’ve been experimenting with some real world data (check my Twitter feed). And in the future, maybe I’ll share some examples here.

Oh wait, never mind. We’ll do one now.

I just saw a tweet from the St. Louis Federal Reserve on how long new homes are staying on the market.

That’s a great candidate for a joyswarm plot. Let’s make one.

We’ll use the approach I outlined here to get the data. Let’s download the data and plot a time series.

These data are not seasonally adjusted, so you can see the pronounced seasonal variation. You can also see large volatility around the Great Recession. Let’s first create a joyplot using month on the Y axis. We’ll also add dots showing the 2017 values.
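The shape of that plot can be sketched without the download step. The data frame `df_homes` below is a made-up seasonal series standing in for the real one, not the Federal Reserve data, and its column names are my own. The idea is one ridge per calendar month plus a point layer filtered to 2017.

```r
library(tidyverse)
library(ggridges)

set.seed(3)  # arbitrary seed (assumption)

# Stand-in monthly series (simulated, NOT the real housing data)
df_homes <- tibble(
  date  = seq(as.Date("1990-01-01"), as.Date("2017-06-01"), by = "month"),
  year  = as.integer(format(date, "%Y")),
  mnum  = as.integer(format(date, "%m")),
  month = factor(month.abb[mnum], levels = month.abb),
  value = 5 + sin(2 * pi * mnum / 12) + rnorm(length(date), sd = 0.7)
)

p_month <- ggplot(df_homes, aes(x = value, y = month)) +
  # Layer 1: distribution of values for each calendar month, all years
  geom_density_ridges(alpha = 0.5) +
  # Layer 2: dots marking the 2017 observations
  geom_point(data = filter(df_homes, year == 2017), color = "red", size = 2) +
  labs(x = "value", y = "month")

p_month
```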

Here we can see that the 2017 values are quite low by historical standards.

Let’s try a joyswarm:

### It’s hard to beat good ol’ boxplots

Though I like joyplots, beeswarm plots, and joyswarm plots a lot, when it comes to clarity it’s hard to beat good ol’ boxplots (see below). However, these plots do create some visual interest and help stimulate thinking about the data.
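For comparison, here is a boxplot sketch using the simulated draws from the Setup section, with placeholder names and seed. It shows the same ten distributions as the joyplot and beeswarm sketches, summarized by median, quartiles, and outliers.

```r
library(tidyverse)

set.seed(20170803)  # arbitrary seed (assumption)
N <- 10
S <- 100
df_sim <- tibble(id = factor(1:N), mean = rnorm(N), sd = runif(N)) %>%
  mutate(draws = pmap(list(S, mean, sd),
                      function(n, mu, sigma) tibble(x = rnorm(n, mu, sigma)))) %>%
  unnest(draws)

# One box per distribution, flipped to match the joyplot layout
p_box <- ggplot(df_sim, aes(x = id, y = x)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "distribution id", y = "x", title = "Boxplots of simulated draws")

p_box
```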