06 April 2016

We’re going to make this chart (and talk about it)

[Animated chart: distribution of metro area unemployment rates]

Wait, what is this?

Let’s pause the animation and look at the last frame:

[Chart: distribution of metro area unemployment rates, February 2016]

This plot shows the distribution of metro area unemployment. These data are available here.

Each dot represents a metro area, with its unemployment rate on the x axis. The rates are grouped into buckets 0.25 percentage points wide, and dots are stacked when more than one metro falls in the same bucket. For example, El Centro, California had the highest metro area unemployment rate in February 2016, at 18.6 percent. That’s the point out on the far right.
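To make the bucketing concrete, here is a minimal sketch in base R. The six rates are made-up illustrative values (only the 18.6 echoes the El Centro figure above), and the names rates, bins and stacks are just for this example.

```r
# Hypothetical rates for six metro areas (illustrative values, not BLS data)
rates <- c(4.1, 4.2, 4.6, 5.3, 5.4, 18.6)

# Bin into buckets 0.25 percentage points wide, covering 0 to 28 percent
bins <- hist(rates, breaks = seq(0, 28, 0.25), plot = FALSE)

# Keep only the non-empty buckets: the midpoint and how many metros landed there
occupied <- bins$counts > 0
stacks <- data.frame(mid = bins$mids[occupied], count = bins$counts[occupied])
print(stacks)
```

A bucket with a count of 2 becomes a stack of two dots at that bucket's midpoint.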

Most of the points are clustered near the national average (5.2% NSA in Feb 2016) with a more or less reasonable distribution around the national average.

The GIF is just a sequence of these images stitched together, as I described here. Below I describe how I build the GIF and give the code you need. This includes how to download and organize the data in a few simple steps.

Go get some data

First set up some libraries:

library("ggplot2")
library("scales")
library('ggthemes')
library("data.table")
library("animation") #needed to make gif

We’re using the data.table package and its fread function to pull in some data from the BLS. Fortunately, the BLS has organized their flat files in an easily accessible framework.

We’re going to be using the Local Area Unemployment Statistics series from the BLS. You can find all the flat files we’ll use here.

#The area data defines the area types for the Local Area Unemployment Stats
blsarea <-fread("http://download.bls.gov/pub/time.series/la/la.area",
                header=FALSE,col.names=c("area.type.code","area_code","area.text","display.level",
                                         "selectable","sort.sequence","blank"))


#The series data file tells us about the individual series
blsseries <-fread("http://download.bls.gov/pub/time.series/la/la.series")

#The measure series tells us which series is which data concept, e.g. Unemployment Rate
blsmeas<-fread("http://download.bls.gov/pub/time.series/la/la.measure",col.names=c("measure_code","measure","blank"))

#This large file gives us all the metro area stats stacked together
blsmetro<-fread("http://download.bls.gov/pub/time.series/la/la.data.60.Metro")

Now that we have our data downloaded, we can start to manipulate it to create our images and our gif. I’ve been using the excellent data.table package for R that makes these operations pretty easy. As I’ve been busy with other things my coding skills have atrophied a bit, so I’m not claiming great efficiency, but I get the job done.

First we’re going to merge together our data to create one data table we can work with. Strictly speaking these steps aren’t necessary, but including unnecessary intermediary steps helps me think through the logic. Perhaps in future we’ll review this and enhance it for efficiency.

We’re going to create a large dataset that combines the raw data with the series id, the area codes, and the measure codes. These things allow us to more naturally subset the data later.

blsbig<-merge(blsmetro,blsseries,by="series_id")
blsbig<-merge(blsbig,blsarea,by="area_code")
blsbig<-merge(blsbig,blsmeas, by="measure_code")
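One thing worth knowing about this pattern: by default merge() keeps only the rows that find a match in both tables. Here is a small sketch with toy tables. The ids and values are invented, and I use plain data frames so it runs without any packages; the same by= calls are what we pass for the data.table versions above.

```r
# Toy stand-ins for the BLS tables (hypothetical ids and values)
toy_data <- data.frame(series_id = c("S1", "S2", "S3"),
                       value = c(5.2, 18.6, 4.0))
toy_series <- data.frame(series_id = c("S1", "S2"),
                         area_code = c("A1", "A2"),
                         measure_code = c(3, 3))
toy_area <- data.frame(area_code = c("A1", "A2"),
                       area.text = c("Metro One", "Metro Two"))

# Same pattern as above: join on one key at a time
toy_big <- merge(toy_data, toy_series, by = "series_id")
toy_big <- merge(toy_big, toy_area, by = "area_code")

# S3 has no entry in toy_series, so the default (inner) join drops it
print(toy_big)
print(nrow(toy_data) - nrow(toy_big))  # number of rows lost in the joins
```

If the row count of the merged table is much smaller than the raw data, a mismatched key is a likely place to look first.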

The BLS files don’t include usable dates, but they do include year and month variables. We’ll combine them to make dates we can use.

#Create some usable dates:
blsbig$M<-substr(blsbig$period,2,3)
blsbig$date<- as.Date(ISOdate(blsbig$year,blsbig$M,1) )
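Here is the same date logic on a few hand-typed rows, as a quick sketch. The vectors year and period are made up for illustration. I am assuming the BLS convention that period M13 marks an annual average; since 13 is not a real month, those rows should come out as NA rather than a date.

```r
# Hypothetical year/period pairs in the BLS style
year <- c(2016, 2016, 2015)
period <- c("M02", "M12", "M13")

# Drop the leading "M" to get the month number as text
M <- substr(period, 2, 3)

# Build a date on the first of each month; an invalid month becomes NA
date <- as.Date(ISOdate(year, M, 1))
print(date)
```

The NA rows do no harm here because we filter on a specific month later on.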

Now we’ll create a function to make our plots for each period. The function, called mydotf, will take a dataset as an input and create a single dot plot image.

I’m excited to be able to use the new subtitle functions. I’m not sure if they are in the CRAN version of ggplot yet, but they are in the development version you can get from github. You can read about the subtitle and caption options in this excellent post from @hrbrmstr.

mydotf<-function (d){
  #bin the unemployment rates into 0.25 percentage point buckets
  myhist<-hist(d$value,plot=FALSE,breaks=seq(0,28,0.25))
  N<-length(myhist$mids)

  #start with an empty plot; the loop below adds one point per metro
  g<-ggplot()+theme_minimal()
  for (i in 1:N){
    #seq_len() gives an empty sequence for empty bins, so they add no points
    for (j in seq_len(myhist$counts[i])){
      g<-g+geom_point(data=data.frame(x=myhist$mids[i],y=j), aes(x=x,y=y),size=2,color="#00B0F0")
    }
  }

  #label for the month being plotted, e.g. Feb-2016
  mydate<-format(unique(d$date), "%b-%Y")
  gg<-g+
    scale_x_continuous(breaks=seq(0,28,1),limits=c(0,28))+
    scale_y_continuous(breaks=seq(0,50,10),limits=c(0,50))+
    #use the new caption and subtitle features
    labs(x="Unemployment Rate (%,NSA)", y="Count of metros",
         title=paste("Distribution of Metro Area Unemployment Rates in", mydate),
         subtitle="Each dot a metro area",
         caption="@lenkiefer Source: BLS")+
    theme(plot.title=element_text(margin=margin(b=10)))+
    theme(plot.subtitle=element_text(face="italic"))+
    #move the caption over to the left
    theme(plot.caption=element_text(size=8, hjust=0, margin=margin(t=15)))
  return (gg)
}

This function takes a data table holding a single month’s worth of data and generates a dot plot from it. ggplot2 has a dot plot geom (geom_dotplot), but I couldn’t get it to work the way I wanted, so I built my own from scatterplot points. The function bins the unemployment rate (value), then loops through the bins, stacking one dot for each metro area in the bin. It certainly could be more efficient, but this gets it done.
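On the efficiency point, one possible shortcut is to compute every dot's position up front and hand them to the plot in one go. This is only a sketch of that idea, not the code behind the chart above: stack_dots is a hypothetical helper name, and the input rates are invented.

```r
# Hypothetical helper: turn a vector of rates into one (x, y) row per dot
stack_dots <- function(values, breaks = seq(0, 28, 0.25)) {
  h <- hist(values, breaks = breaks, plot = FALSE)
  data.frame(
    x = rep(h$mids, times = h$counts),  # bucket midpoint, repeated once per metro
    y = sequence(h$counts)              # 1, 2, 3, ... within each bucket
  )
}

# Illustrative rates, not BLS data
dots <- stack_dots(c(4.1, 4.2, 4.6, 5.3, 5.4, 18.6))
print(dots)
```

With ggplot2 loaded, something like ggplot(dots, aes(x=x, y=y)) + geom_point(size=2, color="#00B0F0") should then draw all the dots as a single layer instead of one layer per dot.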

Make a MOOOVIE

We’re about done. All we have to do is use the animation package and loop through the years we want using our function to draw the dots.

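#Set the delay between frames (in seconds)
oopt = ani.options(interval = 0.55)
saveGIF({
  #One frame per year, using each February's unemployment rate for metro areas
  for (yy in 2000:2016){
    dy<-blsbig[year==yy & period=="M02" & area.type.code=="B" & measure_code==3]
    gy<-mydotf(dy)
    print(gy)
    ani.pause()
  }
  #Repeat the last frame 10 times so the animation lingers on 2016 before looping
  for (i2 in 1:10) {
    print(gy)
    ani.pause()
  }
},movie.name="awesome_dots_gif.gif",ani.width = 750, ani.height = 450)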
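If you want to tune the pacing, it helps to count the frames. Here is a rough sketch of the arithmetic for the loop above; the names years, hold, interval and frames are just for illustration, and actual playback timing can vary by viewer.

```r
years <- 2000:2016   # one frame per year
hold <- 10           # extra copies of the final frame
interval <- 0.55     # seconds per frame, as set in ani.options()

# The frame sequence: every year once, then the last year repeated
frames <- c(years, rep(tail(years, 1), hold))

print(length(frames))             # total number of frames
print(length(frames) * interval)  # approximate running time in seconds
```

Raising hold makes the animation rest longer on the final year before it loops; lowering interval speeds everything up.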