Wrangling employment data, plotting trends

We will get back to house prices soon.

IN THIS POST I WANT TO EXPLORE EMPLOYMENT TRENDS at the state and metro area. Today the U.S. Bureau of Labor Statistics (BLS) released data on state and metro area employment trends. Last month we looked at unemployment trends. Today we’ll look at trends in nonfarm employment.

Wrangling the data

We will be importing, preparing, and plotting our data with R. We can get the data pretty easily using the files BLS prepares, though we have to do a little bit of work to organize the data. Fortunately, the data wrangling isn’t too bad, and made easier by the data table package.

For details about the data files we are using check out this file.

Let’s load the data and merge on area names:

######################
## Load Libraries ##
######################
library(data.table,quietly=T)
library(tidyverse,quietly=T)
library(plotly,quietly=T)
library(scales,quietly=T)

# download big data file

emp.data<-fread("https://download.bls.gov/pub/time.series/sm/sm.data.54.TotalNonFarm.All")
## Warning in fread("https://download.bls.gov/pub/time.series/sm/sm.data.
## 54.TotalNonFarm.All"): Bumped column 4 to type character on data row
## 521667, field contains '-'. Coercing previously read values in this
## column from logical, integer or numeric back to character which may not
## be lossless; e.g., if '00' and '000' occurred before they will now be just
## '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too
## (if they occurred in this column before the bump). If this matters please
## rerun and set 'colClasses' to 'character' for this column. Please note that
## column type detection uses a sample of 1,000 rows (100 rows at 10 points)
## so hopefully this message should be very rare. If reporting to datatable-
## help, please rerun and include the output from verbose=TRUE.
# download series info

emp.series<-fread("https://download.bls.gov/pub/time.series/sm/sm.series")

emp.list<-emp.series[industry_code==0 # get all employment
           & data_type_code==1 # get employment in thousands
           & seasonal=="U",]  # get seasonally adjusted data]

emp.area<-fread("https://download.bls.gov/pub/time.series/sm/sm.area",
                col.names=c("area_code","area_name","drop"))[,c("area_code","area_name"),with=F]
## Warning in fread("https://download.bls.gov/pub/time.series/sm/sm.area", :
## Starting data input on line 2 and discarding line 1 because it has too few
## or too many items to be column names or data: area_code area_name
emp.st<-fread("https://download.bls.gov/pub/time.series/sm/sm.state",
              col.names=c("state_code","state_name","drop"))[,c("state_code","state_name"),with=F]
## Warning in fread("https://download.bls.gov/pub/time.series/sm/sm.state", :
## Starting data input on line 2 and discarding line 1 because it has too few
## or too many items to be column names or data: state_code state_name
# merge data
emp.dt<-merge(emp.data,emp.list,by="series_id",all.y=T)

#create month variable
emp.dt=emp.dt[,month:=as.numeric(substr(emp.dt$period,2,3))]
# (this assignment is to get around knitr/data table printing error)
# see e.g. http://stackoverflow.com/questions/15267018/knitr-gets-tricked-by-data-table-assignment

# M13 = Annual average, drop it:
emp.dt<-emp.dt[month<13,]

#create date variable
emp.dt$date<- as.Date(ISOdate(emp.dt$year,emp.dt$month,1) ) 

# merge on area and state codes
emp.dt<-merge(emp.dt,emp.area,by="area_code")
emp.dt<-merge(emp.dt,emp.st,by="state_code")

Now that we have these data, let’s take a quick look at the structure of our data with str():

str(emp.dt)
## Classes 'data.table' and 'data.frame':   216868 obs. of  21 variables:
##  $ state_code      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ area_code       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ series_id       : chr  "SMU01000000000000001" "SMU01000000000000001" "SMU01000000000000001" "SMU01000000000000001" ...
##  $ year            : int  1939 1939 1939 1939 1939 1939 1939 1939 1939 1939 ...
##  $ period          : chr  "M01" "M02" "M03" "M04" ...
##  $ value           : chr  "394.10" "396.90" "404.20" "388.50" ...
##  $ footnote_codes.x: chr  "" "" "" "" ...
##  $ supersector_code: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ industry_code   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_type_code  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ seasonal        : chr  "U" "U" "U" "U" ...
##  $ benchmark_year  : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ footnote_codes.y: logi  NA NA NA NA NA NA ...
##  $ begin_year      : int  1939 1939 1939 1939 1939 1939 1939 1939 1939 1939 ...
##  $ begin_period    : chr  "M01" "M01" "M01" "M01" ...
##  $ end_year        : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ end_period      : chr  "M08" "M08" "M08" "M08" ...
##  $ month           : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ date            : Date, format: "1939-01-01" "1939-02-01" ...
##  $ area_name       : chr  "Statewide" "Statewide" "Statewide" "Statewide" ...
##  $ state_name      : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "state_code"

At the moment we have a bunch of irrelevant variables for our purposes. The key variables we want to keep are value that has the area employment (in thousands), the date variables and the location variables. Let’s drop unneeded variables and add some transformations.

emp.dt=emp.dt[,c("state_name","area_name","date","year","month","value"),with=F]


emp.dt=emp.dt[,emp:=as.numeric(value)] #convert value to numeric
## Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
# Compute year-over-year change in employment and year-over-year percent change
emp.dt=emp.dt[,emp.yoy:=emp-shift(emp,12,fill=NA),by=c("area_name","state_name")]
emp.dt=emp.dt[,emp.pc:=(emp-shift(emp,12,fill=NA))/shift(emp,12,fill=NA),by=c("area_name","state_name")]
emp.dt=emp.dt[,max.emp.st:=max(emp),by=c("state_name")]
emp.dt=emp.dt[,type:=ifelse(area_name=="Statewide","State","Metro")]

# drop states in c("Puerto Rico","Virgin Islands")

emp.dt=emp.dt[year>1989  &!(state_name %in% c("Puerto Rico","Virgin Islands")),]

# compute max and min percent change by year
emp.dt=emp.dt[,pc.max:=max(emp.pc),by=c("date","type")]
emp.dt=emp.dt[,pc.min:=min(emp.pc),by=c("date","type")]

Now that we’ve added these transformations, let’s plot the data. We have 487 areas that we’re tracking (436 metro/micro areas and 50 states plus the District of Columbia). All these series are too much, so let’s restrict our attention to Ohio.

We’ll start with a time series plot of year-over-year percent changes in employment. We’ll also add recession shading using geom_rect() based on NBER Recession Dates.

#set up recessions:
recessions.df = read.table(textConnection(
  "Peak, Trough
  1857-06-01, 1858-12-01
  1860-10-01, 1861-06-01
  1865-04-01, 1867-12-01
  1869-06-01, 1870-12-01
  1873-10-01, 1879-03-01
  1882-03-01, 1885-05-01
  1887-03-01, 1888-04-01
  1890-07-01, 1891-05-01
  1893-01-01, 1894-06-01
  1895-12-01, 1897-06-01
  1899-06-01, 1900-12-01
  1902-09-01, 1904-08-01
  1907-05-01, 1908-06-01
  1910-01-01, 1912-01-01
  1913-01-01, 1914-12-01
  1918-08-01, 1919-03-01
  1920-01-01, 1921-07-01
  1923-05-01, 1924-07-01
  1926-10-01, 1927-11-01
  1929-08-01, 1933-03-01
  1937-05-01, 1938-06-01
  1945-02-01, 1945-10-01
  1948-11-01, 1949-10-01
  1953-07-01, 1954-05-01
  1957-08-01, 1958-04-01
  1960-04-01, 1961-02-01
  1969-12-01, 1970-11-01
  1973-11-01, 1975-03-01
  1980-01-01, 1980-07-01
  1981-07-01, 1982-11-01
  1990-07-01, 1991-03-01
  2001-03-01, 2001-11-01
  2007-12-01, 2009-06-01"), sep=',',
colClasses=c('Date', 'Date'), header=TRUE)

recessions.trim = subset(recessions.df, Peak >= min(emp.dt$date) )


# Plot employment series for Ohio:
g1<-
  ggplot(data=emp.dt[state_name=="Ohio" & type=="Metro"])+
  geom_rect(data=recessions.trim, aes(xmin=Peak, xmax=Trough, ymin=-Inf, ymax=+Inf), fill='gray', alpha=0.5)+
  geom_line(aes(x=date,y=emp.pc,group=area_name))+theme_minimal()+
  facet_wrap(~area_name,ncol=3)+scale_y_continuous(labels=percent)+
  geom_hline(color="black",yintercept=0)+
  labs(x="",y="",title="Annual percent change in total nonfarm employment",
       subtitle="Ohio Metro Areas (NSA)",
       caption="@lenkiefer Source: U.S. Bureau of Labor Statistics")+
  theme(plot.caption=element_text(hjust=0),
        plot.subtitle=element_text(face="italic"))

g1
## Warning: Removed 120 rows containing missing values (geom_path).

These data show that across most metro areas employment growth is picking up since the end of the Great Recession. Let’s zoom in and just look at the most recent month’s data:

g2<-
  ggplot(data=emp.dt[state_name=="Ohio" & date=="2016-12-01"])+
  geom_point(aes(y=reorder(area_name,emp.pc),x=emp.pc,group=area_name,color=type),size=3,alpha=0.82)+
  theme_minimal()+scale_color_manual(name="",values=c("black","red"))+
  #facet_wrap(~area_name,ncol=3)+
  scale_x_continuous(labels=percent)+
  geom_vline(color="black",xintercept=0)+
  labs(x="",y="",title="Annual percent change in total nonfarm employment",
       subtitle="Ohio Metro Areas: December 2016 (NSA)",
       caption="@lenkiefer Source: U.S. Bureau of Labor Statistics")+
  theme(plot.caption=element_text(hjust=0),legend.position="top",
        plot.subtitle=element_text(face="italic"))

g2

These data match those in Table 3 of the BLS press release. They show that while employment is growing in the Buckeye state, some metros are still facing shrinking employment.

Add interactions with plotly

We can add some interactions and animations with plotly. See this post for more on using plotly.

Let’s plot how state employment growth has varied in December of each year since 1990. We’ll use geom_jitter to plot each state overlaid in a strip plot. I’d like to use a beeswarm plot, but ggplotly doesn’t support it.

g3<-
  
  ggplot(data=emp.dt[area_name=="Statewide"&month==12],
         aes(x=date,y=emp.pc,color=emp.pc,frame=year,label=state_name))+
  geom_jitter(alpha=0.82,size=3)+
  scale_color_viridis(name="% Change",label=scales::percent)+
  theme_minimal()+
  scale_y_continuous(label=scales::percent)+labs(y="",x="",
                                                 title="Annual % Change in Nonfarm Employment (Dec/Dec) by State<br>@lenkiefer Source: BLS")

p<-ggplotly(g3) %>% animation_opts(frame=2000,transition=600,redraw=T) 
htmlwidgets::saveWidget(as.widget(p), "plotly-1.html")

You can see a fullscreen interactive version here.

Compare state line plots

Let’s try another animation, showing how employment growth varies by state and over time.

g4<-
  ggplot(data=emp.dt[area_name=="Statewide"], aes(x=date,y=emp.pc,group=area_name,frame=state_name))+
  geom_line()+
  theme_minimal()+
  scale_y_continuous(label=scales::percent)+labs(y="",x="",
                                                 title="Annual % Change in Nonfarm Employment (Dec/Dec) by State<br>@lenkiefer Source: BLS")

p<-ggplotly(g4) %>% animation_opts(frame=2000,transition=600,redraw=T) 
htmlwidgets::saveWidget(as.widget(p), "plotly-2.html")

You can see a fullscreen interactive version here.

 Share!