IN THIS POST I WANT TO SHARE SOME R data wrangling strategy and use it to prepare an update to some global house price plots I shared last year.

In last year’s post I did some data manipulation by hand and mouse in Excel before getting into R. In this post I’m going to use the newly updated readxl library to do the data manipulations entirely in R. If you follow along, then you should be able to use this code to recreate my graphs.

This should prove to be a major improvement over my previous workflows. Because of the realities of my professional life I cannot escape Excel. Actually, I probably wouldn’t want to escape it even if I could. Despite its limitations and frequent misuse it is still my favorite application.

However, while I’ve gotten pretty fast at doing some fairly complex manipulations in Excel (see this post from last year for an example) it isn’t exactly reproducible. Using a scripting language in R should help with making the workflow more reproducible. Also, by using a scripting language you can automate tasks and easily scale to larger problems. I often encounter situations where I would need to work with hundreds of workbooks/worksheets.

I’m going to lean heavily on the ideas in this post on the readxl workflow to prepare our data.

Some data

Let’s update our global house price charts and make a couple new ones.

First we’re going to need to gather some data on house prices. Fortunately, the Dallas Federal Reserve Bank has compiled statistics on international house price trends. Fed researchers have gone through the hard work of collecting data for many countries and harmonizing the series so they are more easily comparable. Read about their hard work and the details here.

The data are available in a convenient spreadsheet (2016Q4 data). We’re going to proceed assuming that the hp1604.xlsx spreadsheet is saved in a data folder.

Using readxl

I began this update attempting to read the data in the following manner (which works). In this example, which we will improve upon, I first read in each separate sheet and then do some data manipulations.

First load our libraries:

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(ggthemes,quietly=T,warn.conflicts=F)
library(readxl)
library(stringr)

Now we could do things this way:

###############################################################################
#### Read in HPI data  ##########################################
df<-read_excel("data/hp1604.xlsx", sheet = "HPI")
colnames(df)[1]<-"cycle"  # rename first column
df$year<-substr(df$cycle,1,4) #create a year
df$month<-substr(df$cycle,7,7) #create a
df$date<-as.Date(ISOdate(df$year,df$month,1))
df %>% select(-X__2,-cycle,-year,-month) %>% 
  gather(country,hpi,-date) ->hpi.df
###############################################################################

###############################################################################
#### Read in Real HPI data  ##########################################
df<-read_excel("data/hp1604.xlsx", sheet = "RHPI")
colnames(df)[1]<-"cycle"
df$year<-substr(df$cycle,1,4)
df$month<-substr(df$cycle,7,7)
df$date<-as.Date(ISOdate(df$year,df$month,1))
df %>% select(-X__2,-cycle,-year,-month) %>% 
  gather(country,rhpi,-date) ->rhpi.df
###############################################################################

###############################################################################
#### Read in Disposable Income data  ############################
df<-read_excel("data/hp1604.xlsx", sheet = "PDI")
colnames(df)[1]<-"cycle"
df$year<-substr(df$cycle,1,4)
df$month<-substr(df$cycle,7,7)
df$date<-as.Date(ISOdate(df$year,df$month,1))
df %>% select(-X__2,-cycle,-year,-month) %>% 
  gather(country,pdi,-date) ->pdi.df
###############################################################################

###############################################################################
#### Read in Rea Disposable Income data  ############################
df<-read_excel("data/hp1604.xlsx", sheet = "RPDI")
colnames(df)[1]<-"cycle"
df$year<-substr(df$cycle,1,4)
df$month<-substr(df$cycle,7,7)
df$date<-as.Date(ISOdate(df$year,df$month,1))
df %>% select(-X__2,-cycle,-year,-month) %>% 
  gather(country,rpdi,-date) ->rpdi.df
###############################################################################

###############################################################################
#### Merge data together data  ############################
###  Requires data.table() package ************************
dt<-merge(hpi.df,rhpi.df,by=c("date","country"))
dt<-merge(dt,pdi.df,by=c("date","country"))
dt<-merge(dt,rpdi.df,by=c("date","country"))
dt<-data.table(dt)[year(date)>0,]
###############################################################################

While this works, and isn’t too bad, it could get really old if we had 10s or hundreds of worksheets to chew through. Fortunately, we have a better alternatively using the readxl workflow combined with map_df from the purrr library (available in the tidyverse).

# path to data
path <-"data/hp1604.xlsx"

###########################################################################
#  We'll omit this step, hard coding instead

#  xl.list<-path %>% excel_sheets()   #get a list of sheet names
###########################################################################

df <- c("HPI","RHPI","PDI","RPDI") %>%   #iterate over four sheets
      set_names() %>% 
  
      # Now for something magical!
  
      map_df(~ read_excel(path = path, sheet = .x), .id = "sheet")

# print a table using the htmlTable library, round numeric to 0 digits for readability 
# Note we won't round in analysis)

htmlTable::htmlTable(rbind(head(df %>% 
                                  map_if(is_numeric,round,0) %>% 
                                  data.frame() %>% as.tbl())))
## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated
sheet X__1 Australia Belgium Canada Switzerland Germany Denmark Spain Finland France UK Ireland Italy Japan S..Korea Luxembourg Netherlands Norway New.Zealand Sweden US S..Africa Croatia Israel X__2 Aggregate
1 HPI
2 HPI 1975:Q1 8 15 16 49 52 16 9 13 11 6 3 8 60 9 7 15 13 7 15 17 3 0 0 22
3 HPI 1975:Q2 8 16 16 48 53 16 10 14 12 6 4 8 60 9 8 16 13 7 15 18 3 0 0 23
4 HPI 1975:Q3 8 17 17 48 53 17 10 14 12 6 4 8 61 10 9 17 14 8 16 18 3 0 0 23
5 HPI 1975:Q4 8 18 17 47 54 17 11 14 12 6 4 8 61 10 10 18 14 8 16 18 3 0 0 23
6 HPI 1976:Q1 9 19 18 46 55 18 11 14 13 6 4 8 61 11 10 19 14 8 17 18 3 0 0 24

Now we have data. The first column sheet has values of c("HPI","RHPI","PDI","RPDI") corresponding to the four sheets we read in. We also have the date variable stored in the X__1 column. It’s not a date, so let’s correct that.

df$year<-substr(df$X__1,1,4)
df$month<-substr(df$X__1,7,7)

#quarterly data so multiply month by 3
df$date<-as.Date(ISOdate(df$year,as.numeric(df$month)*3,1)) 

# Now we don't need year, month or X__1 any more, let's drop them
# we also don't need X__2 which corresponds to a blank column so let's drop it
df<-select(df,-year,-month,-X__1,-X__2)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    676 obs. of  26 variables:
##  $ sheet      : chr  "HPI" "HPI" "HPI" "HPI" ...
##  $ Australia  : num  NA 7.6 7.74 8.04 8.29 ...
##  $ Belgium    : num  NA 15.2 15.9 16.7 17.7 ...
##  $ Canada     : num  NA 16.2 16.4 17.1 17.3 ...
##  $ Switzerland: num  NA 48.8 48.2 47.7 47.1 ...
##  $ Germany    : num  NA 52 52.7 53.3 54.1 ...
##  $ Denmark    : num  NA 15.8 16.2 17 17.1 ...
##  $ Spain      : num  NA 8.71 9.75 9.96 10.7 ...
##  $ Finland    : num  NA 13.5 13.6 13.8 14.2 ...
##  $ France     : num  NA 11.1 11.5 11.9 12.4 ...
##  $ UK         : num  NA 5.92 6.03 6.15 6.25 ...
##  $ Ireland    : num  NA 3.42 3.65 3.87 4.08 ...
##  $ Italy      : num  NA 8.19 8.06 7.97 7.91 ...
##  $ Japan      : num  NA 60.1 60.3 60.5 60.7 ...
##  $ S. Korea   : num  NA 8.7 9.23 9.82 10.43 ...
##  $ Luxembourg : num  NA 7.49 8.19 8.87 9.53 ...
##  $ Netherlands: num  NA 14.9 15.8 16.8 18 ...
##  $ Norway     : num  NA 13.4 13.5 13.7 14.1 ...
##  $ New Zealand: num  NA 7.42 7.43 7.54 7.71 ...
##  $ Sweden     : num  NA 14.6 15.1 15.6 16.2 ...
##  $ US         : num  NA 17.3 17.6 17.6 18 ...
##  $ S. Africa  : num  NA 3.23 3.28 3.29 3.37 ...
##  $ Croatia    : num  NA 6.88e-05 7.22e-05 7.57e-05 7.97e-05 ...
##  $ Israel     : num  NA 0.00342 0.00344 0.00357 0.00387 ...
##  $ Aggregate  : num  NA 22.5 22.8 23.1 23.4 ...
##  $ date       : Date, format: NA "1975-03-01" ...

Now we have a data.frame() (or tbl) that has 26 columns. The first column correspons to the sheet in the original .xlsx file and the last column is our date variable.

Gather the data

Now let’s tidy the data by gathering the country columns:

df2<-df %>% gather(country, value, 2:25) # I happen to know data is in colums 2:25

# same business for printing
htmlTable::htmlTable(head(df2 %>% map_if(is_numeric,round,0) %>% data.frame() %>% as.tbl()))
## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated
sheet date country value
1 HPI Australia
2 HPI 1975-03-01 Australia 8
3 HPI 1975-06-01 Australia 8
4 HPI 1975-09-01 Australia 8
5 HPI 1975-12-01 Australia 8
6 HPI 1976-03-01 Australia 9

This dataset is pretty tidy, but for the analysis we want to do not the easiest to work with.

Spread the data

We have the variable names stored in the sheet column, but we’d like to spread those variables out. Let’s spread it out:

df3<-df2 %>% spread(sheet,value)

# same business for printing
htmlTable::htmlTable(head(df3 %>% map_if(is_numeric,round,0) %>% data.frame() %>% as.tbl()))
## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated
date country HPI PDI RHPI RPDI
1 1975-03-01 Aggregate 22 19 66 60
2 1975-03-01 Australia 8 14 39 73
3 1975-03-01 Belgium 15 23 45 66
4 1975-03-01 Canada 16 19 59 71
5 1975-03-01 Croatia 0 0 73 8
6 1975-03-01 Denmark 16 19 57 70

Now we have a dataset that will be easy to work with in near future. We’ve got a date variable and a country indicator. We also have four columns corresponding to the four spreadsheets:

  • HPI : nominal house price index (not adjusted for inflation)
  • RHPI : real house price index
  • PDI : nominal personal disposable income (not adjusted for inflation)
  • RPDI : real personal disposable income

Now we can create some plots!

Data Wrangling

Using readxl and the tidyverse I was able to do some nifty data wrangling. In the past I have been able to do these things in Excel or through brute force with some repetitive code. The functions of the tidyverse, particularly from readxl and purrr enable us to get these data into shape in a compact, easily readable, reproducible, and elegant fashion.

Next time I face a thorny Excel problem I can modify this approach and apply it. How could it work for you?

Bonus: Some gifs

Below, without additional commentary, I leave a few animated gifs.

global hpi gif

global hpi gif

global hpi gif

global hpi gif