Earlier today I happened across an interesting post by Ken Steif (Twitter: @kensteif) at the Urban Spatial Analysis blog that predicts gentrification using census data.

Do take some time to check out the post.

There’s a bunch to unpack in the post, but for today I just want to talk about house values. Ken’s post looks at the evolution of house values within a metro area over time. There are some nice graphics in the post, but I thought I would add some additional graphics.

For this post, we’re going to use American Community Survey (ACS) data from 2010 and 2015 and compare how owner-reported housing values have shifted. While we could download the public use microdata directly from Census, instead we’ll use the Integrated Public Use Microdata Series (IPUMS).

## The data

For this exercise we will download all the owner-occupied houses within several top metros for 2010 and 2015. IPUMS is nice because it lets us select only the variables we need and subset the data as well. Unfortunately, the IPUMS data does not come in R format, so we’ll download the data in Stata (!) format and convert it.

Fortunately we won’t have to use Stata (or SAS), but can open the Stata data using the haven package.

Let’s take a quick look at how our IPUMS selections look:

We’ve only selected a few key variables, particularly `valueh`, which captures the value of the house, and `met2013`, which captures the metro area of the housing unit (based on 2013 OMB definitions).

Notice that the file size is not tiny at over 400 MB. However, we’ll be using data.table and shrinking the data as we go.

## Reading data and subsetting

The data file I selected was a Stata file. You can also select SAS or a flat text file. But the labeling for the Stata file worked best for me. I sometimes use SAS, but haven’t used Stata for many years. Fortunately, with haven we’ll be able to read the data into R without even having Stata.

I have saved the data as a file called *usa_00005.dta* in my data directory. I also want to convert the data into a data.table. These data are a representative sample of the entire United States, but for this exercise I’m going to restrict the data to the top 30 metro areas.
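As a rough sketch of the read-and-convert step: the snippet below writes a tiny stand-in Stata file to a temporary path and reads it back with `haven::read_dta()`, so it runs without the real extract. The column names other than `valueh` and `met2013` (here `year` and `perwt`) and all of the values are assumptions for illustration, not the contents of the actual IPUMS file.

```r
library(haven)
library(data.table)

# Toy stand-in for the IPUMS extract (values are made up).
# `year` and `perwt` are assumed IPUMS-style column names.
toy <- data.frame(
  year    = c(2010, 2010, 2015, 2015),
  met2013 = c(35620, 31080, 35620, 31080),
  valueh  = c(400000, 450000, 450000, 550000),
  perwt   = c(100, 80, 110, 90)
)

# Write the toy data in Stata format so there is something to read back.
path <- tempfile(fileext = ".dta")
write_dta(toy, path)

# With the real extract, `path` would point at data/usa_00005.dta instead.
dt <- as.data.table(read_dta(path))
```

`read_dta()` returns a tibble, and `as.data.table()` converts it so the data.table syntax used in the rest of the post is available.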

We could look the metro areas up using a Census table, but instead, let’s use data.table and a little arithmetic to compute the ranking from the sample itself:
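One way this in-sample ranking might look is sketched below on toy data: sum the person weights by metro, sort, and keep the top rows. The weight column name `perwt`, the choice of 2015 as the ranking year, and the treatment of `met2013 == 0` as "not in an identifiable metro" are assumptions here; the real post keeps the top 30 rather than the top 2.

```r
library(data.table)

# Toy person-level records (made-up weights); perwt is an assumed column name.
dt <- data.table(
  year    = 2015,
  met2013 = c(0, 35620, 35620, 31080, 16980),
  perwt   = c(500, 100, 120, 150, 90)
)

# Weighted population by metro, dropping the assumed "no metro" code 0,
# sorted from largest to smallest.
pop <- dt[year == 2015 & met2013 != 0,
          .(pop = sum(perwt)),
          by = met2013][order(-pop)]
pop[, rank := seq_len(.N)]

# Keep the largest metros (top 2 here; top 30 in the post).
top <- pop[rank <= 2]
```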

That looks reasonable, but unfortunately we only have the metro numbers, not the names. The IPUMS output contains a Stata *.do* file that has the CBSA codes, or we could look them up. Fortunately, I have a simple lookup file *cbsanames.txt* that has the names. We can merge it onto our list:
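The merge could be sketched roughly as follows. The layout of *cbsanames.txt* is not shown in this post, so the lookup table is built inline with hypothetical column names (`met2013`, `name`); with the real file it would be read in first, for example with `fread()`. The population figures are made up.

```r
library(data.table)

# Ranked metros from the previous step (toy numbers).
top <- data.table(met2013 = c(35620, 31080), pop = c(220, 150))

# Hypothetical lookup table standing in for cbsanames.txt.
cbsa <- data.table(
  met2013 = c(16980, 31080, 35620),
  name    = c("Chicago-Naperville-Elgin, IL-IN-WI",
              "Los Angeles-Long Beach-Anaheim, CA",
              "New York-Newark-Jersey City, NY-NJ-PA")
)

# Left join keeps every ranked metro even if a name is missing;
# merge() sorts by key, so restore the population ordering afterwards.
named <- merge(top, cbsa, by = "met2013", all.x = TRUE)[order(-pop)]
```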

Note that our data is already subset to only include homeowners, so the population we’re counting here is the number of people in homeowner households, which will give a slightly different ranking than if we weighted by total people.

## Make some graphs

Now that we have our data, let’s make some graphs. Let’s first compare how the distribution of owner-reported house values shifted from 2010 to 2015. When using these data we have to be careful to remember that we have sample data. Census provides weights, so we’ll have to be sure to weight the statistics we use.
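To make the weighting point concrete, here is a minimal sketch of a weighted median computed by metro and year. The helper `weighted_median()` is a hypothetical function written for this example, the household weight name `hhwt` is an assumption, and the numbers are invented.

```r
library(data.table)

# Hypothetical helper: smallest value whose cumulative weight share reaches 50%.
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= 0.5)[1]]
}

# Toy household records (made-up values); hhwt is an assumed weight column.
dt <- data.table(
  met2013 = 35620,
  year    = rep(c(2010, 2015), each = 3),
  valueh  = c(100000, 200000, 300000, 150000, 250000, 350000),
  hhwt    = c(1, 1, 8, 8, 1, 1)
)

# Weighted median house value for each metro-year cell.
med <- dt[, .(median_value = weighted_median(valueh, hhwt)),
          by = .(met2013, year)]
```

In the toy 2010 cell the unweighted median would be 200000, but the heavily weighted third record pulls the weighted median up to 300000, which is why ignoring the weights can mislead.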

Let’s revisit the beeswarm graphs we made last year to compare distributions.

First let’s subset the data to only include the top 12 metro areas (by our population table).

Let’s compare how house values have shifted from 2010 to 2015 with a beeswarm plot:
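A plot along these lines might be built as sketched below. This is not necessarily how the original figure was made: it assumes the ggbeeswarm package, uses simulated data, and handles the survey weights by drawing a weighted resample within each metro-year before plotting, which is one possible approach among several. The column names `metro` and `hhwt` are assumptions.

```r
library(data.table)
library(ggplot2)
library(ggbeeswarm)

set.seed(1)

# Simulated households: two toy metros, two years, made-up values and weights.
dt <- data.table(
  metro  = rep(c("Metro A", "Metro B"), each = 200),
  year   = rep(c(2010, 2015), times = 200),
  valueh = exp(rnorm(400, mean = 12.5, sd = 0.6)),
  hhwt   = sample(50:150, 400, replace = TRUE)
)

# Weighted resample within each metro-year so that point density
# reflects the survey weights rather than the raw sample counts.
samp <- dt[, .SD[sample(.N, 100, replace = TRUE, prob = hhwt)],
           by = .(metro, year)]

# Beeswarm-style plot: one panel per metro, years side by side, log scale.
p <- ggplot(samp, aes(x = factor(year), y = valueh, color = factor(year))) +
  geom_quasirandom(alpha = 0.5, size = 0.8) +
  scale_y_log10() +
  facet_wrap(~ metro) +
  labs(x = "Year", y = "Owner-reported house value (log scale)",
       color = "Year")
```

Printing `p` draws the figure; with the real data the facets would be the 12 selected metros.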

This graph shows the distribution of house values by metro and compares 2010 to 2015. We can see that the shapes of the distributions remain roughly constant, but values seem to be increasing from 2010 to 2015, which is consistent with generally rising prices over that time period.

## Let’s catch our breath

**Whew!** Just organizing these data and getting them ready took a bit of work. Let’s pause and catch our breath. There’s a whole lot more that we can do with these data.

In a future post we’ll see what else we can glean from these housing data.