Do take some time to check out the post.
There’s a lot to unpack in the post, but today I just want to talk about house values. Ken’s post looks at how house values within a metro area have evolved over time. It has some nice graphics, but I thought I would add a few more.
For this post, we’re going to use American Community Survey (ACS) data from 2010 and 2015 and compare how owner-reported housing values have shifted. While we could download the public use microdata directly from the Census Bureau, we’ll instead use the Integrated Public Use Microdata Series (IPUMS).
For this exercise we will download all the owner-occupied houses within several top metros for 2010 and 2015. IPUMS is nice because it lets us select only the variables we need, and we can subset the data as well. Unfortunately, the IPUMS data does not come in R format, so we’ll download the data in Stata (!) format and convert it.
Fortunately we won’t have to use Stata (or SAS), but can open the Stata data using the haven package.
Let’s take a quick look at our IPUMS selections:
Notice that the file is not tiny, at over 400 MB. However, we’ll be using data.table and shrinking the data as we go.
Reading data and subsetting
The data file I selected was a Stata file. You can also select SAS or a flat text file. But the labeling for the Stata file worked best for me. I sometimes use SAS, but haven’t used Stata for many years. Fortunately, with haven we’ll be able to read the data into R without even having Stata.
I have saved the data as a file called usa_00005.dta in my data directory. I also want to convert the data into a data.table. These data are a representative sample of the entire United States, but for this exercise I’m going to restrict the data to the top 30 metro areas.
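Here is a minimal sketch of that read-and-convert step. So that it runs on its own, it first writes a tiny stand-in Stata file to a temporary path instead of using the real 400 MB extract. The column names (year, met2013, hhwt, perwt, valueh) are my assumption about which IPUMS variables were selected; swap in whatever your extract contains.

```r
library(haven)       # read_dta() and write_dta() handle Stata files
library(data.table)

# Stand-in for the real extract: a tiny Stata file in a temp directory.
# Column names are assumed for illustration, not taken from the actual extract.
toy <- data.frame(
  year    = c(2010, 2010, 2015, 2015),
  met2013 = c(35620, 31080, 35620, 31080),
  hhwt    = c(100, 80, 110, 90),
  perwt   = c(120, 95, 130, 100),
  valueh  = c(400000, 450000, 480000, 560000)
)
dta_path <- file.path(tempdir(), "usa_toy.dta")
write_dta(toy, dta_path)

# With the real file this would be something like read_dta("data/usa_00005.dta").
dt <- as.data.table(read_dta(dta_path))

# Keep only the columns we need, shrinking the table as we go.
dt <- dt[, .(year, met2013, hhwt, perwt, valueh)]
print(dt)
```

No Stata installation is needed at any point; haven parses the file format directly.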
We could look the metro areas up using a Census table, but instead, let’s use data.table and a little arithmetic to compute the ranking from the sample itself:
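One way to do that calculation, sketched on toy data: sum the person weights within each metro to estimate population, then sort. The column names (met2013, perwt), the choice to rank on the 2015 sample, and the cutoff n_top are assumptions for illustration.

```r
library(data.table)

# Toy person-level records; perwt (person weight) and met2013 are assumed names.
dt <- data.table(
  year    = 2015,
  met2013 = c(35620, 35620, 31080, 31080, 16980),
  perwt   = c(150, 120, 140, 90, 100)
)

n_top <- 2  # the post keeps the top 30; 2 keeps the toy example readable

# Sum person weights within each metro to estimate population, then rank.
metro_pop <- dt[year == 2015, .(pop = sum(perwt)), by = met2013][order(-pop)]
metro_pop[, rank := .I]

top_metros <- metro_pop[rank <= n_top]
print(top_metros)
```

Each sampled person stands in for perwt people, so the sum of weights is the population estimate for that metro.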
That looks reasonable, but unfortunately we only have the metro numbers, not the names. The IPUMS output contains a Stata .do file that has the CBSA codes, or we could look them up. Fortunately, I have a simple lookup file cbsanames.txt that has the names. We can merge it onto our list:
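The merge might look roughly like this. I don’t show the layout of cbsanames.txt here, so the sketch builds a small stand-in lookup table inline; the column names met2013 and name are hypothetical. With a real file, something like data.table’s fread() would read it in first.

```r
library(data.table)

# Toy ranking table, keyed by CBSA code (assumed column name: met2013).
metro_pop <- data.table(
  met2013 = c(35620, 31080, 16980),
  pop     = c(270, 230, 100)
)

# Stand-in for cbsanames.txt; the real file's columns may differ.
cbsa_names <- data.table(
  met2013 = c(16980, 31080, 35620),
  name    = c("Chicago-Naperville-Elgin, IL-IN-WI",
              "Los Angeles-Long Beach-Anaheim, CA",
              "New York-Newark-Jersey City, NY-NJ-PA")
)

# Left join so every ranked metro is kept even if its name is missing,
# then restore the population ordering (merge sorts by the join key).
named <- merge(metro_pop, cbsa_names, by = "met2013", all.x = TRUE)[order(-pop)]
print(named)
```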
Note that our data is already subset to only include homeowners, so the population we’re counting here is the number of people in homeowner households, which will give a slightly different ranking than if we weighted by total people.
Make some graphs
Now that we have our data, let’s make some graphs. Let’s first compare how the distribution of owner-reported house values shifted from 2010 to 2015. When using these data we have to be careful to remember that we have sample data. Census provides weights, so we’ll have to be sure to weight the statistics we use.
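To see why the weights matter, here is a small sketch comparing an unweighted and a weighted mean house value by year. The data are made up, and using the household weight (hhwt) for house values is my assumption about the appropriate weight.

```r
library(data.table)

# Toy household records: cheaper homes carry more weight in this example.
dt <- data.table(
  year   = c(2010, 2010, 2015, 2015),
  valueh = c(200000, 400000, 250000, 500000),
  hhwt   = c(300, 100, 300, 100)
)

# weighted.mean() from base R applies the survey weights.
stats <- dt[, .(
  unweighted = mean(valueh),
  weighted   = weighted.mean(valueh, hhwt)
), by = year]
print(stats)
```

In this toy sample the unweighted mean overstates the typical value because the expensive homes represent fewer households.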
First let’s subset the data to only include the top 12 metro areas (by our population table).
Let’s compare how house values have shifted from 2010 to 2015 with a beeswarm plot:
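A rough sketch of such a plot, assuming the ggbeeswarm package and simulated data in place of the ACS extract. The metro labels, column names, and the choice to draw a weighted subsample before plotting are all illustrative assumptions, not necessarily how the figure in this post was built.

```r
library(data.table)
library(ggplot2)
library(ggbeeswarm)  # provides geom_quasirandom(); an assumed package choice

set.seed(1)

# Simulated house values for two made-up metros in two years.
dt <- data.table(
  year   = rep(c(2010, 2015), each = 200),
  metro  = rep(c("Metro A", "Metro B"), times = 200),
  valueh = c(rlnorm(200, log(250000), 0.5), rlnorm(200, log(300000), 0.5)),
  hhwt   = sample(50:150, 400, replace = TRUE)
)

# Draw a weighted subsample within each metro and year so the plotted
# points reflect the survey weights rather than the raw sample.
plot_dt <- dt[, .SD[sample(.N, 100, replace = TRUE, prob = hhwt)],
              by = .(metro, year)]

p <- ggplot(plot_dt, aes(x = factor(year), y = valueh, color = factor(year))) +
  geom_quasirandom(alpha = 0.5, size = 0.8) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~metro) +
  labs(x = NULL, y = "Owner-reported house value (log scale)", color = "Year")
print(p)
```

Subsampling also keeps the number of plotted points manageable, which matters with a sample as large as the ACS.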
This graph shows the distribution of house values by metro and compares 2010 to 2015. We can see that the shapes of the distributions remain roughly constant, but values seem to be increasing from 2010 to 2015, which is consistent with generally rising prices over that time period.
Let’s catch our breath
Whew! Just organizing these data and getting them ready took a bit of work. Let’s pause and catch our breath. There’s a whole lot more that we can do with these data.
In a future post we’ll see what else we can glean from these housing data.