16 April 2017

BACK IN JANUARY WE LOOKED AT HOUSING microdata from the American Community Survey Public Microdata that we collected from IPUMS. Let’s pick back up and look at these data some more. Glad you could join us.

Be sure to check out my earlier post for more discussion of the underlying data. Here we’ll pick up where we left off and make some more graphs using R.

Just a quick reminder (read the earlier post for all the details), we have a dataset that includes household level observations for the 20 largest metro areas in the United States for 2010 and 2015 (latest data available).

Below we load the data and check out its structure:

Load data

library(data.table)
library(tidyverse)
library(ggbeeswarm)
load("data/acs.RData")
str(dt.x)
## Classes 'data.table' and 'data.frame':	540325 obs. of  6 variables:
##  $ year     :Class 'labelled'  atomic [1:540325] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##   .. ..- attr(*, "label")= chr "census year"
##   .. ..- attr(*, "format.stata")= chr "%8.0g"
##   .. ..- attr(*, "labels")= Named num [1:30] 1850 1860 1870 1880 1900 1910 1920 1930 1940 1950 ...
##   .. .. ..- attr(*, "names")= chr [1:30] "1850" "1860" "1870" "1880" ...
##  $ cbsa.name: chr  "Atlanta-Sandy Springs-Roswell, GA Metro Area" "Atlanta-Sandy Springs-Roswell, GA Metro Area" "Atlanta-Sandy Springs-Roswell, GA Metro Area" "Atlanta-Sandy Springs-Roswell, GA Metro Area" ...
##  $ met2013  :Class 'labelled'  atomic [1:540325] 12060 12060 12060 12060 12060 ...
##   .. ..- attr(*, "label")= chr "metropolitan area, 2013 omb delineations"
##   .. ..- attr(*, "format.stata")= chr "%12.0g"
##   .. ..- attr(*, "labels")= Named num [1:295] 0 10420 10580 10740 10780 ...
##   .. .. ..- attr(*, "names")= chr [1:295] "not in identifiable area" "akron, oh" "albany-schenectady-troy, ny" "albuquerque, nm" ...
##  $ hhincome : atomic  17910 128000 110700 31000 70750 ...
##   ..- attr(*, "label")= chr "total household income"
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ valueh   : atomic  85000 150000 230000 78000 120000 300000 100000 70000 135000 150000 ...
##   ..- attr(*, "label")= chr "house value"
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ hhwt     : atomic  74 82 87 58 73 279 83 103 113 80 ...
##   ..- attr(*, "label")= chr "household weight"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  - attr(*, "sorted")= chr "met2013"
##  - attr(*, ".internal.selfref")=<externalptr>

I’ve cleaned the data up a bit to only include household level observations for owner households for the largest 20 metro areas. I’ve saved a data.table() called dt.x using the data described in the earlier post.

In the prior post we filtered to only the top 12 metro areas. But I’ve prepared data that filters to the top 20. If you are following along from the earlier post, just replace dt.x<-dt2[cbsa.name %in% pop.list[order(-pop)]$cbsa.name[1:12] & pernum==1] with dt.x<-dt2[cbsa.name %in% pop.list[order(-pop)]$cbsa.name[1:20] & pernum==1] and everything else should follow.

As before, let’s randomly sample 2,000 observations from these 20 large metro areas using the household weights.

# First draw a random sample of 2,000 observations from each year/metro combination
dt.samp<-dt.x[,.SD[sample(.N,min(.N,2000),prob=hhwt)],by = c("year","cbsa.name") ]

#Get our list:
metro.list<-dt.samp[,list(obs=.N),by=c("year","met2013","cbsa.name")]

# check observations:
# obs is number of observations
htmlTable::htmlTable(metro.list,col.rgroup = c("none", "#F7F7F7"))
year met2013 cbsa.name obs
1 2010 12060 Atlanta-Sandy Springs-Roswell, GA Metro Area 2000
2 2015 12060 Atlanta-Sandy Springs-Roswell, GA Metro Area 2000
3 2010 14460 Boston-Cambridge-Newton, MA-NH Metro Area 2000
4 2015 14460 Boston-Cambridge-Newton, MA-NH Metro Area 2000
5 2010 16980 Chicago-Naperville-Elgin, IL-IN-WI Metro Area 2000
6 2015 16980 Chicago-Naperville-Elgin, IL-IN-WI Metro Area 2000
7 2010 19100 Dallas-Fort Worth-Arlington, TX Metro Area 2000
8 2015 19100 Dallas-Fort Worth-Arlington, TX Metro Area 2000
9 2010 19740 Denver-Aurora-Lakewood, CO Metro Area 2000
10 2015 19740 Denver-Aurora-Lakewood, CO Metro Area 2000
11 2010 19820 Detroit-Warren-Dearborn, MI Metro Area 2000
12 2015 19820 Detroit-Warren-Dearborn, MI Metro Area 2000
13 2010 26420 Houston-The Woodlands-Sugar Land, TX Metro Area 2000
14 2015 26420 Houston-The Woodlands-Sugar Land, TX Metro Area 2000
15 2010 31080 Los Angeles-Long Beach-Anaheim, CA Metro Area 2000
16 2015 31080 Los Angeles-Long Beach-Anaheim, CA Metro Area 2000
17 2010 33100 Miami-Fort Lauderdale-West Palm Beach, FL Metro Area 2000
18 2015 33100 Miami-Fort Lauderdale-West Palm Beach, FL Metro Area 2000
19 2010 33460 Minneapolis-St. Paul-Bloomington, MN-WI Metro Area 2000
20 2015 33460 Minneapolis-St. Paul-Bloomington, MN-WI Metro Area 2000
21 2010 35620 New York-Newark-Jersey City, NY-NJ-PA Metro Area 2000
22 2015 35620 New York-Newark-Jersey City, NY-NJ-PA Metro Area 2000
23 2010 37980 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area 2000
24 2015 37980 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area 2000
25 2010 38060 Phoenix-Mesa-Scottsdale, AZ Metro Area 2000
26 2015 38060 Phoenix-Mesa-Scottsdale, AZ Metro Area 2000
27 2010 40140 Riverside-San Bernardino-Ontario, CA Metro Area 2000
28 2015 40140 Riverside-San Bernardino-Ontario, CA Metro Area 2000
29 2010 41180 St. Louis, MO-IL Metro Area 2000
30 2015 41180 St. Louis, MO-IL Metro Area 2000
31 2010 41740 San Diego-Carlsbad, CA Metro Area 2000
32 2015 41740 San Diego-Carlsbad, CA Metro Area 2000
33 2010 41860 San Francisco-Oakland-Hayward, CA Metro Area 2000
34 2015 41860 San Francisco-Oakland-Hayward, CA Metro Area 2000
35 2010 42660 Seattle-Tacoma-Bellevue, WA Metro Area 2000
36 2015 42660 Seattle-Tacoma-Bellevue, WA Metro Area 2000
37 2010 45300 Tampa-St. Petersburg-Clearwater, FL Metro Area 2000
38 2015 45300 Tampa-St. Petersburg-Clearwater, FL Metro Area 2000
39 2010 47900 Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area 2000
40 2015 47900 Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area 2000

As before, we can use a beeswarm plot to plot the distribution of house values by metro area.

#make a beewarm plot:
ggplot(data=dt.samp,
         aes(y=factor(year),
             x=valueh,color=log(valueh)))+
  geom_quasirandom(alpha=0.5,size=0.5)+
  theme_minimal()+
  scale_color_viridis(name="House Value\n$,log scale\n",discrete=F,option="D",end=0.95,direction=-1,
                      limits=c(log(10000),log(1.4e6)),
                      breaks=c(log(10000),log(100000),log(1e6)),
                      labels=c("$10k","$100k","$1,000k") ) +
  scale_x_log10(limits=c(10000,1.4e6),breaks=c(10000,100000,1000000),
                labels=c("$10k","$100k","$1,000k") )+
  labs(y="",x="House Value ($, log scale)",
       caption="@lenkiefer Source: Census 1-year American Community Survey (2010 & 2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.",
       title="House value distribution by Metro")+
  theme(axis.text.x = element_text(size=6),
        strip.text.x = element_text(size = 5),
        plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)),
        plot.title=element_text(size=14),
        legend.position = "top"   )+
  facet_wrap(~cbsa.name,ncol=4)+theme()

plot of chunk 04-16-2017-swarm-1

We see quite a bit of variation across metro areas. But how does the distribution of house values compare to the distribution of incomes? Let’s look at Washington D.C. and plot estimates of homeowner household income versus the value of those homeowner’s homes. Note, we’ve already subsetted on homeowner households, so this metric is different from a more commonly house value to income ratio that uses all households. We are excluding renters from this analysis

  ggplot(data=dt.x[year==2015 & met2013==47900,], aes(valueh,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House Value"))+
  geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_log10(label=scales::dollar,limits=c(10000,2e6))+
  theme_minimal()+labs(x="Income and Home Values",y="",
                        title="House value and homeowner income distribution in Washington D.C. in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="top",
          axis.text.y=element_blank(),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y")

plot of chunk 04-16-2017-dens-1

And we can construct a small multiple:

  ggplot(data=dt.x[year==2015 & met2013>=4900,], aes(valueh,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House Value"))+
  geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_log10(label=scales::dollar,limits=c(10000,3e6))+
  theme_minimal()+labs(x="Income and Home Values",y="",
                        title="House value and homeowner income distribution by Metro in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="top",
          axis.text.y=element_blank(),
          strip.text.x = element_text(size = 5),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y",ncol=4)

plot of chunk 04-16-2017-dens-2

But how does the income of each homeowner compare to the value of their home? Let’s take each homeowner in our sample and construct the ratio of their house value to their income.

dt.x<-dt.x[,pti:=valueh/hhincome]  #price to income ratio

Now we can plot the distribution of this ratio for Washington D.C.:

  ggplot(data=dt.x[year==2015 & met2013==47900,], aes(pti,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House value to income ratio"))+
  #geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_continuous(limits=c(0,20),breaks=seq(0,20,2))+
  theme_minimal()+labs(x="House vaue to income ratio",y="",
                        title="Ratio of house value to homeowner income distribution in Washington D.C. in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="none",
          axis.text.y=element_blank(),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y")

plot of chunk 04-16-2017-dens-3

The average ratio is about 4 in Washington D.C., but it varies a lot by metro area. Let’s create another small multiple.

  ggplot(data=dt.x[year==2015 & met2013>=4900,], aes(pti,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House value to income ratio"))+
  #geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_continuous(limits=c(0,20),breaks=seq(0,20,2))+
  theme_minimal()+labs(x="House vaue to income ratio",y="",
                        title="Ratio of house value to homeowner income distribution by metro in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="none",
          axis.text.y=element_blank(),
          strip.text.x = element_text(size = 5),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y",ncol=4)

plot of chunk 04-16-2017-dens-4

Let’s compute the weighted average ratio (excluding any values less than 0 or greater than 20) by metro area in 2010 and 2015:

dt.wsum<-dt.x[pti>0 & pti<20,list(pti.wm=weighted.mean(pti,hhwt,na.rm=T)),by=c("year","cbsa.name")]

ggplot(data=dt.wsum[year==2015,], 
       aes(x=pti.wm,y=reorder(cbsa.name,-pti.wm),
           label=paste("  ",round(pti.wm,1),cbsa.name),
           color=factor(year)))+
  geom_point(size=3)+theme_minimal()+
  geom_text(hjust=0)+
  scale_x_continuous(limits=c(2.5,9),breaks=seq(2,8,1))+
  scale_color_viridis(name="Year",discrete=T,end=0.85)+
  theme(plot.title=element_text(size=14),legend.position="none",
        axis.text.y=element_blank(),
        plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  labs(x="Weighted average homeowner value to household income ratio",
       y="Metro",
       title="Ratio of house value to homeowner income distribution by metro in 2015",
       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")

plot of chunk 04-16-2017-ratio

This plot makes clear that there is a pretty wide variation in the average house value to household income ratio. We can look at the full distributions through an animated gif (using our usual tweenr approach).

library(tweenr)
library(animation)
library(tidyverse)
library(tidyr)



#  Function for use with tweenr
myf<-function (m, yy=2015){
  d.out<-copy(dt.samp)[met2013==m]
  d.out %>% map_if(is.character, as.factor) %>% as.data.frame -> d.out
  return(d.out)
}

# get list of metros from our summay data: metro.list

# Circle back to Atlanta (met2013==12060)
my.list2<-lapply(c(unique(metro.list$met2013),12060),myf)  


#use tweenr to interploate
tf <- tween_states(my.list2,tweenlength= 3,
                   statelength=2, ease=rep('cubic-in-out',2),nframes=200)
tf<-data.table(tf) #convert output into data table

#Animate plot
oopt = ani.options(interval = 0.2)
saveGIF({for (i in 1:max(tf$.frame)) { #loop over frames
  g<-
    ggplot(data=tf[.frame==i & pti<= 20 & hhincome>10000 & valueh > 5000 ],
           aes(y=factor(year),
               x=pti,color=pti))+
    geom_quasirandom(alpha=0.5,size=0.75)+
    theme_minimal()+
    scale_color_viridis(name="House value to\nhousehold income\nratio",
                        discrete=F,option="D",
                        limits=c(0,20),
                        end=0.95,direction=-1  #,
                        ) +
    #scale_x_log10(limits=c(10000,1.4e6),breaks=c(10000,100000,1000000),   
    #  labels=c("$10k","$100k","$1,000k") )+
    scale_x_continuous(limits=c(0,20),breaks=seq(0,20,2.5))+
    theme(plot.title=element_text(size=14))+
    theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
    theme(plot.margin=unit(c(0.25,0.25,0.25,0.25),"cm"))+
    theme(legend.position = "top")+
    labs(y="",x="House value to household income ratio",
         caption="@lenkiefer Source: Census 1-year American Community Survey (2010 & 2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.\nTo avoid extreme overplotting, 2000 observations sampled at random (using weights),\nonly includes homeowner households with > $10,000 income and an estimated house value > $5,000 & cases where ratio <= 20",
         title="House value to income ratio distribution by Metro",
         subtitle=head(tf[.frame==i,],1)$cbsa.name)+
    theme(axis.text.x = element_text(size=8))+  #coord_flip()+
    #facet_wrap(~year)+
    theme(strip.text.x = element_text(size = 6))
  print(g)
  ani.pause()
  print(i)}
},movie.name="tween acs value 04 16 2017 v2.gif",ani.width = 750, ani.height = 400)

Run it and you get:

metro pti ratios