Of kernels and beeswarms: Comparing the distribution of house values to household income

2017/04/16

mortgage / dataviz / statistics / microdata

BACK IN JANUARY WE LOOKED AT HOUSING microdata from the American Community Survey Public Microdata that we collected from IPUMS. Let’s pick back up and look at these data some more. Glad you could join us.

Be sure to check out my earlier post for more discussion of the underlying data. Here we’ll pick up where we left off and make some more graphs using R.

Just a quick reminder (read the earlier post for all the details), we have a dataset that includes household level observations for the 20 largest metro areas in the United States for 2010 and 2015 (latest data available).

Below we load the data and check out its structure:

Load data

library(data.table)
library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## between():   dplyr, data.table
## filter():    dplyr, stats
## first():     dplyr, data.table
## lag():       dplyr, stats
## last():      dplyr, data.table
## transpose(): purrr, data.table

library(ggbeeswarm)
library(viridis)

## Loading required package: viridisLite

load("data/acs.RData")
str(dt.x)

## Classes 'data.table' and 'data.frame':   540325 obs. of  6 variables:
##  $ year     :Class 'labelled'  atomic [1:540325] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##   .. ..- attr(*, "label")= chr "census year"
##   .. ..- attr(*, "format.stata")= chr "%8.0g"
##   .. ..- attr(*, "labels")= Named num [1:30] 1850 1860 1870 1880 1900 1910 1920 1930 1940 1950 ...
##   .. .. ..- attr(*, "names")= chr [1:30] "1850" "1860" "1870" "1880" ...
##  $ cbsa.name: chr  "Atlanta-Sandy Springs-Roswell, GA Metro Area" "Atlanta-Sandy Springs-Roswell, GA Metro Area" "Atlanta-Sandy Springs-Roswell, GA Metro Area" "Atlanta-Sandy Springs-Roswell, GA Metro Area" ...
##  $ met2013  :Class 'labelled'  atomic [1:540325] 12060 12060 12060 12060 12060 ...
##   .. ..- attr(*, "label")= chr "metropolitan area, 2013 omb delineations"
##   .. ..- attr(*, "format.stata")= chr "%12.0g"
##   .. ..- attr(*, "labels")= Named num [1:295] 0 10420 10580 10740 10780 ...
##   .. .. ..- attr(*, "names")= chr [1:295] "not in identifiable area" "akron, oh" "albany-schenectady-troy, ny" "albuquerque, nm" ...
##  $ hhincome : atomic  17910 128000 110700 31000 70750 ...
##   ..- attr(*, "label")= chr "total household income"
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ valueh   : atomic  85000 150000 230000 78000 120000 300000 100000 70000 135000 150000 ...
##   ..- attr(*, "label")= chr "house value"
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ hhwt     : atomic  74 82 87 58 73 279 83 103 113 80 ...
##   ..- attr(*, "label")= chr "household weight"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  - attr(*, "sorted")= chr "met2013"
##  - attr(*, ".internal.selfref")=<externalptr>

I’ve cleaned the data up a bit to only include household level observations for owner households for the largest 20 metro areas. I’ve saved a data.table() called dt.x using the data described in the earlier post.

In the prior post we filtered to only the top 12 metro areas. But I’ve prepared data that filters to the top 20. If you are following along from the earlier post, just replace dt.x<-dt2[cbsa.name %in% pop.list[order(-pop)]$cbsa.name[1:12] & pernum==1] with dt.x<-dt2[cbsa.name %in% pop.list[order(-pop)]$cbsa.name[1:20] & pernum==1] and everything else should follow.

As before, let’s randomly sample 2,000 observations from these 20 large metro areas using the household weights.

# First draw a random sample of 2,000 observations from each year/metro combination
dt.samp<-dt.x[,.SD[sample(.N,min(.N,2000),prob=hhwt)],by = c("year","cbsa.name") ]

#Get our list:
metro.list<-dt.samp[,list(obs=.N),by=c("year","met2013","cbsa.name")]

# check observations:
# obs is number of observations
htmlTable::htmlTable(metro.list,col.rgroup = c("none", "#F7F7F7"))

	year	met2013	cbsa.name	obs
1	2010	12060	Atlanta-Sandy Springs-Roswell, GA Metro Area	2000
2	2015	12060	Atlanta-Sandy Springs-Roswell, GA Metro Area	2000
3	2010	14460	Boston-Cambridge-Newton, MA-NH Metro Area	2000
4	2015	14460	Boston-Cambridge-Newton, MA-NH Metro Area	2000
5	2010	16980	Chicago-Naperville-Elgin, IL-IN-WI Metro Area	2000
6	2015	16980	Chicago-Naperville-Elgin, IL-IN-WI Metro Area	2000
7	2010	19100	Dallas-Fort Worth-Arlington, TX Metro Area	2000
8	2015	19100	Dallas-Fort Worth-Arlington, TX Metro Area	2000
9	2010	19740	Denver-Aurora-Lakewood, CO Metro Area	2000
10	2015	19740	Denver-Aurora-Lakewood, CO Metro Area	2000
11	2010	19820	Detroit-Warren-Dearborn, MI Metro Area	2000
12	2015	19820	Detroit-Warren-Dearborn, MI Metro Area	2000
13	2010	26420	Houston-The Woodlands-Sugar Land, TX Metro Area	2000
14	2015	26420	Houston-The Woodlands-Sugar Land, TX Metro Area	2000
15	2010	31080	Los Angeles-Long Beach-Anaheim, CA Metro Area	2000
16	2015	31080	Los Angeles-Long Beach-Anaheim, CA Metro Area	2000
17	2010	33100	Miami-Fort Lauderdale-West Palm Beach, FL Metro Area	2000
18	2015	33100	Miami-Fort Lauderdale-West Palm Beach, FL Metro Area	2000
19	2010	33460	Minneapolis-St. Paul-Bloomington, MN-WI Metro Area	2000
20	2015	33460	Minneapolis-St. Paul-Bloomington, MN-WI Metro Area	2000
21	2010	35620	New York-Newark-Jersey City, NY-NJ-PA Metro Area	2000
22	2015	35620	New York-Newark-Jersey City, NY-NJ-PA Metro Area	2000
23	2010	37980	Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area	2000
24	2015	37980	Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area	2000
25	2010	38060	Phoenix-Mesa-Scottsdale, AZ Metro Area	2000
26	2015	38060	Phoenix-Mesa-Scottsdale, AZ Metro Area	2000
27	2010	40140	Riverside-San Bernardino-Ontario, CA Metro Area	2000
28	2015	40140	Riverside-San Bernardino-Ontario, CA Metro Area	2000
29	2010	41180	St. Louis, MO-IL Metro Area	2000
30	2015	41180	St. Louis, MO-IL Metro Area	2000
31	2010	41740	San Diego-Carlsbad, CA Metro Area	2000
32	2015	41740	San Diego-Carlsbad, CA Metro Area	2000
33	2010	41860	San Francisco-Oakland-Hayward, CA Metro Area	2000
34	2015	41860	San Francisco-Oakland-Hayward, CA Metro Area	2000
35	2010	42660	Seattle-Tacoma-Bellevue, WA Metro Area	2000
36	2015	42660	Seattle-Tacoma-Bellevue, WA Metro Area	2000
37	2010	45300	Tampa-St. Petersburg-Clearwater, FL Metro Area	2000
38	2015	45300	Tampa-St. Petersburg-Clearwater, FL Metro Area	2000
39	2010	47900	Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area	2000
40	2015	47900	Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area	2000

As before, we can use a beeswarm plot to plot the distribution of house values by metro area.

#make a beewarm plot:
ggplot(data=dt.samp,
         aes(y=factor(year),
             x=valueh,color=log(valueh)))+
  geom_quasirandom(alpha=0.5,size=0.5)+
  theme_minimal()+
  scale_color_viridis(name="House Value\n$,log scale\n",discrete=F,option="D",end=0.95,direction=-1,
                      limits=c(log(10000),log(1.4e6)),
                      breaks=c(log(10000),log(100000),log(1e6)),
                      labels=c("$10k","$100k","$1,000k") ) +
  scale_x_log10(limits=c(10000,1.4e6),breaks=c(10000,100000,1000000),
                labels=c("$10k","$100k","$1,000k") )+
  labs(y="",x="House Value ($, log scale)",
       caption="@lenkiefer Source: Census 1-year American Community Survey (2010 & 2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.",
       title="House value distribution by Metro")+
  theme(axis.text.x = element_text(size=6),
        strip.text.x = element_text(size = 5),
        plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)),
        plot.title=element_text(size=14),
        legend.position = "top"   )+
  facet_wrap(~cbsa.name,ncol=4)+theme()

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 76 rows containing missing values (position_quasirandom).

## Warning: Removed 98 rows containing missing values (position_quasirandom).

## Warning: Removed 69 rows containing missing values (position_quasirandom).

## Warning: Removed 100 rows containing missing values (position_quasirandom).

## Warning: Removed 53 rows containing missing values (position_quasirandom).

## Warning: Removed 93 rows containing missing values (position_quasirandom).

## Warning: Removed 112 rows containing missing values (position_quasirandom).

## Warning: Removed 224 rows containing missing values (position_quasirandom).

## Warning: Removed 109 rows containing missing values (position_quasirandom).

## Warning: Removed 77 rows containing missing values (position_quasirandom).

## Warning: Removed 150 rows containing missing values (position_quasirandom).

## Warning: Removed 75 rows containing missing values (position_quasirandom).

## Warning: Removed 98 rows containing missing values (position_quasirandom).

## Warning: Removed 80 rows containing missing values (position_quasirandom).

## Warning: Removed 165 rows containing missing values (position_quasirandom).

## Warning: Removed 359 rows containing missing values (position_quasirandom).

## Warning: Removed 93 rows containing missing values (position_quasirandom).

## Warning: Removed 69 rows containing missing values (position_quasirandom).

## Warning: Removed 97 rows containing missing values (position_quasirandom).

## Warning: Removed 94 rows containing missing values (position_quasirandom).

## Warning: Removed 7 rows containing missing values (geom_point).

We see quite a bit of variation across metro areas. But how does the distribution of house values compare to the distribution of incomes? Let’s look at Washington D.C. and plot estimates of homeowner household income versus the value of those homeowner’s homes. Note, we’ve already subsetted on homeowner households, so this metric is different from a more commonly house value to income ratio that uses all households. We are excluding renters from this analysis

  ggplot(data=dt.x[year==2015 & met2013==47900,], aes(valueh,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House Value"))+
  geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_log10(label=scales::dollar,limits=c(10000,2e6))+
  theme_minimal()+labs(x="Income and Home Values",y="",
                        title="House value and homeowner income distribution in Washington D.C. in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="top",
          axis.text.y=element_blank(),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y")

## Warning in self$trans$transform(x): NaNs produced

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 393 rows containing non-finite values (stat_density).

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning: Removed 261 rows containing non-finite values (stat_density).

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

And we can construct a small multiple:

  ggplot(data=dt.x[year==2015 & met2013>=4900,], aes(valueh,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House Value"))+
  geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_log10(label=scales::dollar,limits=c(10000,3e6))+
  theme_minimal()+labs(x="Income and Home Values",y="",
                        title="House value and homeowner income distribution by Metro in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="top",
          axis.text.y=element_blank(),
          strip.text.x = element_text(size = 5),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y",ncol=4)

## Warning in self$trans$transform(x): NaNs produced

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 4313 rows containing non-finite values (stat_density).

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning: Removed 7994 rows containing non-finite values (stat_density).

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

But how does the income of each homeowner compare to the value of their home? Let’s take each homeowner in our sample and construct the ratio of their house value to their income.

dt.x<-dt.x[,pti:=valueh/hhincome]  #price to income ratio

Now we can plot the distribution of this ratio for Washington D.C.:

  ggplot(data=dt.x[year==2015 & met2013==47900,], aes(pti,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House value to income ratio"))+
  #geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_continuous(limits=c(0,20),breaks=seq(0,20,2))+
  theme_minimal()+labs(x="House vaue to income ratio",y="",
                        title="Ratio of house value to homeowner income distribution in Washington D.C. in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="none",
          axis.text.y=element_blank(),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y")

## Warning: Removed 600 rows containing non-finite values (stat_density).

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

The average ratio is about 4 in Washington D.C., but it varies a lot by metro area. Let’s create another small multiple.

  ggplot(data=dt.x[year==2015 & met2013>=4900,], aes(pti,weight=hhwt))+
  geom_density(alpha=0.5,aes(fill="House value to income ratio"))+
  #geom_density(aes(hhincome,fill="Household Income"),alpha=0.5)+
  scale_color_manual(name="",values=c("black","black"))+
  scale_fill_viridis(discrete=T,end=0.85,name="")+
  scale_x_continuous(limits=c(0,20),breaks=seq(0,20,2))+
  theme_minimal()+labs(x="House vaue to income ratio",y="",
                        title="Ratio of house value to homeowner income distribution by metro in 2015",
                       subtitle="kernel density estimates using household weights",
                       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")+
    theme(plot.title=element_text(size=14),legend.position="none",
          axis.text.y=element_blank(),
          strip.text.x = element_text(size = 5),
          plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  facet_wrap(~cbsa.name,scales="free_y",ncol=4)

## Warning: Removed 14571 rows containing non-finite values (stat_density).

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

Let’s compute the weighted average ratio (excluding any values less than 0 or greater than 20) by metro area in 2010 and 2015:

dt.wsum<-dt.x[pti>0 & pti<20,list(pti.wm=weighted.mean(pti,hhwt,na.rm=T)),by=c("year","cbsa.name")]

ggplot(data=dt.wsum[year==2015,], 
       aes(x=pti.wm,y=reorder(cbsa.name,-pti.wm),
           label=paste("  ",round(pti.wm,1),cbsa.name),
           color=factor(year)))+
  geom_point(size=3)+theme_minimal()+
  geom_text(hjust=0)+
  scale_x_continuous(limits=c(2.5,9),breaks=seq(2,8,1))+
  scale_color_viridis(name="Year",discrete=T,end=0.85)+
  theme(plot.title=element_text(size=14),legend.position="none",
        axis.text.y=element_blank(),
        plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
  labs(x="Weighted average homeowner value to household income ratio",
       y="Metro",
       title="Ratio of house value to homeowner income distribution by metro in 2015",
       caption="@lenkiefer Source: Census 1-year American Community Survey (2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.")

This plot makes clear that there is a pretty wide variation in the average house value to household income ratio. We can look at the full distributions through an animated gif (using our usual tweenr approach).

library(tweenr)
library(animation)
library(tidyverse)
library(tidyr)



#  Function for use with tweenr
myf<-function (m, yy=2015){
  d.out<-copy(dt.samp)[met2013==m]
  d.out %>% map_if(is.character, as.factor) %>% as.data.frame -> d.out
  return(d.out)
}

# get list of metros from our summay data: metro.list

# Circle back to Atlanta (met2013==12060)
my.list2<-lapply(c(unique(metro.list$met2013),12060),myf)  


#use tweenr to interploate
tf <- tween_states(my.list2,tweenlength= 3,
                   statelength=2, ease=rep('cubic-in-out',2),nframes=200)
tf<-data.table(tf) #convert output into data table

#Animate plot
oopt = ani.options(interval = 0.2)
saveGIF({for (i in 1:max(tf$.frame)) { #loop over frames
  g<-
    ggplot(data=tf[.frame==i & pti<= 20 & hhincome>10000 & valueh > 5000 ],
           aes(y=factor(year),
               x=pti,color=pti))+
    geom_quasirandom(alpha=0.5,size=0.75)+
    theme_minimal()+
    scale_color_viridis(name="House value to\nhousehold income\nratio",
                        discrete=F,option="D",
                        limits=c(0,20),
                        end=0.95,direction=-1  #,
                        ) +
    #scale_x_log10(limits=c(10000,1.4e6),breaks=c(10000,100000,1000000),   
    #  labels=c("$10k","$100k","$1,000k") )+
    scale_x_continuous(limits=c(0,20),breaks=seq(0,20,2.5))+
    theme(plot.title=element_text(size=14))+
    theme(plot.caption=element_text(hjust=0,vjust=1,margin=margin(t=10)))+
    theme(plot.margin=unit(c(0.25,0.25,0.25,0.25),"cm"))+
    theme(legend.position = "top")+
    labs(y="",x="House value to household income ratio",
         caption="@lenkiefer Source: Census 1-year American Community Survey (2010 & 2015),\nIPUMS-USA, University of Minnesota, www.ipums.org.\nTo avoid extreme overplotting, 2000 observations sampled at random (using weights),\nonly includes homeowner households with > $10,000 income and an estimated house value > $5,000 & cases where ratio <= 20",
         title="House value to income ratio distribution by Metro",
         subtitle=head(tf[.frame==i,],1)$cbsa.name)+
    theme(axis.text.x = element_text(size=8))+  #coord_flip()+
    #facet_wrap(~year)+
    theme(strip.text.x = element_text(size = 6))
  print(g)
  ani.pause()
  print(i)}
},movie.name="tween acs value 04 16 2017 v2.gif",ani.width = 750, ani.height = 400)

Run it and you get:

metro pti ratios

Of kernels and beeswarms: Comparing the distribution of house values to household income

Load data

Share!