Statistics

Beige-ian Statistics

Let’s pick up where we left off yesterday and do some more exploration with text mining. Like yesterday we’ll use the tidytext package for R. And we’ll lean heavily on Julia Silge and David Robinson’s Text Mining with R. For data, we’ll turn again to the Federal Reserve. But today we’ll explore the Beige Book, which gathers anecdotal information on current economic conditions across the Federal Reserve Districts.
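
Here’s a minimal tidytext sketch of the tokenize-and-count workflow the post describes. The tiny `beige_book` tibble below is a made-up stand-in for the Beige Book text the post actually collects; only the pattern matters.

```r
# A minimal tidytext sketch. `beige_book` is a made-up stand-in for the
# Beige Book text collected in the post; only the tokenize-and-count pattern matters.
library(dplyr)
library(tidytext)

beige_book <- tibble::tibble(
  report = c("District A", "District B"),
  text   = c("Economic activity expanded at a modest pace.",
             "Labor markets remained tight across the district.")
)

beige_book %>%
  unnest_tokens(word, text) %>%          # one row per word
  anti_join(stop_words, by = "word") %>% # drop common stop words
  count(report, word, sort = TRUE)
```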

Maybe the Linear Probability Model isn't all bad

The Linear Probability Model (LPM) might be bad, but is it all bad? Let’s look at some conditions where the LPM might not be so bad, along with some simple adjustments that might improve its performance. We’ll also compare the LPM to some common alternatives. To set things up, throughout most of this post we’re going to consider a world where the LPM is the true model. That is:
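
As I read that setup, it means a data-generating process in which the conditional probability of the outcome really is linear in the covariate. Here’s a rough simulation sketch of such a world; the particular coefficients are placeholders of my own, not the post’s.

```r
# A sketch of a world where the LPM is literally true: P(y = 1 | x) is linear in x.
# Coefficients are placeholders, not the post's.
set.seed(1234)
n <- 10000
x <- runif(n)              # keep x in [0, 1] so the linear probabilities stay in [0, 1]
p <- 0.1 + 0.7 * x         # true conditional probability, linear in x
y <- rbinom(n, 1, p)

lpm <- lm(y ~ x)           # OLS should recover roughly (0.1, 0.7)
coef(lpm)
```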

How bad is a Linear Probability Model?

I think a lot about predicting/forecasting binary outcomes. Will the economy head into a recession next year? What’s the likelihood of a loan defaulting over the next few years? Will my followers on social media abandon me if I tweet about my lunch? One often-maligned but seemingly irresistible approach to modeling binary outcomes is the Linear Probability Model (LPM). As has been known since before I was born, the Linear Probability Model has some issues.
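
For a quick illustration of the classic complaints, here’s a small simulation of my own (not the post’s example): fit an LPM with lm() and a logit with glm(), and note that the LPM’s fitted “probabilities” can escape the unit interval.

```r
# Simulated illustration of the usual complaints about the LPM:
# fitted "probabilities" can leave [0, 1], unlike a logit fit.
set.seed(5678)
n <- 5000
x <- rnorm(n, mean = 0, sd = 2)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))          # data generated from a logit

lpm   <- lm(y ~ x)                               # linear probability model
logit <- glm(y ~ x, family = binomial("logit"))  # common alternative

range(fitted(lpm))     # typically strays below 0 and above 1
range(fitted(logit))   # always inside (0, 1)
```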

Forecasting and deciding binary outcomes under asymmetric information

LAST WEEK IN THE WALL STREET JOURNAL an article talked about how pundits can strategically make probabilistic forecasts. It seems 40% is a sort of magic number: it’s high enough that if the event comes true you can claim credit as a forecaster, but if it doesn’t happen, you still gave it less than 50/50 odds. Since I’m often asked to make forecasts, I’m interested in this problem. Under what conditions is a 40 percent probability an optimal forecast?
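
As a toy version of that question, here’s a sketch with a stylized payoff function I made up purely for illustration: the pundit gets credit proportional to the stated probability q when the event happens, and avoids blame only if q stayed below 50% when it doesn’t. This is not the post’s model, just one way to see why something like 40% can look attractive.

```r
# Toy, made-up payoff: credit proportional to the stated probability q if the
# event happens; blame avoided only if q stayed under 50% when it doesn't.
expected_payoff <- function(q, p_true) {
  p_true * q + (1 - p_true) * as.numeric(q < 0.5)
}

q_grid <- seq(0, 0.99, by = 0.01)
plot(q_grid, expected_payoff(q_grid, p_true = 0.4), type = "l",
     xlab = "stated probability q", ylab = "expected payoff")
# Under this stylized payoff the maximum sits just below 0.5 rather than at the
# honest probability, which is the flavor of the 40% story.
```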

A note on competing risks

WE ARE LATE FOR HALLOWEEN, but let’s get out our broom and purrr as we tidy some statistical results. Today I had occasion to be reminded of competing risks and a handy statistical result from A.P. Basu and J.K. Ghosh, published in the Journal of Multivariate Analysis in 1978. The paper, Identifiability of the multinormal and other distributions under competing risks model, showed an analytical result on the distribution of a variable Z that is the minimum of two Gaussian (Normal) random variables.
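
A quick simulation can make the object in that result concrete: Z below is the minimum of two normal random variables (taken independent here just for simplicity). The parameter values are arbitrary, chosen only to draw a picture.

```r
# Simulate Z = min(X, Y) for two normal random variables (independent here just
# for simplicity; parameter values are arbitrary).
set.seed(1978)
n <- 100000
x <- rnorm(n, mean = 5, sd = 1)
y <- rnorm(n, mean = 6, sd = 2)
z <- pmin(x, y)    # the observed minimum under competing risks

plot(density(z), main = "Z = min(X, Y) and its components")
lines(density(x), lty = 2)
lines(density(y), lty = 3)
```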

Dynamic Model Averaging Presentation Slides

I PUT TOGETHER SOME SLIDES SUMMARIZING our recent work on dynamic model averaging. See here and here for more background. See below for the slides, or click here for a fullscreen version. Making the preso: let me also share the R code I used to generate these slides. The code below is the R Markdown I used (saved as .txt). The slides were put together using the xaringan package.
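
The R Markdown itself lives in the post; as a hedged sketch of the rendering step only, something like the following works once that file is saved as an .Rmd. The filename slides.Rmd is hypothetical.

```r
# Hedged sketch of the rendering step only; "slides.Rmd" is a hypothetical filename
# standing in for the Rmarkdown shared in the post.
# install.packages("xaringan")      # if not already installed
rmarkdown::render("slides.Rmd")     # knit the Rmd into HTML slides
# xaringan::inf_mr("slides.Rmd")    # or: live-preview the slides while editing
```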

A closer look at forecasting recessions with dynamic model averaging

BACK WE GO INTO THE VASTY DEEP. LAST TIME we introduced the idea of using dynamic model averaging to forecast recessions. I was so excited about the new approach that I didn’t take the time to break down what was going on with it. In this post we’ll look more closely at what’s happening with the dma package when we try to forecast recessions. Per usual we’ll do it with R and I’ll include code so you can follow along.
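
For orientation, here’s roughly what a dma-package call looks like for a binary outcome, using simulated stand-in data rather than the recession data from the post. The argument names follow my reading of the documentation for logistic.dma() and should be double-checked.

```r
# Rough sketch of a dma call for a binary outcome with simulated stand-in data.
# Argument names follow my reading of ?logistic.dma and may need adjusting.
library(dma)

set.seed(20170414)
n <- 200
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))       # candidate predictors
y <- rbinom(n, 1, plogis(0.75 * x[, 1]))       # binary outcome (think: recession = 1)

# Each row flags which predictors enter one candidate model
models.which <- rbind(c(1, 0), c(0, 1), c(1, 1))

fit <- logistic.dma(x, y, models.which,
                    lambda = 0.99,  # forgetting factor for coefficients
                    alpha  = 0.99)  # forgetting factor for model probabilities

str(fit)  # inspect posterior model probabilities and averaged predictions
```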

Forecasting recessions with dynamic model averaging

HERE THE LITERATURE IS VASTY DEEP. In this post we’ll dip our toes, ever so slightly, into the dark waters of macroeconometric forecasting. I’ve been studying some techniques and want to try them out. I’m still at the learning and exploring stage, but let’s do it together. We’ll conduct an exercise in forecasting U.S. recessions using several approaches. Per usual we’ll do it with R and I’ll include code so you can follow along.
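
As one simple baseline of the kind such exercises compare against, here’s a sketch of a logit (and LPM) of a future recession indicator on the yield-curve slope. The FRED mnemonics USREC, GS10, and TB3MS are standard series identifiers, but the choice of predictor and the 12-month horizon are my illustrative assumptions, not necessarily the post’s.

```r
# Baseline sketch: logit (and LPM) of a recession indicator 12 months ahead on the
# yield-curve slope. FRED mnemonics are real; the predictor and horizon are
# illustrative assumptions.
library(quantmod)

getSymbols(c("USREC", "GS10", "TB3MS"), src = "FRED")  # recession dummy, 10y and 3m yields
df <- merge(USREC, GS10 - TB3MS, join = "inner")
colnames(df) <- c("recession", "spread")
df$rec_lead12 <- lag(df$recession, k = -12)            # recession indicator 12 months ahead

dat <- na.omit(as.data.frame(df))
fit_logit <- glm(rec_lead12 ~ spread, family = binomial("logit"), data = dat)
fit_lpm   <- lm(rec_lead12 ~ spread, data = dat)
summary(fit_logit)
```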

Of kernels and beeswarms: Comparing the distribution of house values to household income

BACK IN JANUARY WE LOOKED AT HOUSING microdata from the American Community Survey Public Use Microdata Sample that we collected from IPUMS. Let’s pick back up and look at these data some more. Glad you could join us. Be sure to check out my earlier post for more discussion of the underlying data. Here we’ll pick up where we left off and make some more graphs using R. Just a quick reminder (read the earlier post for all the details): we have a dataset of household-level observations for the 20 largest metro areas in the United States for 2010 and 2015 (the latest data available).
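
To show the two plot types in the title side by side, here’s a small sketch on simulated data (not the ACS/IPUMS microdata from the post), assuming the ggbeeswarm package for the beeswarm-style layer.

```r
# Kernel density vs. beeswarm-style view of the same (simulated) distributions.
# Not the ACS/IPUMS data from the post; ggbeeswarm supplies geom_quasirandom().
library(ggplot2)
library(ggbeeswarm)

set.seed(2017)
df <- data.frame(metro = rep(c("Metro A", "Metro B"), each = 500),
                 value = c(rlnorm(500, 12.2, 0.5), rlnorm(500, 12.6, 0.6)))

# Kernel density estimates by group
ggplot(df, aes(x = value, fill = metro)) + geom_density(alpha = 0.5)

# Beeswarm-style view of the same values
ggplot(df, aes(x = metro, y = value, color = metro)) + geom_quasirandom(alpha = 0.4)
```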

Resampling

THIS PAST MONTH HAS BEEN BUSY. People have been traveling, I’ve been traveling, kids have been sick, and March Madness basketball has been keeping me occupied. Today I just wanted to explore a little analysis I’ve put together on resampling. Because reasons, I’ve recently been interested in sample sizes and how quickly certain estimates might converge. There is, of course, a vast literature on this topic. But armed with powerful computers, maybe we can avoid too much mathy work and try to simulate our way through some problems.
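
Here’s a minimal sketch of the resampling idea: draw repeated samples of increasing size from a made-up “population” and watch how much the sample mean bounces around. All the specifics below are illustrative choices of my own.

```r
# Resampling sketch: how does the sample mean behave as the sample size grows?
# The gamma "population" and the sizes are arbitrary choices for illustration.
set.seed(314)
population <- rgamma(1e6, shape = 2, scale = 10)   # a skewed population
sizes <- c(10, 50, 100, 500, 1000, 5000)

results <- sapply(sizes, function(n) {
  means <- replicate(500, mean(sample(population, n, replace = TRUE)))
  c(n = n, avg_of_means = mean(means), sd_of_means = sd(means))
})
t(results)   # the spread of the resampled means shrinks roughly like 1/sqrt(n)
```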