29 March 2017

THIS PAST MONTH HAS BEEN BUSY. People have been traveling, I’ve been traveling, kids have been sick, and March Madness basketball has kept me occupied. Today I want to explore a little analysis I’ve put together on resampling.

Because reasons, I’ve recently been interested in sample sizes and how quickly certain estimates converge.

There is, of course, a vast literature on this topic. But armed with powerful computers, maybe we can avoid too much mathy work and simulate our way through some problems.

## Setup

For this exercise I want to keep things simple. Let’s imagine that we have a sample of independent and identically distributed (i.i.d.) draws from a standard Normal distribution. We’ll assume that our original sample has 100 observations and that we’re interested in the properties of the sample mean.

Per usual, we’ll use R to do our analysis. And because we’ll be making up our data, we won’t need to worry about importing anything. Usually I use the data.table package, but today I’m going to try to use the tidyverse.

To stay sane, after loading our libraries we’ll start by setting the seed and drawing 100 observations:
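Here is a minimal sketch of that step. The seed value and the names `N` and `df` are placeholders picked for illustration; any fixed seed gives a reproducible draw.

```r
library(tidyverse)

# Arbitrary seed, chosen only so the draw is reproducible
set.seed(20170329)

N  <- 100                     # original sample size
df <- tibble(x = rnorm(N))    # N i.i.d. standard Normal draws

# Sample mean and standard deviation of the draw
summarise(df, mean = mean(x), sd = sd(x))
```

With only 100 observations, the sample mean and standard deviation will land near 0 and 1 but not exactly on them.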

And let’s look at the data:
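One way to do that is a quick histogram. This is a sketch rather than a polished figure: the bin count, colours, and the name `p_hist` are illustrative choices, and the data are regenerated so the snippet runs on its own.

```r
library(tidyverse)

set.seed(20170329)            # arbitrary seed, for reproducibility only
df <- tibble(x = rnorm(100))  # 100 i.i.d. standard Normal draws

# Histogram of the draw, with a dashed line at the sample mean
p_hist <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 20, fill = "grey70", colour = "white") +
  geom_vline(xintercept = mean(df$x), linetype = "dashed") +
  labs(x = "x", y = "Count", title = "100 draws from a standard Normal")

p_hist
```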

Well okay. Now what are we going to do with it?

## Reducing the sample size

Let’s imagine that collecting these data was expensive, so we’re interested in knowing how well we could estimate the mean with a smaller sample of size n < N, where N is our original sample size (100).

Of course, if we knew the distribution of the data we could derive the variability of the sample mean analytically. For i.i.d. standard normal data, the standard deviation of the mean of n observations is $\frac{1}{\sqrt{n}}$.
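A quick sketch of where that comes from, assuming only independence and a common variance $\sigma^2$ (with $\sigma = 1$ for the standard normal):

$$
\mathrm{Var}(\bar{x}_n) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(x_i) = \frac{\sigma^2}{n},
\qquad
\mathrm{sd}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{n}}.
$$

So quadrupling the sample size halves the standard deviation of the mean: at n = 25 it is 0.2, and at n = 100 it is 0.1.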

But if we didn’t know the true underlying distribution, we might try to estimate that variability through resampling. The idea would be to draw random subsamples from our original sample and see how estimates of the mean vary across draws.

What we will do is draw 5,000 random subsamples from our original sample at each size n = 1, 2, …, 100. We’ll end up with 5,000 × 100 = 500,000 subsamples.
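A minimal sketch of that resampling step, assuming the draws are taken without replacement (the default for `sample()`) and stored in a list-column via `map()`. The names `resamples`, `rep_id`, and `draw` are illustrative, and the data are regenerated so the snippet is self-contained. It holds about 25 million numbers in memory, so lower `n_reps` for a quicker run.

```r
library(tidyverse)

set.seed(20170329)   # arbitrary seed, for reproducibility only
N <- 100
x <- rnorm(N)        # the original sample

n_reps <- 5000       # subsamples per subsample size

# One row per (size, replicate) pair; each row stores its own subsample
resamples <- crossing(n = 1:N, rep_id = 1:n_reps) %>%
  mutate(draw = map(n, ~ sample(x, size = .x)))   # without replacement

nrow(resamples)
```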

Okay, that was fun. We’ve got a giant set of resamples. Now we can use some dplyr to summarize the data.
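As a sketch of that summary step: compute the mean of each subsample, then group by subsample size and take the mean and standard deviation of those means. To keep this snippet quick and self-contained it regenerates the data and uses 1,000 replicates per size rather than 5,000; the column names are placeholders.

```r
library(tidyverse)

set.seed(20170329)   # arbitrary seed, for reproducibility only
N <- 100
x <- rnorm(N)
n_reps <- 1000       # fewer replicates than in the text, for speed

# Mean of each subsample (drawn without replacement)
resample_means <- crossing(n = 1:N, rep_id = 1:n_reps) %>%
  mutate(xbar = map_dbl(n, ~ mean(sample(x, size = .x))))

# Spread of the subsample means at each subsample size
resample_summary <- resample_means %>%
  group_by(n) %>%
  summarise(mean_xbar = mean(xbar), sd_xbar = sd(xbar))

resample_summary
```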

Now let’s compare the theoretical standard deviation to the estimates from our resamples.
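One way to draw that comparison is to plot the theoretical curve $\frac{1}{\sqrt{n}}$ as a line and the resampled standard deviations as points. This is a sketch under the same assumptions as before (sampling without replacement, 1,000 replicates per size for speed); `comparison` and `p_compare` are illustrative names.

```r
library(tidyverse)

set.seed(20170329)   # arbitrary seed, for reproducibility only
N <- 100
x <- rnorm(N)
n_reps <- 1000       # fewer replicates than in the text, for speed

comparison <- crossing(n = 1:N, rep_id = 1:n_reps) %>%
  mutate(xbar = map_dbl(n, ~ mean(sample(x, size = .x)))) %>%
  group_by(n) %>%
  summarise(sd_resample = sd(xbar)) %>%
  mutate(sd_theory = 1 / sqrt(n))   # standard Normal, so sigma = 1

p_compare <- ggplot(comparison, aes(x = n)) +
  geom_line(aes(y = sd_theory, colour = "Theoretical")) +
  geom_point(aes(y = sd_resample, colour = "Resampled"), size = 1) +
  labs(x = "Subsample size n", y = "Std. dev. of the mean", colour = NULL)

p_compare
```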

This shows that for smaller n the resampling approach approximates the theoretical standard deviation pretty well, but as n approaches N the dependence created by resampling without replacement makes the approximation worse. In the extreme case n = N, every subsample is just the full original sample, so the resampled means do not vary at all.
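One way to see why, as a sketch: treat the original N values as a finite population with sample variance $s^2$ (computed with the $N-1$ divisor; the notation here is mine). For a simple random sample of size n drawn without replacement from that population, the standard result is

$$
\mathrm{sd}\left(\bar{x}_n \mid x_1, \dots, x_N\right) = \frac{s}{\sqrt{n}} \sqrt{1 - \frac{n}{N}}.
$$

The second factor is the finite population correction. It is close to 1 when n is small relative to N, so the resampled standard deviation tracks $\frac{s}{\sqrt{n}}$, and it shrinks to 0 at n = N. At n = 50 with N = 100 it equals $\sqrt{0.5} \approx 0.71$, so the resampled standard deviation sits about 29% below $\frac{s}{\sqrt{n}}$.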

We might be able to see that better by plotting the distributions.

Create density plots over draws of varying sample sizes:
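A sketch of one such plot, overlaying the density of the subsample means for a handful of sizes. The sizes in `sizes`, the replicate count, and the object names are illustrative choices, and the data are regenerated so the snippet runs on its own.

```r
library(tidyverse)

set.seed(20170329)   # arbitrary seed, for reproducibility only
N <- 100
x <- rnorm(N)
n_reps <- 1000       # replicates per subsample size

sizes <- c(5, 25, 50, 75, 95)   # subsample sizes to compare

density_data <- crossing(n = sizes, rep_id = 1:n_reps) %>%
  mutate(xbar = map_dbl(n, ~ mean(sample(x, size = .x))))

# One density per subsample size; larger n should give a tighter curve
p_density <- ggplot(density_data, aes(x = xbar, fill = factor(n))) +
  geom_density(alpha = 0.5) +
  labs(x = "Subsample mean", y = "Density", fill = "n")

p_density
```

The densities should all be centred near the mean of the original sample and narrow as n grows, most sharply as n gets close to 100.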

## Okay, so what?

This post let us simulate some data and draw some plots. We also used dplyr to manipulate data and the map function to store data inside a data frame.

We might be able to use these techniques for more sophisticated analysis in the future.