(RL) Experiments That Matter: Bootstrapping, p-values, & Power
How to create statistically significant RL experiments.
Before I learned about RL, my impression was that everything was cherry-picked. This is still occasionally the case, but many papers now report significance metrics to guide interpretation. Central to interpretable, statistically significant experiments are the concepts of bootstrapping, p-values, and power.
Motivating the Bootstrap
We’ll use a toy example throughout this post. Let’s say we’re trying to estimate the average height of a population. We randomly sample $n = 1000$ individuals and measure their heights. Then, we calculate the average height of the sample, $\bar{X}$. How do we know if $\bar{X}$ is a good estimate of the population average $\mu$?
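A minimal sketch of this setup, assuming an illustrative population where heights are normally distributed with mean 170 cm and standard deviation 10 cm (these numbers are hypothetical, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: heights ~ Normal(170 cm, 10 cm).
mu, sigma = 170.0, 10.0

# Randomly sample n = 1000 individuals and measure their heights.
sample = rng.normal(mu, sigma, size=1000)

# The sample mean is our estimate of the population mean mu.
x_bar = sample.mean()
print(f"sample mean: {x_bar:.2f} cm")
```

With 1000 draws, the sample mean lands very close to 170 cm, but a single number tells us nothing about how far off it might be; that is the question the rest of the post addresses.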
The Central Limit Theorem
The Central Limit Theorem tells us that if we have a sequence of i.i.d. random variables $X_1, \dots, X_n$ drawn from a distribution with finite mean $\mu$ and variance $\sigma^2$, and $n$ is sufficiently large, then the distribution of the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is approximately Gaussian with mean $\mu$ and variance $\sigma^2 / n$.
Luckily, we can quantify how confident we are in our estimate of $\mu$ because we know the distribution of $\bar{X}$ is Gaussian. Its standard deviation (the standard error) is simply the square root of its variance: $\sigma / \sqrt{n}$.
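We can check this empirically: draw many independent samples of size $n$, compute each sample's mean, and compare the spread of those means to the CLT prediction $\sigma / \sqrt{n}$. This is a sketch under the same illustrative population parameters as before:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 170.0, 10.0, 1000

# Draw 10,000 independent samples of size n and compute each
# sample's mean. The CLT says these means are approximately
# Gaussian with mean mu and standard deviation sigma / sqrt(n).
means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

empirical_se = means.std()
predicted_se = sigma / np.sqrt(n)
print(f"empirical std of sample means: {empirical_se:.3f}")
print(f"CLT prediction sigma/sqrt(n):  {predicted_se:.3f}")
```

Both numbers come out near $10 / \sqrt{1000} \approx 0.316$ cm, so the uncertainty in a single sample mean is well described by the standard error.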
Pretending the Central Limit Theorem Doesn’t Exist
What if we didn’t have the Central Limit Theorem? It seems silly to pretend such a powerful theorem doesn’t exist, but we also could have chosen a statistic, such as the median, whose sampling distribution isn’t Gaussian. In either case, we turn to the bootstrap.
The Bootstrap
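The bootstrap's core idea is to resample the observed sample, with replacement, and recompute the statistic on each resample; the spread of those recomputed values approximates the sampling distribution of the statistic. A minimal sketch, again using the illustrative height data from earlier (the sample data and the number of resamples are assumptions for demonstration):

```python
import numpy as np

rng = np.random.default_rng(2)

# One observed sample of 1000 heights (illustrative, in cm).
sample = rng.normal(170.0, 10.0, size=1000)

# Bootstrap: draw B resamples of the same size, with replacement,
# from the observed sample, and recompute the statistic each time.
B = 5000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])

# The spread of the bootstrap statistics estimates the standard
# error; percentiles give a 95% confidence interval.
se = boot_means.std()
lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(f"bootstrap SE: {se:.3f}")
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```

Note that nothing here relies on the statistic being the mean: replacing `.mean()` with `np.median` or any other function gives an uncertainty estimate for that statistic, which is exactly what we need when the CLT doesn't apply.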
Sources
- Chris Piech’s CS109 lecture video and notes