Bootstrapping is a resampling technique used in statistics to estimate the distribution of a statistic (like the mean, median, or variance) by repeatedly sampling from a dataset with replacement. The term “bootstrap” comes from the idea of “pulling oneself up by one’s bootstraps,” as the method relies only on the data at hand, without requiring additional data or strong distributional assumptions.
Bootstrapping begins with an original dataset of size ( n ). From this, multiple “bootstrap samples” are created by randomly sampling ( n ) data points with replacement. Each bootstrap sample is used to compute the desired statistic. By repeating this process many times (e.g., 1,000 iterations), a distribution of the statistic is generated, allowing for estimates of variability and confidence intervals.
Tip
For example, if the original dataset is ([A, B, C, D, E]), a bootstrap sample might look like ([A, C, C, E, D]), containing duplicates and possibly excluding some original values.
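A minimal Python sketch of drawing one bootstrap sample from the five-item dataset above. The helper name `bootstrap_sample` and the fixed seed are illustrative assumptions, not part of the original text.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


# Fixed seed (an arbitrary choice) so that the run is repeatable.
rng = random.Random(0)
original = ["A", "B", "C", "D", "E"]

sample = bootstrap_sample(original, rng)
# The sample has the same size as the original; duplicates and omissions are possible.
print(sample)
```

Because the sampling is done with replacement, each position in the sample is drawn independently from the full dataset, which is why repeated values can appear.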
Applications of Bootstrapping
Bootstrapping is widely used in:
- Estimating confidence intervals for statistics like the mean, median, or regression coefficients.
- Hypothesis testing by approximating the sampling distribution of a test statistic.
- Regression analysis to assess the stability of coefficients.
- Machine learning techniques like bagging (e.g., Random Forest).
While versatile, bootstrapping has its constraints. It can be computationally expensive due to repeated sampling, and its results depend on the quality and representativeness of the data. Small datasets may not capture the full variability of the population, limiting the method’s effectiveness.
Estimating Confidence Intervals via Bootstrapping
In classical statistics, the standard error of the mean (SE) is calculated as:
( SE = \frac{s}{\sqrt{n}} )
where ( s ) is the sample standard deviation and ( n ) is the sample size.
This formula assumes that the sampling distribution of the mean is approximately normal and that the sample standard deviation is a good estimate of the population standard deviation. If these assumptions hold, the 95% confidence interval for the mean is:
( \bar{x} \pm z \cdot SE )
where ( \bar{x} ) is the sample mean and ( z ) is the critical value from the standard normal distribution (e.g., ( z = 1.96 ) for 95% confidence).
Bootstrapping offers an alternative that does not rely on these assumptions: we draw many bootstrap samples, compute the mean of each one, and collect these means into the bootstrap distribution of the mean. We then sort the means and read off the percentiles corresponding to the desired confidence level.
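For comparison, the classical calculation can be sketched in a few lines of Python. This is only an illustration: the function name `normal_ci_for_mean` is an assumption, and the data are the five values used in the worked example later in this article.

```python
import math
import statistics


def normal_ci_for_mean(data, z=1.96):
    """Return (mean, standard error, (lower, upper)) using the normal approximation."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 in the denominator)
    se = s / math.sqrt(n)       # standard error of the mean
    return mean, se, (mean - z * se, mean + z * se)


scores = [75, 80, 85, 90, 95]
mean, se, (lower, upper) = normal_ci_for_mean(scores)
print(f"mean={mean}, SE={se:.3f}, 95% CI=({lower:.2f}, {upper:.2f})")
# prints: mean=85, SE=3.536, 95% CI=(78.07, 91.93)
```

The sketch uses ( z = 1.96 ) as in the text; the interval is only as trustworthy as the normality assumption behind it.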
Tip
The choice of ( N ), the number of bootstrap samples, depends on computational resources and the level of precision you need. A common default is 1000, which is often sufficiently stable; 5000 samples provide higher precision, while beyond 10000 there are diminishing returns in most cases.
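One rough way to see the effect of ( N ) is to repeat the whole bootstrap several times and check how much a percentile bound moves between runs. The sketch below does this for the lower 95% bound, using the five-value dataset from the example that follows; the helper name, the 20 repetitions, and the seeds are all arbitrary assumptions made for illustration.

```python
import random
import statistics


def percentile_lower_bound(data, n_resamples, rng):
    """Lower bound of a 95% percentile interval for the mean from one bootstrap run."""
    n = len(data)
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_resamples))
    position = max(1, round(0.025 * n_resamples))  # 1-based rank of the 2.5th percentile
    return means[position - 1]


scores = [75, 80, 85, 90, 95]
spread = {}
for n_resamples in (100, 1000, 5000):
    # Repeat the full bootstrap 20 times with different seeds.
    bounds = [
        percentile_lower_bound(scores, n_resamples, random.Random(seed))
        for seed in range(20)
    ]
    # Standard deviation of the bound across repetitions: smaller means more stable.
    spread[n_resamples] = statistics.pstdev(bounds)
    print(n_resamples, spread[n_resamples])
```

A spread close to zero means that rerunning the bootstrap would give essentially the same bound, so adding more resamples buys little extra precision.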
Example with Steps
Dataset:
([75, 80, 85, 90, 95]), ( N = 1000 ) bootstrap samples.
Step-by-Step Process:
- Resample ( N = 1000 ) times:
  - Example bootstrap samples:
    - Sample 1: ([75, 75, 90, 85, 95]), Mean = 84.
    - Sample 2: ([80, 90, 85, 80, 95]), Mean = 86.
    - Sample 3: ([75, 85, 85, 95, 90]), Mean = 86.
    - … Continue for 1000 samples.
- Calculate Means:
  - After 1000 bootstrap samples, you have a distribution of means, e.g., ([84, 86, 86, \dots]). Because every value in this dataset is a multiple of 5 and ( n = 5 ), each bootstrap mean is a whole number.
- Sort the Means:
  - Order the bootstrap means from smallest to largest.
- Compute Confidence Interval:
  - For a 95% confidence interval:
    - Lower bound: 2.5th percentile (mean at position ( 0.025 \cdot 1000 = 25 )).
    - Upper bound: 97.5th percentile (mean at position ( 0.975 \cdot 1000 = 975 )).
    - For this dataset these values typically come out at about ( [79, 91] ); the exact bounds can shift by a unit or so between runs because the resampling is random.
Final Result:
- Point Estimate: Mean = 85.
- 95% Confidence Interval: approximately ( [79, 91] ).
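The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names `bootstrap_means` and `percentile_ci` and the seed are hypothetical, and the printed bounds depend on the seed.

```python
import random
import statistics


def bootstrap_means(data, n_resamples, rng):
    """Mean of each of n_resamples bootstrap samples drawn from data."""
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_ci(means, confidence=0.95):
    """Percentile interval: read the bounds off the sorted bootstrap means."""
    ordered = sorted(means)
    n = len(ordered)
    alpha = 1.0 - confidence
    lower_pos = max(1, round((alpha / 2) * n))      # e.g. position 25 of 1000
    upper_pos = min(n, round((1 - alpha / 2) * n))  # e.g. position 975 of 1000
    return ordered[lower_pos - 1], ordered[upper_pos - 1]


scores = [75, 80, 85, 90, 95]
rng = random.Random(42)  # arbitrary seed, used only to make the run repeatable

means = bootstrap_means(scores, 1000, rng)
lower, upper = percentile_ci(means)

print("Point estimate:", statistics.mean(scores))
print("95% percentile interval:", (lower, upper))
```

Positions are treated as 1-based ranks in the sorted list, matching the "position 25" and "position 975" wording in the steps; other percentile conventions exist and can differ slightly at the boundaries.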
Bootstrapping generates a data-driven approximation of the sampling distribution of the mean. Unlike the standard error approach, it doesn’t rely on assumptions like normality, which makes it more robust for real-world data, provided the sample is representative of the population.