Bootstrap aggregation (or bagging) is an ensemble method that combines the results of many individual models. A popular bagging algorithm is Random Forest.
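As a quick illustration, here is a minimal sketch of training a Random Forest with scikit-learn; the dataset and parameter values are arbitrary placeholders, not a recommended setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree in the forest is trained on a bootstrap sample of the training set
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```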
Process
- Repeatedly draw random samples with replacement from the training set to create many bootstrap samples, one for each base learner (an individual model)
- Train each base learner on its corresponding bootstrap sample. This is done in parallel, rather than sequentially as in boosting.
- After training, aggregate the predictions from all base learners (see the sketch after this list):
  - regression: average their results
  - classification: take the majority vote
- Tune hyperparameters (e.g., the number of base learners and the size of each bootstrap sample)
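Below is a minimal from-scratch sketch of the sampling and aggregation steps for classification, assuming scikit-learn decision trees as the base learners and NumPy for the bootstrap sampling; the function names are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, random_state=0):
    """Train one decision tree per bootstrap sample."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    learners = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw n row indices with replacement
        idx = rng.integers(0, n, size=n)
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Classification: aggregate by majority vote (for regression, average instead)."""
    votes = np.array([tree.predict(X) for tree in learners])  # shape: (n_estimators, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```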
Usage
- Reduces variance: standalone models can have high variance. Aggregating the base models' predictions in an ensemble helps reduce it.
- Fast: Training can happen in parallel across CPU cores and even across different servers.
- Good for big data: bagging doesn't require the entire training dataset to be held in memory during training. We can set the sample size for each bootstrap sample to a fraction of the overall data, train a base learner on each, and combine the base learners without ever loading the full dataset at once (see the sketch after this list).
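To make the last two points concrete, here is a hedged sketch using scikit-learn's BaggingClassifier, where max_samples controls the fraction of data each base learner sees and n_jobs spreads training across CPU cores; the dataset and parameter values are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# max_samples=0.1: each base learner trains on a bootstrap sample covering
# only 10% of the data; n_jobs=-1: train base learners in parallel on all cores
bag = BaggingClassifier(n_estimators=50, max_samples=0.1, n_jobs=-1, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```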