ML Spec, mlcourse.ai

Bootstrap aggregation (or bagging) is an ensemble method that averages the results of individual models. A popular bagging-based algorithm is Random Forest.
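For instance, a minimal scikit-learn sketch of a Random Forest (the synthetic dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree in the forest is trained on its own bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```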

Process

  1. Repeatedly draw random samples with replacement from the training set to create many bootstrap samples, one for each base learner (an individual model)
  2. Train the base learners on their corresponding samples. This happens in parallel, rather than sequentially as in boosting.
  3. After training, aggregate the base learners’ predictions (see the sketch after this list) by
    1. regression: averaging their results
    2. classification: taking the majority of votes
  4. Tune hyperparameters (e.g., the number of base learners and the sample size for each bootstrap)
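The process above (steps 1–3, regression case) can be sketched as follows; the data, number of learners, and tree depth are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

n_learners = 50
learners = []
for _ in range(n_learners):
    # Step 1: draw a bootstrap sample (with replacement) from the training set.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: train a base learner on its bootstrap sample.
    tree = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
    learners.append(tree)

# Step 3 (regression): average the predictions of all base learners.
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = np.mean([tree.predict(X_new) for tree in learners], axis=0)
print(y_pred)
```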

Usage

  • Reduces variance: A standalone model can have high variance; aggregating base models’ predictions in an ensemble helps reduce it.
  • Fast: Training can happen in parallel across CPU cores and even across different servers.
  • Good for big data: Bagging doesn’t require the entire training dataset to be held in memory during model training. We can set the sample size for each bootstrap to a fraction of the overall data, train a base learner on it, and combine these base learners without ever reading in the entire dataset at once (see the sketch below).
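A sketch of the last two points with scikit-learn's BaggingClassifier: n_jobs=-1 trains base learners in parallel across CPU cores, and max_samples=0.1 trains each learner on only 10% of the data. The dataset and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=10_000, n_features=30, random_state=0)

bagging = BaggingClassifier(
    n_estimators=200,
    max_samples=0.1,  # each bootstrap sample uses a fraction of the data
    n_jobs=-1,        # train base learners in parallel across CPU cores
    random_state=0,
)
bagging.fit(X, y)
print(bagging.score(X, y))
```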