
Random Forest is an ensemble method that constructs multiple decision trees during training to reduce overfitting

It is an extension of bagging. In addition to creating a bootstrap sample (i.e. a subset of the training data sampled with replacement) for each tree, Random Forest also chooses a random subset of features, usually around the square root of the total number of features for classification (or about one third of them for regression)

This means that each decision tree in a Random Forest is trained on a different set of features, and no single tree sees all the data (each individual tree is deliberately a weaker model on its own). Taken together, the trees reduce overfitting and increase the stability of predictions
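
To make the two sampling steps concrete, here is a minimal NumPy sketch (toy data, not from the original notes) that draws one bootstrap sample and one random feature subset, i.e. the slice of the data a single tree in the forest would be trained on:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 16))          # toy data: 100 rows, 16 features
n_samples, n_features = X.shape

# Bootstrap sample: draw row indices with replacement, same size as the original set
row_idx = rng.integers(0, n_samples, size=n_samples)

# Random feature subset: here sqrt(p) features, the common rule of thumb for classification
n_subset = int(np.sqrt(n_features))
col_idx = rng.choice(n_features, size=n_subset, replace=False)

X_tree = X[row_idx][:, col_idx]         # the view of the data one tree would actually see
print(X_tree.shape)                     # (100, 4)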

Process

Similar to bagging, with step 2 added.

  1. Repeatedly draw random samples with replacement from the training set to create many bootstrap samples, one per tree
  2. Choose a random subset of features for each tree, usually around the square root of the total number of features for classification (or about one third of them for regression)
  3. Train a base learner (a decision tree) on each bootstrap sample
  4. After training, aggregate the predictions from all the trees
    1. regression: averaging their results
    2. classification: taking the majority of votes
  5. Tune hyperparameters (refer to more hyperparameters in decision tree). Some unique ones for Random Forest are max_features, used to set the size of the random feature subset, and n_estimators, used to choose the number of trees in the forest (a tuning sketch follows this list).
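
As a sketch of step 5, the snippet below tunes max_features and n_estimators with a grid search. The data comes from make_classification and only stands in for a real training set, so treat the parameter values as placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a real training set
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    'n_estimators': [100, 200, 500],        # number of trees in the forest
    'max_features': ['sqrt', 'log2', 0.5],  # size of the random feature subset
}

grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)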

Usage

  • Reduces overfitting compared to a single decision tree (see the comparison sketch after this list)
  • Faster training per tree (since each tree only uses a subset of the data and features), and trees can be trained independently
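
As a rough illustration of the overfitting point, this sketch (synthetic data and a hypothetical train/test split, not part of the original notes) compares a single unpruned decision tree with a Random Forest:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# The single tree typically scores perfectly on the training set but noticeably lower on the test set;
# the forest usually closes much of that gap on the test side.
print('Tree   train/test:', tree.score(X_train, y_train), tree.score(X_test, y_test))
print('Forest train/test:', forest.score(X_train, y_train), forest.score(X_test, y_test))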

Code for Spam Mail Classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
 
# Instantiate a RandomForestClassifier with 200 trees (n_estimators) and everything else as default values
rf_mod = RandomForestClassifier(n_estimators=200)
 
# training_data / testing_data are the vectorized emails and y_train / y_test the spam labels,
# assumed to come from an earlier preprocessing step
rf_mod.fit(training_data, y_train)
rf_preds = rf_mod.predict(testing_data)
 
print('Accuracy score: ', accuracy_score(y_test, rf_preds))
print('Precision score: ', precision_score(y_test, rf_preds))
print('Recall score: ', recall_score(y_test, rf_preds))
print('F1 score: ', f1_score(y_test, rf_preds))