ML Specialization (Lab)


A decision tree is built by a greedy algorithm that repeatedly splits the data, minimizing the error at each decision node. The most common form is a binary tree.

  • feature/variable: an input variable that a node can split on
  • target/class attribute: the variable the tree predicts (the goal of classification)
  • root node: the entire sample that gets split into two or more homogeneous sets.
  • decision nodes: nodes that split into further sub-nodes
  • leaves (terminal nodes): nodes that do not split
  • error: proportion of points misclassified
    • for example, in a region with 4 blue points and 2 red points, the region is classified as blue (the majority class, to maximize purity), so the 2 red points are misclassified → error = 2/6 ≈ 0.33
  • depth of a tree (vertical depth) = the length of the longest path from the root to a leaf (number of splits, NOT nodes)
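
The error definition above can be checked with a few lines of Python. This is a minimal sketch: the helper name `misclassification_error` is made up for illustration, and it assumes the region is labeled with its majority class.

```python
from collections import Counter

def misclassification_error(labels):
    """Fraction of points that differ from the region's majority class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(labels)

# The region from the example: 4 blue points and 2 red points.
region = ["blue"] * 4 + ["red"] * 2
error = misclassification_error(region)
print(round(error, 3))
```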


Decision trees work with structured (tabular) data, in both classification and regression tasks.


  • Fast
  • Easy to understand/interpret by humans
  • Useful in data exploration
  • Less pre-processing (e.g. no normalization or scaling needed)
  • Handle both quantitative (numerical) and qualitative (categorical) variables
  • Non-parametric: no assumptions about the data distribution & classifier structure



  1. Pre-processing
    1. Categorical/Qualitative features:
    2. Continuous variables: find the threshold at which to split the values
  2. Splitting: How to choose what feature to split on at each node?
  3. Tree pruning: When do we stop splitting?


Work with categorical features

Work with continuous valued features

(ML Spec) Say we want to make a binary split on a continuous feature over the training examples:

  1. Sort all training examples by the value of that feature
  2. Take the midpoints between consecutive values in the sorted list as candidate thresholds
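
The two steps above can be sketched in plain Python. The notes stop at listing the midpoints, so choosing the candidate with the highest information gain is an assumption here, and the data (`weights`, `is_cat`) and helper names are made up for illustration.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split."""
    # Step 1: sort the examples by the feature value.
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    # Step 2: midpoints between consecutive distinct sorted values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:]) if a != b]
    parent = entropy([y for _, y in pairs])
    best = (None, -1.0)
    for t in candidates:
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        w_left = len(left) / len(pairs)
        gain = parent - (w_left * entropy(left) + (1 - w_left) * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical data: a continuous feature and a 0/1 target.
weights = [7.2, 8.4, 9.2, 10.2, 11.0, 15.0, 18.0, 20.0]
is_cat = [1, 1, 1, 1, 0, 0, 0, 0]
threshold, gain = best_threshold(weights, is_cat)
print(threshold, gain)
```

With n distinct values there are at most n - 1 candidate thresholds, so the search is cheap once the values are sorted.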


The general goal is for each sub-node to contain mostly data points of a single class (so that they are correctly classified). In other words, we want to maximize the purity of each node.

There are several possible metrics

Information Gain

Transclude of entropy#formula
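
For reference, the standard binary entropy and information gain can be written as below. The notation is an assumption (the transcluded formula is not reproduced in these notes): p_1 is the fraction of positive examples in a node, and w^left, w^right are the fractions of the parent's examples sent to each child.

```latex
H(p_1) = -p_1 \log_2 p_1 - (1 - p_1) \log_2 (1 - p_1), \qquad 0 \log_2 0 := 0

\text{Information gain} = H\left(p_1^{\text{root}}\right)
  - \left( w^{\text{left}} \, H\left(p_1^{\text{left}}\right)
         + w^{\text{right}} \, H\left(p_1^{\text{right}}\right) \right)
```

For a node with 4 blue and 2 red points, p_1 = 4/6, so H(4/6) = -(2/3) log2(2/3) - (1/3) log2(1/3) ≈ 0.918 bits.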

Gini (Binary tree)

Transclude of gini-impurity
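
As a rough illustration, Gini impurity can be computed as 1 minus the sum of squared class proportions. The sketch below uses the standard definition, not necessarily the exact form of the transcluded note, and the helper names are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a binary split, weighted by the size of each child."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# A node with 4 blue and 2 red points: 1 - (4/6)^2 - (2/6)^2 = 4/9
node = ["blue"] * 4 + ["red"] * 2
g = gini_impurity(node)
# A split that separates the classes perfectly has zero impurity.
split_g = weighted_gini(["blue"] * 4, ["red"] * 2)
print(round(g, 3), split_g)
```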

Tree pruning

We should stop splitting when improvements in purity are insignificant, to avoid overfitting, by setting constraints for one or more of these hyperparameters:

  1. Minimum number of samples required in a node for it to be considered for splitting (min_samples_split)
  2. Minimum number of samples per leaf (min_samples_leaf)
  3. Maximum depth (max_depth)
  4. Maximum number of leaves (max_leaf_nodes)
  5. Maximum number of features to consider for a split (max_features): as a rule of thumb, use the square root of the total number of features, but it is worth checking values up to 30-40% of the total number of features.
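
A minimal sketch of how these constraints look in scikit-learn, comparing an unconstrained tree with a constrained one. The synthetic dataset and the specific values are assumptions picked for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# No constraints: the tree grows until every leaf is pure.
unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)

# The five constraints listed above, with illustrative values.
constrained = DecisionTreeClassifier(
    min_samples_split=20,
    min_samples_leaf=5,
    max_depth=4,
    max_leaf_nodes=10,
    max_features="sqrt",
    random_state=0,
).fit(X, y)

print("unconstrained:", unconstrained.get_depth(), unconstrained.get_n_leaves())
print("constrained:  ", constrained.get_depth(), constrained.get_n_leaves())
```

The constrained tree is guaranteed to stay within the depth and leaf limits; whether it generalizes better has to be checked on held-out data.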

Tree ensemble

Advantages of tree-based models

  • High accuracy
  • Require minimal pre-processing (e.g. scaling, normalizing)


Decision Tree sklearn GridSearch

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Read the data.
data = pd.read_csv('data.csv', header=None)
# Assign the features to the variable X, and the labels to the variable y. 
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
# Create and fit the model
model = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=5)
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
# Calculate the accuracy (on the training data, so this is an optimistic estimate)
acc = accuracy_score(y, y_pred)
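
The heading mentions a grid search, which the snippet above does not show. Below is a hedged sketch of one way to tune the same hyperparameters with `GridSearchCV`. The synthetic dataset, the train/test split, and the grid values are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for data.csv so the example runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid over the pruning hyperparameters.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# score() uses the best estimator, refit on the full training set.
test_acc = search.score(X_test, y_test)
print(best_params, round(test_acc, 3))
```

Scoring on a held-out test set, rather than on the data used for fitting, gives a less optimistic accuracy estimate.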