Structure
A decision tree is a greedy algorithm that recursively splits the data, minimizing the error at each decision node. The most common form is a binary tree.
- feature/variable: an input attribute the tree can split on
- target/class attribute: the variable the tree predicts (the goal of the classification)
- root node: the entire sample that gets split into two or more homogeneous sets.
- decision nodes: nodes that split into further sub-nodes
- leaves (terminal nodes): nodes that do not split
- error: proportion of points misclassified
- for example, a region with 4 blue points and 2 red points should be classified as blue (to maximize purity); the 2 red points are then misclassified → error = 2/6 ≈ 0.33 (see the sketch after this list)
- depth of a tree (vertical depth) = the length of the longest path from the root to a leaf (number of splits, NOT nodes)
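A minimal sketch of the majority-vote rule and the resulting misclassification error for the region in the example above (the colour labels are just the example's):

```python
from collections import Counter

# Labels of the points that fall in one region of a split
region = ["blue", "blue", "blue", "blue", "red", "red"]

# Classify the region by its majority class (maximizes purity)
majority_class, majority_count = Counter(region).most_common(1)[0]

# Error = proportion of points in the region that are misclassified
error = 1 - majority_count / len(region)
print(majority_class, error)  # blue 0.333...
```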
Usage
Decision trees work with structured (tabular) data, for both classification and regression tasks.
Strength
- Fast
- Easy to understand/interpret by humans
- Useful in data exploration
- Require less pre-processing (e.g. normalization, scaling)
- Handle both quantitative (numerical) and qualitative (categorical) variables
- Non-parametric: no assumptions about the data distribution & classifier structure
Weakness
- Computationally expensive relative to other algorithms like Naive Bayes
- Highly sensitive to small changes in the data ➡️ use an ensemble method
- Likely to overfit ➡️ tree pruning, ensemble trees such as Random Forest or XGBoost
Process
- Pre-processing
- Categorical/Qualitative features: encode them (label encoding or one-hot encoding)
- Continuous variables: generate candidate split thresholds within the feature's value range (see below)
- Splitting: How to choose what feature to split on at each node?
- Tree pruning: When do we stop splitting?
Pre-processing
Work with categorical features
- label encoding
- one hot encoding
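A minimal sketch of both encodings with pandas and scikit-learn; the color column and its values are made up:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Label encoding: map each category to an integer (implies an ordering)
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One-hot encoding: one binary column per category (no implied ordering)
one_hot = pd.get_dummies(df["color"], prefix="color")
print(pd.concat([df, one_hot], axis=1))
```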
Work with continuous-valued features
(ML Spec)
Say we want to make a binary split on a continuous feature:
- Sort all training examples by that feature
- Take the midpoints between consecutive values in the sorted list as candidate thresholds, and choose the threshold that gives the best purity (e.g. highest information gain)
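A minimal sketch of generating candidate thresholds for one continuous feature (the example values are made up):

```python
import numpy as np

# One continuous feature, e.g. animal weight
x = np.array([8.4, 9.2, 7.6, 10.2, 9.2])

# Sort the distinct values, then take midpoints of consecutive pairs
values = np.unique(x)                      # sorted unique values
thresholds = (values[:-1] + values[1:]) / 2
print(thresholds)                          # candidate split points x <= t
```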
Splitting
The general goal of a split is for each sub-node to contain mostly data points of one class (correctly classified). In other words, we want to maximize the purity of each node.
There are several possible metrics:
- minimize entropy (maximize information gain)
- minimize gini impurity (i.e. maximize Gini purity)
Information Gain
Transclude of entropy#formula
Example
Before a split, the entropy of a group of 5 cats and 5 non-cats is $H(0.5) = 1$. After splitting on a particular feature, one group of animals (4 of which are cats) has an entropy of $H(p_1)$; the other group of animals (1 of which is a cat) has an entropy of $H(p_2)$. What is the expression for information gain? It is the root entropy minus the weighted average entropy of the two branches: $H(0.5) - \left( w_1 H(p_1) + w_2 H(p_2) \right)$, where $w_1$ and $w_2$ are the fractions of examples sent to each branch.
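A minimal sketch of the computation, assuming (hypothetically) that each branch receives 5 of the 10 animals:

```python
import numpy as np

def entropy(p):
    """Binary entropy H(p) in bits; H(0) = H(1) = 0."""
    if p in (0, 1):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Root node: 5 cats out of 10 animals
h_root = entropy(5 / 10)            # = 1.0

# Hypothetical split: 5 animals per branch (4 cats left, 1 cat right)
w_left, w_right = 5 / 10, 5 / 10
h_left, h_right = entropy(4 / 5), entropy(1 / 5)

info_gain = h_root - (w_left * h_left + w_right * h_right)
print(info_gain)                    # ≈ 0.28
```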
Gini (Binary tree)
Transclude of gini-impurity
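For a binary node, the Gini impurity is $G = 1 - p^2 - (1-p)^2$, where $p$ is the fraction of one class; a split is scored by the weighted average of its children's impurities. A minimal sketch, reusing the hypothetical branch sizes from the entropy example:

```python
def gini(p):
    """Gini impurity of a binary node, p = fraction of one class."""
    return 1 - p**2 - (1 - p)**2

# Weighted Gini impurity of the hypothetical split (5 animals per branch)
w_left, w_right = 5 / 10, 5 / 10
split_gini = w_left * gini(4 / 5) + w_right * gini(1 / 5)
print(gini(5 / 10), split_gini)     # 0.5 at the root vs 0.32 after the split
```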
Tree pruning
To avoid overfitting, we stop splitting when the improvement in purity becomes insignificant, typically by setting constraints on one or more of these hyperparameters:
- min_samples_split: minimum number of samples required in a node for it to be considered for splitting
- min_samples_leaf: minimum number of samples per leaf
- max_depth: maximum depth of the tree
- max_leaf_nodes: maximum number of leaves
- max_features: maximum number of features to consider when searching for a split. As a rule of thumb, the square root of the total number of features, but it is worth checking up to 30-40% of the total.
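A minimal sketch of how these constraints cap tree growth, using a synthetic dataset from make_classification (the constraint values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(
    max_depth=5, min_samples_split=20, min_samples_leaf=10,
    max_leaf_nodes=16, max_features="sqrt", random_state=0,
).fit(X, y)

print(full.get_depth(), full.get_n_leaves())      # unconstrained tree
print(pruned.get_depth(), pruned.get_n_leaves())  # much smaller tree
```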
Tree ensemble
Advantages of tree-based models
- High accuracy
- Require minimal pre-processing (e.g. scaling, normalizing)
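For instance, a random forest averages many trees grown on bootstrapped samples with random feature subsets, which reduces variance. A minimal sketch with scikit-learn's RandomForestClassifier (the data and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 trees, each limited in depth; averaging reduces variance/overfitting
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```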
Code
Decision Tree sklearn GridSearch
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Read the data.
data = pd.read_csv('data.csv', header=None)
# Assign the features to the variable X, and the labels to the variable y.
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
# Create & fit the model
model = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=5)
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
# Calculate the accuracy on the training data (in-sample accuracy)
acc = accuracy_score(y, y_pred)
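Since the heading mentions GridSearch, here is a minimal sketch of tuning the same hyperparameters with scikit-learn's GridSearchCV, assuming the X and y loaded above (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["entropy", "gini"],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validated search over the hyperparameter grid
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```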