Section 2: The Tree-Building Algorithm: A Greedy, Recursive Process#

The construction of a Decision Tree is a greedy, recursive, top-down process known as Recursive Binary Splitting (or recursive partitioning).

Key Idea: The core idea behind building a decision tree is recursive partitioning: repeatedly splitting the dataset on the attribute that best separates the classes (or, for regression, most reduces the error).

Detailed Concept of Decision Tree Learning Algorithm:#

  • Greedy: At each node, the algorithm selects the locally optimal split without considering the overall tree structure or future splits. It makes the best decision it can at that moment.

  • Recursive: The process is repeated for each subset of the data created by previous splits.

  • Top-Down: It begins at the root and successively splits the dataset into smaller groups.

The Core Algorithm (ID3 and its successors C4.5, CART):#

The Iterative Dichotomiser 3 (ID3) algorithm constructs decision trees using a top-down approach. At each step, ID3 selects the most informative attribute, i.e., the one whose split yields the highest information gain (the largest decrease in impurity), and uses it to partition the data.

Let’s break down the ID3 algorithm into a step-by-step process:

  • ➤ 1. Start at the root node with the entire dataset.

  • ➤ 2. For each feature in the dataset, calculate a “goodness” metric (e.g., Information Gain, Gini Impurity) for all possible split points.

In this chapter, we will focus on “Entropy” and “Information Gain”.

  • ➤ 3. Select the feature and split point that maximizes this “goodness” metric. This becomes the splitting rule for the node.

  • ➤ 4. Split the dataset at the current node into child nodes based on the selected rule.

  • ➤ 5. Recursively repeat steps 2-4 for each child node until a stopping criterion is met (a minimal Python sketch of this recursion follows the list of stopping criteria below).

Common stopping criteria include:

  • The node is “pure” (homogeneous: all samples belong to the same class).

  • No more attributes remain for further splitting.

  • A stopping condition (like a predefined maximum depth) is reached.

  • The node contains fewer samples than the predefined minimum threshold.

  • A split does not produce a significant improvement (minimum gain threshold).
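To tie the steps together, here is a minimal Python sketch of this greedy, recursive, top-down procedure. It is an illustrative sketch, not a reference ID3 implementation: the entropy-based scoring previews the metric defined below, the helper names (entropy, best_split, build_tree), the numeric thresholds, and the toy dataset are assumptions made for this example, and the parameters max_depth, min_samples, and min_gain mirror the stopping criteria listed above.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels (0 when the node is pure).
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels):
    # Greedy step: try every feature and threshold, keep the split with the highest information gain.
    parent = entropy(labels)
    best = None  # (gain, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, f, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_samples=2, min_gain=1e-6):
    # Stopping criteria: pure node, too few samples, or maximum depth reached.
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(labels) < min_samples or depth >= max_depth:
        return {"leaf": True, "prediction": majority}
    split = best_split(rows, labels)
    # Also stop when no split gives a meaningful improvement (minimum gain threshold).
    if split is None or split[0] < min_gain:
        return {"leaf": True, "prediction": majority}
    gain, f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]
    return {
        "leaf": False,
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth, min_samples, min_gain),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth, min_samples, min_gain),
    }

# Toy dataset: [hours_studied, hours_slept] -> Pass / Fail (made up for illustration)
X = [[1, 4], [2, 5], [8, 7], [9, 6]]
y = ["Fail", "Fail", "Pass", "Pass"]
print(build_tree(X, y))

On the toy data, the sketch finds a single split on the first feature that leaves both child nodes pure, so the recursion stops immediately.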

Splitting Criteria (Attribute Selection Measures):#

The algorithm needs a metric to determine the “best” split. For classification trees, the goal is to maximize the homogeneity (purity) of the resulting subsets. The most common metrics are Entropy and Information Gain (used in ID3 and C4.5) and Gini Impurity (used in CART); this chapter focuses on Entropy and Information Gain.
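As a preview of how a split is scored with these metrics: information gain compares the entropy of the parent node with the weighted average entropy of the child nodes produced by splitting on an attribute \(A\):

\[ Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v) \]

where \(S_v\) is the subset of \(S\) for which attribute \(A\) takes the value \(v\). The split with the largest gain is chosen.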

Entropy#

Entropy measures the randomness or impurity in the dataset.

In information theory, the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent to the variable’s possible outcomes.

In a decision tree, a node that contains a mix of different outcomes (for example, 2 Pass and 2 Fail) has higher entropy, meaning more disorder or uncertainty. If a node has only one type of outcome (for example, all Pass or all Fail), it is pure and has low entropy.

  • Maximum entropy = 1 (for a two-class problem) → occurs when the outcomes are equally mixed (e.g., 50% Pass, 50% Fail).

  • Minimum entropy = 0 → occurs when all outcomes are the same (e.g., 100% Pass or 100% Fail).

In short: mixed outcomes → high entropy (uncertain); a single outcome → low entropy (certain).

[Figure: Pandas illustration of splitting data into purer groups]

The first thing to understand about Decision Trees is that they split the data into smaller groups, based on the input features, so that each group is more homogeneous (or “pure”) with respect to the target variable.

For example, if the target variable has two possible outcomes — say 1 and 0 (represented by green and red dots) — the Decision Tree looks for splits in the features that create groups mostly containing either 1’s or 0’s, rather than a mix of both.

In simple terms, the tree keeps splitting the data using feature values so that each group becomes as pure as possible in terms of the target outcome.
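To make this concrete, here is a small pandas sketch. The DataFrame, the feature name hours_studied, and the threshold of 3 are invented for illustration; the point is only that each group produced by the split is dominated by a single target value.

import pandas as pd

# Toy data: target is 1 (green) or 0 (red); hours_studied is an illustrative feature
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 6, 7, 8, 9],
    "target":        [0, 0, 0, 1, 1, 1, 1, 1],
})

# Split on a candidate threshold and inspect how pure each group is
left = df[df["hours_studied"] <= 3]   # this group contains only 0's
right = df[df["hours_studied"] > 3]   # this group contains only 1's

print(left["target"].value_counts(normalize=True))
print(right["target"].value_counts(normalize=True))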


Definition#

Entropy is defined as:

\[ Entropy(S) = - \sum_{k=1}^{c} p_k \log_2(p_k) \]

where:

  • \(S\) = the current node (or dataset)

  • \(c\) = number of classes

  • \(p_k\) = proportion of data points in class \(k\) within node \(S\)
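For the common two-class case (e.g., Pass vs. Fail), with a fraction \(p\) of one class and \(1 - p\) of the other, the formula reduces to:

\[ Entropy(S) = -p \log_2(p) - (1 - p) \log_2(1 - p) \]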

Properties of Entropy#

  • Entropy = 0:
    The node is pure — all samples belong to a single class (no uncertainty).

  • Entropy = 1:
    The node is maximally impure — classes are equally mixed (e.g., 50% Pass, 50% Fail).

  • Range:
    \( 0 \leq Entropy(S) \leq 1 \) for binary classification.
    (For a problem with \(c\) classes, entropy can be as high as \(\log_2(c)\).)
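As a quick check of that bound: if all \(c\) classes are equally likely, then \(p_k = \frac{1}{c}\) for every class, and

\[ Entropy(S) = -\sum_{k=1}^{c} \frac{1}{c} \log_2\left(\frac{1}{c}\right) = \log_2(c) \]

so, for example, four equally mixed classes give an entropy of \(\log_2(4) = 2\).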


Example#

If a node contains 4 samples:

  • 3 are Pass,

  • 1 is Fail,

then:

\[ Entropy(S) = -\left( \frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right) \]
\[ Entropy(S) = -(0.75 \times -0.415) - (0.25 \times -2) \]
\[ Entropy(S) \approx 0.81 \]

So, this node has some impurity but is not completely mixed.

Try it Yourself#

If a node contains 5 examples, where 4 are Pass and 1 is Fail, calculate the entropy.

You can use the Python function below to verify your answer!

import math

def entropy(class_counts):
    """
    Compute entropy for a list of class counts.
    class_counts: list of counts for each class
    """
    total = sum(class_counts)
    entropy_value = 0
    for count in class_counts:
        if count == 0:  # avoid log(0)
            continue
        p = count / total
        entropy_value -= p * math.log2(p)
    return entropy_value

# Example 1: 4 Pass, 1 Fail
print("Entropy (4 Pass, 1 Fail):", round(entropy([4, 1]), 3))

# Example 2: 2 Pass, 2 Fail (max impurity)
print("Entropy (2 Pass, 2 Fail):", round(entropy([2, 2]), 3))

# Example 3: All Pass (pure node)
print("Entropy (5 Pass, 0 Fail):", round(entropy([5, 0]), 3))
Entropy (4 Pass, 1 Fail): 0.722
Entropy (2 Pass, 2 Fail): 1.0
Entropy (5 Pass, 0 Fail): 0.0