Information Gain (IG)#

Once we know how to calculate Entropy, we can measure how much uncertainty (or impurity) is removed when we split a dataset using a feature. This reduction in entropy achieved by a split is called Information Gain.

Definition#

Formally, suppose a split divides the parent node \(S\) into \(V\) subsets:
\[ S_1, S_2, \dots, S_V \]

Let:

  • \(|S|\) = total number of samples in the parent node

  • \(|S_v|\) = number of samples in child node \(S_v\)

The Information Gain for a feature \(A\) is defined as:

\[ IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v) \]

where:

  • \(S\) = parent node (original dataset)

  • \(A\) = feature used for splitting

  • \(S_v\) = subset of \(S\) corresponding to value \(v\) of feature \(A\)

  • \(|S|\), \(|S_v|\) = number of samples in \(S\) and \(S_v\), respectively

The algorithm selects the feature with the highest Information Gain for splitting.
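
For a feature with only two values, High and Low (the case used in the worked example below), the sum in this definition reduces to two terms:

\[ IG(S, A) = Entropy(S) - \left( \frac{|S_{High}|}{|S|}\,Entropy(S_{High}) + \frac{|S_{Low}|}{|S|}\,Entropy(S_{Low}) \right) \]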


Intuition#

Information Gain measures how much a split reduces impurity:

  • A high IG means the feature created purer subsets (good split).

  • A low IG means the feature did not reduce impurity much (poor split).

Decision Trees choose the feature with the highest Information Gain at each step.
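
As a quick illustration of this contrast, the snippet below uses made-up class counts (a hypothetical 50/50 parent node of 10 samples) and compares a split that produces pure children with one whose children keep the parent's class mix:

import math

def H(counts):
    """Entropy (in bits) of a list of class counts; a minimal helper for this sketch."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

parent = [5, 5]                    # hypothetical: 5 Pass, 5 Fail -> H(parent) = 1.0 bit

# A split that separates the classes perfectly: each child is pure.
pure_split = [[5, 0], [0, 5]]
ig_pure = H(parent) - sum(sum(c) / 10 * H(c) for c in pure_split)

# A split whose children keep the parent's 50/50 mix: nothing is learned.
mixed_split = [[3, 3], [2, 2]]
ig_mixed = H(parent) - sum(sum(c) / 10 * H(c) for c in mixed_split)

print(ig_pure)    # 1.0 -> high IG, good split
print(ig_mixed)   # 0.0 -> zero IG, poor split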


Example#

Suppose a dataset has 10 samples:

  • 6 Pass, 4 Fail

Then the initial (parent) entropy is:

\[ Entropy(S) = -\left( \frac{6}{10}\log_2\frac{6}{10} + \frac{4}{10}\log_2\frac{4}{10} \right) \approx 0.971 \]
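
As a quick sanity check (a throwaway snippet, separate from the full example code at the end of the page):

import math

p_pass, p_fail = 6 / 10, 4 / 10
parent_entropy = -(p_pass * math.log2(p_pass) + p_fail * math.log2(p_fail))
print(round(parent_entropy, 3))  # 0.971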

Now we split based on Feature A (e.g., “Study Hours”), which has two possible values:

  • A = High: 5 samples → 4 Pass, 1 Fail

  • A = Low: 5 samples → 2 Pass, 3 Fail

Compute entropy for each group:

\[ Entropy(S_{High}) = -\left( \frac{4}{5}\log_2\frac{4}{5} + \frac{1}{5}\log_2\frac{1}{5} \right) \approx 0.722 \]
\[ Entropy(S_{Low}) = -\left( \frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5} \right) \approx 0.971 \]

Then the weighted average entropy after splitting is:

\[ Entropy_{after} = \frac{5}{10}(0.722) + \frac{5}{10}(0.971) = 0.847 \]
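
The same kind of check for the two child entropies and their weighted average (again just a verification sketch):

import math

def two_class_entropy(p, q):
    """Entropy (in bits) of a two-class node given its class probabilities."""
    return -(p * math.log2(p) + q * math.log2(q))

h_high = two_class_entropy(4 / 5, 1 / 5)    # ~0.722
h_low = two_class_entropy(2 / 5, 3 / 5)     # ~0.971
weighted = 5 / 10 * h_high + 5 / 10 * h_low
print(round(h_high, 3), round(h_low, 3), round(weighted, 3))
# 0.722 0.971 0.846  (the 0.847 above comes from averaging the already-rounded 0.722 and 0.971)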

Finally, the Information Gain is:

\[ IG(S, A) = 0.971 - 0.847 = 0.124 \]

So, splitting on Feature A reduces uncertainty by about 0.124 bits (0.125 if the intermediate entropies are not rounded, which is the value the code at the end of this page reports).

Additional Example#

Example illustrating the calculation of information gain. Source: Hendler 2018, slide 46.


Try it Yourself#

If you have another feature B that splits the same data into:

  • B = Yes: 6 samples (5 Pass, 1 Fail)

  • B = No: 4 samples (1 Pass, 3 Fail)

Try calculating:

\[ Entropy(S_{Yes}), \; Entropy(S_{No}), \; \text{and} \; IG(S, B) \]

Which feature gives a higher Information Gain — A or B?
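
The code below defines small entropy and information_gain helpers and applies them to both features, so you can check your answer against its output.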

import math

def entropy(class_counts):
    """
    Compute entropy for a list of class counts.
    class_counts: list of counts for each class
    """
    total = sum(class_counts)
    entropy_value = 0
    for count in class_counts:
        if count == 0:  # avoid log(0)
            continue
        p = count / total
        entropy_value -= p * math.log2(p)
    return entropy_value

def information_gain(parent_counts, split_counts_list):
    """
    Compute Information Gain.
    parent_counts: list of counts in the parent node
    split_counts_list: list of lists, each sublist contains counts in a child node
    """
    total_parent = sum(parent_counts)
    parent_entropy = entropy(parent_counts)

    weighted_entropy = 0
    for counts in split_counts_list:
        weight = sum(counts) / total_parent
        weighted_entropy += weight * entropy(counts)

    ig = parent_entropy - weighted_entropy
    return ig

# Example: Feature A split
# Parent node: 6 Pass, 4 Fail
parent = [6, 4]

# Feature A splits:
# High: 4 Pass, 1 Fail
# Low: 2 Pass, 3 Fail
split_A = [[4, 1], [2, 3]]

print("Information Gain for Feature A:", round(information_gain(parent, split_A), 3))

# Feature B splits:
# Yes: 5 Pass, 1 Fail
# No: 1 Pass, 3 Fail
split_B = [[5, 1], [1, 3]]

print("Information Gain for Feature B:", round(information_gain(parent, split_B), 3))
Information Gain for Feature A: 0.125
Information Gain for Feature B: 0.256
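
Since 0.256 > 0.125, Feature B removes more uncertainty than Feature A, so a Decision Tree would prefer splitting on B first.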