Mining Association Rules: Two Steps#
Association rule mining usually has two main steps:
Step 1: Frequent Itemset Generation#
First, we find all itemsets that appear frequently (satisfy the minimum support threshold) in the dataset.
Example frequent itemsets:
- {Bread}
- {Milk}
- {Bread, Butter}
- {Milk, Bread}
Step 2: Rule Generation#
After finding frequent itemsets, we generate association rules from them.
For example, from the frequent itemset:
{Bread, Butter}
we can generate rules such as:
- Bread → Butter
- Butter → Bread
Then we evaluate these rules using measures such as confidence and lift.
As datasets grow larger, the number of possible item combinations explodes: with $d$ distinct items there are $2^d - 1$ possible itemsets, so checking every one of them quickly becomes computationally infeasible.
To make frequent itemset generation more efficient, we use the Apriori Principle, which lets us discard candidate itemsets that cannot possibly be frequent without ever counting them.
Apriori Principle#
The Apriori principle helps reduce the number of itemsets we need to check.
It says:
If an itemset is frequent, then all of its subsets must also be frequent.
It also means:
If an itemset is not frequent, then any larger itemset containing it cannot be frequent.
Figure: Apriori Principle: If an itemset is not frequent, then any larger itemset containing it cannot be frequent
Example:
If {Bread, Butter} is not frequent, then {Bread, Butter, Milk} cannot be frequent either.
This saves time because we do not need to check larger itemsets that contain infrequent smaller itemsets.
Apriori Algorithm#
The Apriori algorithm is a classic algorithm used for frequent itemset generation.
It works level by level:
1. Find all frequent 1-itemsets by counting the support of each individual item.
2. Generate candidate 2-itemsets from the frequent 1-itemsets, using the Apriori principle to skip any candidate that contains an infrequent subset.
3. Count the support of each remaining candidate itemset.
4. Remove candidates whose support count is below the minimum threshold.
5. Repeat the process for larger itemsets until no new frequent itemsets are found.
At each step, Apriori prunes itemsets that cannot possibly be frequent, which greatly reduces computation.
Figure: Frequent itemset generation using the Apriori algorithm with minimum support count = 2. Itemsets marked with * are considered frequent because their support count satisfies the minimum threshold.
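To make the level-by-level procedure concrete, here is a minimal Python sketch of frequent itemset generation. It assumes the five-transaction dataset used in the example later in this post and the figure's minimum support count of 2; the function name `apriori` and the data layout are illustrative, not a reference implementation.

```python
# Minimal, illustrative sketch of level-wise frequent itemset generation.
from itertools import combinations

def apriori(transactions, min_count=2):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every individual item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets to form candidate k-itemsets, then
        # prune any candidate with an infrequent (k-1)-subset (Apriori principle).
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}

        # Count support for the surviving candidates and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        all_frequent.update(frequent)
        k += 1

    return all_frequent

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Bread", "Diaper", "Beer"},
    {"Milk", "Bread", "Diaper"},
]
for itemset, count in sorted(apriori(transactions).items(), key=lambda x: len(x[0])):
    print(set(itemset), count)
```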
After discovering frequent itemsets, the next step is to generate useful association rules from them.
Rule Generation#
Once we find frequent itemsets, we generate association rules.
For example, if {Bread, Butter} is frequent, possible rules include:
- Bread → Butter
- Butter → Bread
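As a sketch of how such candidates are enumerated, the snippet below splits a frequent itemset into every possible non-empty antecedent/consequent pair; the helper name `candidate_rules` is illustrative.

```python
# Illustrative sketch: enumerate all antecedent -> consequent splits
# of one frequent itemset.
from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) pairs for one frequent itemset."""
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            antecedent = frozenset(antecedent)
            yield antecedent, items - antecedent

for lhs, rhs in candidate_rules({"Bread", "Butter"}):
    print(set(lhs), "->", set(rhs))
# Prints: {'Bread'} -> {'Butter'} and {'Butter'} -> {'Bread'}
```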
To evaluate rules, we commonly use:
(1) Support#
Support tells us how often an itemset appears in the dataset, measured as the fraction of transactions that contain it.
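This is commonly written as (with $N$ the total number of transactions):

$$
\mathrm{support}(A) = \frac{\text{number of transactions containing } A}{N}
$$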
(2) Confidence#
Confidence measures how often item B appears in transactions that already contain item A.
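In formula form, this is the support of the combined itemset divided by the support of the antecedent:

$$
\mathrm{confidence}(A \rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
$$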
Example:
If 4 transactions contain Bread, and 3 of those also contain Butter:

$$
\mathrm{confidence}(\text{Bread} \rightarrow \text{Butter}) = \frac{3}{4} = 0.75
$$

This means 75% of customers who bought Bread also bought Butter.
If confidence is close to 1 (or 100%), we say the rule has strong confidence because item B almost always appears whenever item A appears.
(BONUS): Lift#
Lift measures whether two items occur together more often than expected by chance.
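The usual definition divides the observed co-occurrence by what we would expect if $A$ and $B$ were independent:

$$
\mathrm{lift}(A \rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)} = \frac{\mathrm{confidence}(A \rightarrow B)}{\mathrm{support}(B)}
$$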
Interpretation:
Lift > 1: A and B occur together more often than expected, so they are positively associated.
Lift = 1: A and B are independent.
Lift < 1: A and B occur together less often than expected, so they are negatively associated.
For example, a high lift value means customers who buy item A are much more likely to also buy item B compared to a random customer.
Simple Example#
Suppose we have these transactions:
| Transaction | Items |
|---|---|
| T1 | Milk, Bread |
| T2 | Milk, Diaper, Beer |
| T3 | Bread, Diaper, Beer |
| T4 | Milk, Bread, Diaper, Beer |
| T5 | Milk, Bread, Diaper |
A possible association rule is:

Diaper → Beer

This means customers who buy diapers may also buy beer.
Suppose:

- Diaper appears in 4 transactions
- Diaper and Beer together appear in 3 transactions

Then:

$$
\mathrm{confidence}(\text{Diaper} \rightarrow \text{Beer}) = \frac{3}{4} = 0.75
$$

This means 75% of customers who bought diapers also bought beer. Because most transactions containing diapers also contain beer, this rule has high confidence.
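As a quick check of these numbers, here is a small Python sketch that computes support, confidence, and lift for this rule on the five transactions above; the helper names are illustrative.

```python
# Illustrative sketch: evaluate Diaper -> Beer on the example transactions.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Bread", "Diaper", "Beer"},
    {"Milk", "Bread", "Diaper"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(lhs, rhs):
    """How often `rhs` appears in transactions that already contain `lhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    """Ratio of observed co-occurrence to what independence would predict."""
    return confidence(lhs, rhs) / support(rhs)

print(support({"Diaper", "Beer"}))       # 3/5 = 0.6
print(confidence({"Diaper"}, {"Beer"}))  # 3/4 = 0.75
print(lift({"Diaper"}, {"Beer"}))        # 0.75 / 0.6 = 1.25, i.e. > 1
```

Note that the lift of 1.25 is greater than 1, so in this toy dataset Diaper and Beer occur together more often than independence would predict.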