Monday, October 26, 2009

Machine Learning: Association Rule

A typical example is the "market-basket" problem. Let's say a supermarket keeps track of all its purchase transactions. Each purchase transaction is a subset of all the items available in the store, e.g. {beer, milk, diaper, butter}.

The problem is: by analyzing a large set of transactions, can we discover correlations between subsets? i.e. people buying milk and butter have a high tendency to also buy diapers, or people buying diapers tend to buy soda and ice-cream.

Such correlation is called an association rule, which has the following form:
A => B where A, B are disjoint subsets of U (a universal set)

This rule can be interpreted as: from the transaction statistics, people buying all items in set A tend to also buy all items in set B.

Note that a transaction containing both set A and set B is denoted as containing (A union B) rather than (A intersect B), since it must hold all the items of A together with all the items of B.

There are two concepts that need to be defined here ...

"Support" is defined with respect to a subset X as the % of total transaction that has contains subset X. This can be indicate as P(contains X). e.g. The support of {beer, diaper} is P(contains {beer, diaper}) which means if we randomly pick a transaction, how likely that it will contain both beer and diaper.

"support" of an association rule A => B is defined as the "support" of (A union B)

"Confidence" is defined with respect to a rule (A => B) that given we know a transaction contains A, how likely that it also contains B.

P(contains B | contains A) = P(contains A union B) / P(contains A), which is the same as Support(A union B) / Support(A)
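As a sketch, confidence can be computed directly from two support values (the helper names and toy data are again illustrative):

    # Support of an itemset = fraction of transactions containing all its items.
    def support(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    # confidence(A => B) = support(A union B) / support(A)
    def confidence(a, b, transactions):
        return support(a | b, transactions) / support(a, transactions)

    transactions = [
        {"beer", "milk", "diaper", "butter"},
        {"beer", "diaper"},
        {"milk", "butter"},
        {"beer", "milk", "diaper"},
    ]
    print(confidence({"milk", "butter"}, {"diaper"}, transactions))  # 0.5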

Mining Association Rules

The problem is: how can we discover the association rules that have high enough "support" and "confidence"? First of all, thresholds of "support" and "confidence" are set according to domain-specific concerns. There are two phases.

1) Extract all subsets X where support(X) > thresholdOfSupport
2) For each extracted subset X, discover rules A => B where A is a subset of X and B is (X - A)

1) is also known as the "finding frequent subsets" problem. A naive implementation can generate all possible subsets and check their support values. This naive approach has exponential complexity: with N items there are 2^N subsets to examine.
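As a sketch (with illustrative names), the naive approach looks like this; it works for a handful of items but must examine 2^N candidates:

    from itertools import combinations

    # Enumerate every non-empty subset and keep those above the threshold.
    def naive_frequent_subsets(items, transactions, threshold):
        frequent = []
        for k in range(1, len(items) + 1):
            for candidate in combinations(sorted(items), k):
                s = set(candidate)
                sup = sum(1 for t in transactions if s <= t) / len(transactions)
                if sup >= threshold:
                    frequent.append((s, sup))
        return frequent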

Apriori Algorithm to find frequent subsets

This algorithm exploits the fact that if X is not a frequent subset, then (X union Anything) will never be a frequent subset. So it starts with scanning small subsets and throws away those that don't have high enough support. In other words, it prunes the tree as it grows.

Let's say the universal set is {I1, I2, I3, I4, I5, I6}

First round:
  • Generate the possible candidates of 1-item subsets, i.e. {I1}, {I2}, ... {I6}
  • Find the supports of all candidates, i.e. support({I1}), support({I2}), ... support({I6})
  • Filter out those whose value < supportThreshold
Second round:
  • From the surviving 1-item subsets, generate the possible 2-item subset candidates, i.e. {I1, I2}, {I1, I4} ... Note that if {I3} was filtered out in the first round, we can skip any subset that contains I3.
  • Find the supports of all the 2-item candidates.
  • Filter out those whose value < supportThreshold

K round: (repeat until no more surviving (k-1)-item subsets)
  • From the surviving (k-1)-item subsets, generate the possible k-item subset candidates by adding one more item that is not already in the (k-1)-item subset. Skip any k-item subset that contains a (k-1)-item candidate thrown away in the last round.
  • Calculate the support of the k-item candidates.
  • Throw away those whose support value < supportThreshold (see the Python sketch after this list).
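Here is a compact Python sketch of this loop (the threshold and helper names are illustrative):

    from itertools import combinations

    def apriori(transactions, threshold):
        def sup(itemset):
            return sum(1 for t in transactions if itemset <= t) / len(transactions)

        # Round 1: frequent 1-item subsets.
        items = set().union(*transactions)
        frequent = [frozenset([i]) for i in items if sup(frozenset([i])) >= threshold]
        all_frequent = list(frequent)

        k = 2
        while frequent:
            survivors = set(frequent)
            # Generate k-item candidates from pairs of surviving (k-1)-item
            # subsets, skipping any candidate that contains a thrown-away
            # (k-1)-item subset.
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
            candidates = {c for c in candidates
                          if all(frozenset(sub) in survivors
                                 for sub in combinations(c, k - 1))}
            frequent = [c for c in candidates if sup(c) >= threshold]
            all_frequent.extend(frequent)
            k += 1
        return all_frequent

For example, apriori(transactions, 0.5) on the toy data above finds {beer, diaper} and {beer, milk, diaper} among the frequent subsets.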
Find association rules from frequent subsets

After finding a frequent k-item subset X, we want to find its subsets A such that the confidence value of (A => X-A) is higher than the confidence threshold.

Note that confidence = support(X) / support(A), since (A union (X - A)) is simply X.

Since for a given X, support(X) is fixed, we start by looking for the lowest support(A), because that gives the highest confidence. So we do the reverse process: start from the largest subsets A within X, and shrink the set A in each round.

Notice that if support(A) is not low enough, there is no need to try any subset A' of A, because support(A') can only be higher. Therefore we focus our energy only on subsets of the surviving A.

First round:
  • Within X, generate the possible candidates of (k-1)-item subsets.
  • Find the confidence of each candidate.
  • Filter out those whose confidence value < confidenceThreshold
  • For each surviving (k-1)-item subset A, mark the rule (A => X-A)
J round: (repeat until no more surviving (j-1)-item subsets)
  • Within each surviving j-item subset A, generate the possible candidates of (j-1)-item subsets.
  • Find the confidence of each candidate.
  • Filter out those whose confidence value < confidenceThreshold
  • For each surviving (j-1)-item subset A', mark the rule (A' => X-A') (see the Python sketch after this list).
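A sketch of this reverse pruning in Python, where X is one frequent itemset from the previous phase (names are illustrative):

    from itertools import combinations

    def rules_from_frequent(x, transactions, confidence_threshold):
        def sup(itemset):
            return sum(1 for t in transactions if itemset <= t) / len(transactions)

        rules = []
        if len(x) < 2:  # need at least two items to split X into A and X - A
            return rules
        sup_x = sup(x)
        # First round: all (k-1)-item antecedents A within the k-item subset X.
        antecedents = {frozenset(a) for a in combinations(x, len(x) - 1)}
        while antecedents:
            survivors = []
            for a in antecedents:
                if sup_x / sup(a) >= confidence_threshold:
                    rules.append((set(a), set(x) - a))  # mark the rule A => X - A
                    survivors.append(a)
            # Next round: shrink only the surviving antecedents by one item.
            antecedents = {frozenset(a2)
                           for a in survivors if len(a) > 1
                           for a2 in combinations(a, len(a) - 1)}
        return rules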
Miscellaneous

Note that confidence is an absolute measure, not a relative one. When (A => B) has confidence = 75%, it is also possible that (!A => B) has confidence = 90%. In other words, some rules may contradict each other, and usually the one with higher support and confidence wins.

1 comment:

Anonymous said...

"...[o]r people buying diaper tends to buy soda and ice-cream."

I think you meant to say:

"...[o]r people buying diaper tends to buy beer."