Association rule mining is a data mining technique used to uncover meaningful associations, or correlations, frequent patterns, between various items in a database. It is commonly applied in areas like market basket analysis, web usage mining, and more. The primary goal is to identify rules that predict the occurrence of an item based on the presence of other items in a transaction. These rules are used for finding the latest trends and patterns in data and help in business decision-making processes. Many association rule mining algorithms follow a "support-confidence" framework, which involves two main steps. First, the minimum support is used to identify all frequent item sets in the dataset. Second, these frequent item sets, along with the minimum confidence constraint, are used to generate rules. Identifying all frequent item sets in a dataset is challenging due to the combinatorial nature of the task.
The MBANALYSIS procedure in SAS Viya utilizes the frequent-pattern growth (FP-growth) algorithm, developed by Han, Pei, and Yin (2000), to identify frequent item sets and subsequently generates rules based on these item sets. In this post (part-1), I will discuss the workings of the FP-growth algorithm, detailing how it identifies frequent item sets and generates association rules. I will provide a step-by-step overview of the algorithm's process. In part-2, I will walk through its implementation in SAS, showcasing how the MBANALYSIS procedure leverages this algorithm to perform association rule mining. By the end, you will have a clear understanding of both the theoretical foundation and practical application of the FP-growth algorithm in SAS.
This algorithm improves upon the Apriori algorithm by generating frequent patterns without enumerating all the candidates in search space. The FP-growth algorithm organizes the database into a tree structure called a frequent pattern tree (FP-tree). The FP-tree can store frequent item sets in a highly compressed format. To maintain compactness and informativeness, only frequent length-1 items are included as nodes in the tree.
Next, for each item in FP-tree, a Conditional Frequent Pattern Tree is constructed. Then it discovers the frequent item sets by traversing the conditional FP-tree in a depth-first-search (DFS).
Picture 1
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
Let's assume we have a transactions data set from a retail store. Each transaction contains a list of items purchased by a customer. The two columns in the table represents our transaction database. Now we will derive a compact data structure from the transaction table. Since only the frequent items will play a role in the frequent-pattern mining, we first derive the set of frequent items (1-item sets) and their support counts (frequencies). To do this we just count the number of times an item appeared in the transaction database, and we find that item 'A' appeared in 4 baskets, 'E' appeared in 2 baskets and so on.
Next for every transaction, we remove the items that are below the minimum support. In this example, we set a support threshold of 3 (i.e., an item set must appear in at least 3 transactions to be considered frequent). Thus, the items E,G,H,S, and U are removed. The set of frequent items is then sorted in the order of descending support count or frequency. Since item ‘A’ and 'N' has highest count of 4 it appears at the top followed by other items in the descending order of support count. This resulting Frequent-Pattern set, or list of frequent items is denoted by L. So, in our example the frequent pattern set L = {(A: 4), (N: 4), (J:3), (M: 3), (O: 3)}. The number after ":" indicates the support.
Then, for each transaction, the corresponding Ordered-Item set is constructed by iterating through the Frequent-Pattern set and verifying whether each item is present in the transaction. If the item is found, it is added to the Ordered-Item set for that specific transaction. This ordering is important since each path of a tree will follow this order. The picture below shows the list of ordered frequent items generated from transaction database.
Picture 2
Note- If multiple transactions share a set of frequent items, it may be possible to merge the shared sets with the number of occurrences registered as count. It is easy to check whether two sets are identical if the frequent items in all of the transactions are listed according to a fixed order.
Constructing an FP- tree:
FP- tree has two components-
Item-prefix tree- Each node in the item prefix tree consists of three fields: item-name, count, and node-link. Here item-name registers which item this node represents, count registers the number of transactions represented by the portion of the path that reaches this node, and node-link links to the next node in the FP-tree that carries the same item name, or null if there is none.
Frequent-item-header table- Each entry in the frequent-item header table consists of two (or three) fields: item-name, count (frequency count of this one-item set), and head of node-link. Head of node-link is a pointer to the first node in the FP-tree carrying the item-name. In our running example, the Header table appears as below-
Picture 3
Now, we begin constructing an FP-tree. First, create the root of the tree, labeled with “Null.” Then we scan the transaction database.
The scan of the first transaction T1, which contains six items leads to the construction of the first branch of the tree with five nodes: (A :1), (N:1), (J:1), (M:1), (O:1) where A is linked as a child to the root (null) node, N is linked as a child to N, J is linked to N and so on. Notice that the frequent items in the transaction are listed according to the order in the frequent pattern set L.
Picture 4
In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.
The scan of the second transaction T2 that contains items (N, O) leads to the construction of the second branch of the tree. This is because N is the first item in the list and does not share any common prefix with the existing path for T1 therefore it will be connected to root node and is initialized with the count as 1. Then we create a new node, (O:1), which is linked as a child to (N:1).
Picture 5
For the third transaction (T3), since its ordered frequent item list (A, J) shares only the node (A) with the A-prefix subtree, A's count is incremented by 1 and a new node (J:1) is created and linked as a child of (A:2). For the fourth transaction (T4), since its ordered frequent item list, (A, N, M, O) shares a common prefix (A, N) with the existing path (A. N, J, M, O) the count of each node (N) and (A) along the prefix is incremented by 1 and one new node (M:1) is created and linked as a child of (N:2) and another new node (O:1) is created and linked as the child of (M:1) .
For the last transaction (T5), since its ordered frequent item list (A, N, J, M) shares the node (A), (N), (J), and (M) with the existing path (A, N, J, M, O) the count of each node along the path is incremented by 1.
Now we will see how to explore the compact information stored in an FP-tree for mining the complete set of frequent patterns. Start from each frequent length-1 pattern, construct its conditional pattern base, then construct its (conditional) FP-tree, and perform mining recursively on the tree. We examine the mining process by starting from the bottom of the node-link header table (refer picture 3).
Constructing conditional pattern bases
For node O, its immediate frequent pattern is (O:3) and it has three paths in the FP-tree: (A:4, N:3, J:2, M:2, O:1), (A:4, N:3, M:1, O:1) and (N:1, O:1). The first path indicates that string "(A, N, J, M, O)" appears once in the database. This path also indicates that string "(A, N)" appears three times, string "(A, N, J, M)" appears 2 times and (A) itself appears even four times. However, they only appear once together with O. Thus, to study which string appear together with O, only O's prefix path (A:1, N:1, J:1, M:1) or simply (A, N, J, M :1) counts. The second path indicates string "(A, N, M, O)" appears once in the database or O's prefix path is (A, J, M:1). Similarly, for third path, O's prefix path is (N:1). These three prefix paths of O, "{(A, N, J, M:1), (A, N, M:1), (N:1)}" form O's sub pattern-base, which is called O's conditional pattern base (i.e., the sub pattern base under the condition of O's existence).
Picture 6
For node M, its immediate frequent pattern is (M:3), and it has two paths, (A: 4, N: 3, J: 2, M: 2) and (A: 4, N: 3, M: 1). So, the M's conditional pattern base is (A: 2, N: 2, J: 2) or simply (A, N, J: 2) and (A, N: 1).
Similarly, for node J, its immediate frequent pattern is (J:3), and it has two paths, (A: 4, N: 3, J: 2) and (A: 4, J: 1). So, the J's conditional pattern base is (A, N: 2) and (A :1). And for node N, its immediate frequent pattern is (N:4), and it derives one sub pattern-base (A :3). Node A derives only (A:4) but no conditional pattern-base.
Picture 7
Constructing conditional FP- trees
For each item, we construct a conditional frequent pattern tree and perform mining recursively on the tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree. The conditional FP-tree is constructed by identifying the set of elements that are common across all the paths in the item's conditional pattern base and calculating its support count by summing the support counts of all the paths in that base.
In the running example to build an O's conditional FP- tree we identify the element(s) common across all the three paths in item O's conditional pattern base. Only item N occurs in all the three paths, so summing the support counts of item N in all the paths yields a single branch (N:3) with a support count of 3. As a result only one frequent pattern (N,O: 3) is generated. To get the frequent pattern we concatenate the suffix pattern i.e. O in this case with the frequent patterns generated. The search for frequent patterns related to O then concludes.
Picture 8
For item M, its conditional pattern-base generates a single path conditional FP-tree (A:3, N:3, J:3). Thus, its set of frequent patterns can be generated by taking their combinations. Concatenating them with (M:3) we have set of frequent patterns {(A, M: 3), (N, M: 3), (J, M: 3)}. Item J generates a single path conditional FP-tree (A:3) and one frequent pattern (A, J :3). Again, item N, generates a single path conditional FP-tree (A:3) and one frequent pattern (A, N :3).
Finally, item A has no conditional pattern base hence no conditional frequent pattern.
For each row, the conditional frequent pattern column helps derive association rules. For instance, in the first row containing the element ‘O’, the rules N -> O and O -> N can be inferred. To identify the valid rule, the confidence of both rules is calculated, and the rule with a confidence level greater than or equal to the minimum confidence threshold is kept. In the next post I will discuss how this algorithm is used to perform association rule mining in SAS Viya.
References:
Find more articles from SAS Global Enablement and Learning here.
The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.