Market Basket Analysis (part 2): Implementing Frequent-Pattern Growth Algorithm in SAS Viya

1 Like

In my previous post, I explored how the Frequent-Pattern growth algorithm works, explaining how it identifies frequent item sets and generates association rules. In this article, I showcase how to perform association rule mining on a transaction database using the MBANALYSIS procedure. It employs the FP-growth algorithm by Han, Pei, and Yin (2000) to identify frequent item sets and subsequently generates rules based on these sets.

Let’s take an example from a grocery store context to better understand the concepts of market basket analysis. For simplicity, let's assume that the lowest level of the taxonomy represents the actual items in the customer baskets. The item taxonomy for this example is shown below.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

The transaction data is shown in the table below. Note that each transaction contains information about the customer, and item or product. The Customer ID variable is named Customer. The target variable is named Item. To perform association discovery, the input data table must have a separate observation for each product that is purchased by each customer. You must assign the ID role to one variable and the target role to another variable when you create the data source. Only one variable is supported as the customer variable and only one variable is supported as the target variable.

Training Model in SAS Studio

Launch SAS Studio and submit following program to start a CAS session and assign libraries.

cas;
caslib _ALL_ assign;

Next, I will create the transaction data set (ProdSales) and load it into CASUSER library using following DATA steps

data casuser.ProdSales;
length Customer $ 10 Item $ 20 ;
retain Customer;
customer='Anne';
Item='Low Fat Milk'; output;
Item='Cheddar Cheese'; output;
Item='Cake'; output;
Item='Frozen Pizza'; output;
Item='Ice Cream'; output;
Item='Pancakes'; output;
customer='Bob';
Item='Low Fat Milk'; output;
Item='Swiss Cheese'; output;
Item='Frozen Pizza'; output;
Item='Ice Cream'; output;
customer='Chris';
Item='Skim Milk'; output;
Item='Swiss Cheese'; output;
Item='Ice Cream'; output;
Item='Cake'; output;
run;

The MBANALYSIS procedure supports hierarchical transaction data as input, so now I will create two hierarchies, casuser.BaseLevel and casuser.ParentLevel using the following DATA steps

data casuser.BaseLevel;
length Item $20 Category $20;
Item='Whole Milk'; Category='Milk'; output;
Item='Low Fat Milk'; Category='Milk'; output;
Item='Skim Milk'; Category='Milk'; output;
Item='Swiss Cheese'; Category='Cheese'; output;
Item='Cheddar Cheese'; Category='Cheese'; output;
Item='Waffles'; Category='Breakfast'; output;
Item='Pancakes'; Category='Breakfast'; output;
Item='Frozen Pizza'; Category='Dinner'; output;
Item='Ice Cream'; Category='Dessert'; output;
Item='Cake'; Category='Dessert'; output;
run;

data casuser.ParentLevel;
length Item $20 Category $20;
Item='Milk'; Category='Dairy Products'; output;
Item='Cheese'; Category='Dairy Products'; output;
Item='Breakfast'; Category='Frozen Foods'; output;
Item='Dinner'; Category='Frozen Foods'; output;
Item='Dessert'; Category='Frozen Foods'; output;
run;

Note: When one item has multiple parents, only the last parent is considered for generating rules. The last parent item might vary if the data are distributed.

The following statements run the association analysis on the casuser.ProdSales data table:

proc mbanalysis data=casuser.ProdSales items=2 support=1;
output out=casuser.out outfreq=casuser.outfreq outrule=casuser.outrule;
customer Customer;
target Item;
hierarchy data = casuser.BaseLevel casuser.ParentLevel;
savestate rstore=casuser.StateArm;
run;

The PROC MBANALYSIS statement invokes the procedure. The DATA= option names the input data table to be analyzed. The ITEMS= option specifies the number of items in a rule. The SUPPORT= option specifies the minimum level of support for a rule, where number must be an integer greater than or equal to 1. The OUTPUT statement specifies output tables to contain results of the MBANALYSIS procedure. In our running example the OUT= table contains information about frequent item sets with their transaction counts and support. The OUTFREQ= table contains information about the unique frequent items with their transaction counts and support and the OUTRULE= table contains information about the rules. The CUSTOMER statement requests that the Customer variable be used to group the target variable into baskets. The TARGET statement requests that the Item variable (nominal variable) be used as the target variable. The HIERARCHY statement specifies the item’s hierarchies, and each level of hierarchy is specified in a separate data table. The RSTORE= option in the SAVESTATE statement specifies the binary scoring file to be used for scoring. You can use the analytic store in the ASTORE procedure to score new data. Now run the entire code. The successful execution of code produces output tables namely OUT, OUTFREQ, OUTRULE and the ASTORE binary file StateArm. A snapshot of the OUTRULE table (sorted on Lift values) is displayed below:

Let us examine the first association rule (RULEID 147) and interpret it. The rule Pancakes --> Cheddar Cheese has Support= 33.33%,

Confidence = 100% and Lift = 3. This rule says that 33.33% of customers buy Pancakes and Cheddar Cheese together and those who buy Pancakes also buy Cheddar Cheese 100% of the time. The lift value of three implies that you are thrice as likely to buy Cheddar Cheese if you bought Pancakes than if you did not buy Pancakes. Also, note a column displaying (SUPLIFT) Support Lift statistic. Support lift is computed as the deviation from the actual support to the estimated support of LHS and RHS items in the rule. It is calculated only when the hierarchy data is specified. A large value of support lift suggests that the rule is more significant and if it is close to zero, it means that the rule carries no extra information and can be replaced by a rule containing the parent items. Similarly, other rules can be interpreted using the displayed statistics.

Now, let us create rules involving more than two items (say 4 items). To do this, I modify options in PROC MBANALYSIS statement. The modified code is reproduced below:

proc mbanalysis data=casuser.ProdSales items=4 support=1 separator="\"
nLHS_range=(2,3) nrhs_range=(2,3);
output out=casuser.out4 outfreq=casuser.outfreq4 outrule=casuser.outrule4;
customer Customer;
target Item;
hierarchy data = casuser.BaseLevel casuser.ParentLevel;
savestate rstore=casuser.StateArm4;
run;

In PROC statement, I changed ITEMS= 4 to allow up to 4 items in a rule. The SEPARATOR= option specifies separator character in the antecedent (left-hand side) or consequent (right-hand side) of a rule. The NLHS_RANGE= option and the NRHS_RANGE= option specify range of number of items in the left-hand side rule and right-hand side of a rule respectively. Now, I run the above code and examine the OUTRULE4 table (sorted on Lift values).

The interpretation of rules involving 4 items can be done in the same way as for the rules with 2 items.

Additional comments

The MBANALYSIS procedure can generate a large number of rules, but there is no certainty that any of them will be valuable. There is no automated method to distinguish the useful results from the obvious or trivial ones. The rules need to be reviewed and assessed by someone who can recognize a meaningful and actionable rule when they encounter it. The number and types of rules generated can be adjusted by setting limits on the options in the PROC MBANALYSIS statement. For instance, increasing the minimum support threshold for rules may reduce the number of rules, but it could also exclude some interesting ones. This is because a rule with low support may still have relatively high confidence and lift.

References:

SAS Documentation

Find more articles from SAS Global Enablement and Learning here.