Re: Applied Analytics Using SAS Enterprise Miner (chapter 8 of course notes / Lesson 2 online)
1. Data structure for Market Basket Analysis (MBA) and Sequence Analysis: just want to make sure I clearly understand how the data should be prepared for those analyses and the differences between the two.
For MBA, each physical record represents an item and the ID represents a "Transaction". For instance, in the Bank dataset (page 8-54 of course notes), all entries with the same Account Number are considered part of the same "Transaction"; therefore, the total number of transactions in the dataset equals the number of distinct Account Number values (which is the number used as the denominator when calculating Support).
For Sequence Analysis, on the other hand, the ID generally represents a specific "Customer", and "Transactions" are identified by a sequence number derived from a time variable. In other words, items bought by a given Customer at the same time are grouped together to form a "visit/transaction", so Customer and Transaction are two different concepts in Sequence Analysis. In the specific example of the Bank dataset, for a given Account Number, each transaction is represented by a different value of "Order of Service Addition" (the example assumes only one product can be purchased in a single transaction). Finally, a "Sequence" is the set of transactions/visits belonging to the same Customer that are temporally related (i.e. they all happened within a certain timeframe, in a specific order).
Is the above correct?
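To make my understanding concrete, here is a small Python sketch of the two views of the same data. The rows, account IDs, and service codes are made up for illustration (they are not taken from the Bank dataset), and the helper name `support` is my own; this is only how I picture the structure, not how the Association node works internally.

```python
# Hypothetical rows in the layout described above:
# (account number, service, order of service addition)
rows = [
    ("A1", "CKING", 1),
    ("A1", "SVG", 2),
    ("A1", "ATM", 3),
    ("A2", "CKING", 1),
    ("A2", "ATM", 2),
    ("A3", "SVG", 1),
]

# MBA view: one basket ("Transaction") per account; the order is ignored.
baskets = {}
for account, service, _order in rows:
    baskets.setdefault(account, set()).add(service)

# Number of distinct IDs, used as the denominator for support.
n_transactions = len(baskets)


def support(items):
    """Share of baskets that contain every item in `items`."""
    hits = sum(1 for basket in baskets.values() if set(items) <= basket)
    return hits / n_transactions


# Sequence view: per customer, visits keyed by their sequence number.
sequences = {}
for account, service, order in sorted(rows, key=lambda r: (r[0], r[2])):
    sequences.setdefault(account, {}).setdefault(order, set()).add(service)

print(n_transactions)              # distinct accounts
print(support(["CKING", "ATM"]))   # accounts holding both services
print(sequences["A1"])             # ordered visits for one customer
```

In the MBA view the ID alone defines the basket, whereas in the sequence view the same ID defines a customer and the order variable separates the visits.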
2. Rare occurrences: in Market Basket Analysis, is it correct to say that if we were interested in "rare/unusual" combinations of items, then we would look at rules with a low Support and/or possibly a Lift below 1 (see page 8.57 of course notes)?
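For reference, this is how I understand the measures in question 2, using the standard textbook definitions (support of A and B together, confidence = support(A and B) / support(A), lift = confidence / support(B)). The baskets below are invented purely for illustration, and the function names are my own.

```python
# Invented baskets, one set of items per transaction.
baskets = [
    {"A", "B"},
    {"A"},
    {"B"},
    {"A", "C"},
    {"B", "C"},
]


def support(items):
    """Fraction of baskets containing all of `items`."""
    return sum(1 for b in baskets if set(items) <= b) / len(baskets)


def confidence(lhs, rhs):
    """Estimated P(rhs | lhs): support of both divided by support of lhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def lift(lhs, rhs):
    """Confidence of lhs => rhs relative to the baseline support of rhs."""
    return confidence(lhs, rhs) / support(rhs)


rule_support = support({"A", "B"})
rule_lift = lift({"A"}, {"B"})

# A low rule_support together with rule_lift < 1 would mark A and B as a
# combination that occurs less often than expected under independence.
print(rule_support, rule_lift)
```

With these invented baskets the rule A => B has both a low support and a lift below 1, which is the kind of rule I have in mind when I say "rare/unusual".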
3. Property "Support Percentage" of Association Node: the "Enterprise Miner 15.1: Reference Help" (page 410) states the following: "[...] The support percentage figure that you specify refers to the proportion of the largest single item frequency, and not the end support".
Would it be possible to clarify the meaning and practical implications of that sentence?
4. Structure of rules in Sequence Analysis: is it correct to say that rules derived in a Sequence Analysis can only contain single items in the identified chain; i.e. A => B => C and not A => (B and C) => D?
5. Calculation of Support in Sequence Analysis: would it be possible to clarify how support is calculated (see statement at the bottom of page 8-68 of course notes)?
First of all, in Sequence Analysis, is it correct to say that the left-hand side of an association rule must relate to a transaction/visit that precedes, from a temporal point of view, the item(s) on the right-hand side?
Secondly, for the calculation of Support, what is the "unit of count"? In other words, what are the numbers used in the denominator and numerator? Are they based on the concept of "sequence" or "individual transaction"?
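To show what I mean by "unit of count", here is a sketch of one possible reading, in which both numerator and denominator count customers (sequences) rather than individual visits. This is my assumption, not something stated in the course notes, and it is exactly the point I would like confirmed. The data and the names `occurs_in_order` and `sequence_support` are made up.

```python
# Invented data: for each customer, an ordered list of visits (sets of items).
sequences = {
    "A1": [{"CKING"}, {"SVG"}, {"ATM"}],
    "A2": [{"CKING"}, {"ATM"}],
    "A3": [{"ATM"}, {"CKING"}],
}


def occurs_in_order(visits, first, then):
    """True if `first` appears in a visit strictly before one with `then`."""
    for i, earlier in enumerate(visits):
        if first in earlier:
            if any(then in later for later in visits[i + 1:]):
                return True
    return False


def sequence_support(first, then):
    """Assumed definition: customers showing first => then / all customers."""
    hits = sum(
        1 for visits in sequences.values()
        if occurs_in_order(visits, first, then)
    )
    return hits / len(sequences)


print(sequence_support("CKING", "ATM"))  # order matters
print(sequence_support("ATM", "CKING"))  # reversed rule counted separately
```

Under this reading a customer is counted at most once per rule, however many visits match; if the node counts individual transactions instead, both the numerator and the denominator would be different.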
6. "Association Node Rules Selector": I am not sure I understand the purpose of the Rules Selector (accessed via the "Rules" property); why and how should it be used?