Re: Applied Analytics Using SAS Enterprise Miner (chapter 8 of course notes / Lesson 2 online)
1. Data structure for Market Basket Analysis (MBA) and Sequence Analysis: just want to make sure I clearly understand how the data should be prepared for those analyses and the differences between the two.
For MBA, each physical record represents an item and the ID represents a "Transaction". For instance, in the Bank dataset (page 8-54 of course notes), all entries with the same Account Number are considered to be part of the same "Transaction"; therefore, the total number of transactions in the dataset is equal to the number of distinct values in Account Number (which is the number used to calculate Support).
On the other hand, for Sequence Analysis, in general, the ID represents a specific "Customer" and "Transactions" are identified by the sequence number which is derived by using a time variable. In other words, items bought by a given Customer at the same time, are grouped together to form a "visit/transaction". Therefore, Customer and Transaction are two different concepts when it comes to Sequence Analysis. In the specific example of the Bank dataset, for a given Account Number, each transaction is represented by a different value in "Order of Service Addition" (in the example, it is assumed only 1 product can be purchased in a single transaction).
Finally, a "Sequence" is a set of transactions/visits all related to the same Customer, which are temporally related (i.e. they all happened within a certain timeframe in a specific order).
Is the above correct?
2. Rare occurrences: in Market Basket Analysis, is it correct to say that if we were interested in "rare/unusual" combinations of items, then we would look at rules with a low Support and/or possibly a Lift below 1 (see page 8.57 of course notes)?
3. Property "Support Percentage" of Association Node: the "Enterprise Miner 15.1: Reference Help" (at page 410) reports the following: "[...] The support percentage figure that you specify refers to the proportion of the largest single item frequency, and not the end support".
Would it be possible to clarify the meaning and practical implications of that sentence?
4. Structure of rules in Sequence Analysis: is it correct to say that rules derived in a Sequence Analysis can only contain single items in the identified chain; i.e. A => B => C and not A => (B and C) => D?
5. Calculation of Support in Sequence Analysis: would it be possible to clarify how support is calculated (see statement at the bottom of page 8-68 of course notes)?
First of all, in Sequence Analysis, is it correct to say that the left hand side of association rules must be related to a transaction/visit which precedes, from a temporal point of view, the item(s) on the right hand side?
Secondly, for the calculation of Support, what is the "unit of count"? In other words, what are the numbers used in the denominator and numerator? Are they based on the concept of "sequence" or "individual transaction"?
6. "Association Node Rules Selector": I am not sure I understand the purpose of the Rules Selection (accessed via property "Rules"); why and how should it be used?
1. Data structure for Market Basket Analysis (MBA) and Sequence Analysis: just want to make sure I clearly understand how the data should be prepared for those analyses and the differences between the two.
For MBA, each physical record represents an item and the ID represents a "Transaction". For instance, in the Bank dataset (page 8-54 of course notes), all entries with the same Account Number are considered to be part of the same "Transaction"; therefore, the total number of transactions in the dataset is equal to the number of distinct values in Account Number (which is the number used to calculate Support).
On the other hand, for Sequence Analysis, in general, the ID represents a specific "Customer" and "Transactions" are identified by the sequence number which is derived by using a time variable. In other words, items bought by a given Customer at the same time, are grouped together to form a "visit/transaction". Therefore, Customer and Transaction are two different concepts when it comes to Sequence Analysis. In the specific example of the Bank dataset, for a given Account Number, each transaction is represented by a different value in "Order of Service Addition" (in the example, it is assumed only 1 product can be purchased in a single transaction).
Finally, a "Sequence" is a set of transactions/visits all related to the same Customer, which are temporally related (i.e. they all happened within a certain timeframe in a specific order).
Is the above correct?
My Answer:
Please see the screenshot of the Metadata:
For market basket analysis, the role of the data set should be Transaction.
For sequence analysis, the role of the variable Visit is changed to Sequence.
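To make the difference between the two layouts concrete, here is a minimal Python sketch (not Enterprise Miner code). The account IDs, visit numbers, and service names are made up for illustration, and it assumes one service per visit, as in the Bank example. The same long table is read as unordered baskets for MBA and as ordered sequences for Sequence Analysis.

```python
from collections import defaultdict

# Hypothetical rows in the long layout: (account, visit_order, service).
# One physical record per item, as described in the question.
rows = [
    ("A1", 1, "checking"),
    ("A1", 2, "savings"),
    ("A1", 3, "atm"),
    ("A2", 1, "savings"),
    ("A2", 2, "checking"),
    ("A3", 1, "checking"),
]

# Market basket view: one unordered basket per ID; the visit order is ignored.
baskets = defaultdict(set)
for account, _visit, service in rows:
    baskets[account].add(service)

# Sequence view: same ID, but the items are ordered by the sequence variable.
sequences = defaultdict(list)
for account, _visit, service in sorted(rows, key=lambda r: (r[0], r[1])):
    sequences[account].append(service)

# Number of distinct IDs, which is the denominator used for support.
n_ids = len(baskets)
```

In this toy data, A1 and A2 hold both checking and savings, so they look alike in the basket view, but they acquired the two services in opposite orders, which only the sequence view keeps.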
2. Rare occurrences: in Market Basket Analysis, is it correct to say that if we were interested in "rare/unusual" combinations of items, then we would look at rules with a low Support and/or possibly a Lift below 1 (see page 8.57 of course notes)?
My Answer:
Yes, you are correct, if you are after rare combinations of items in MBA.
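As a rough illustration of what "low Support" and "Lift below 1" look like numerically, the sketch below computes the standard support, confidence, and lift for a rule on a made-up set of baskets. The item names and the helper functions `support` and `rule_stats` are hypothetical; this is not how the Association node is implemented.

```python
# Hypothetical baskets: one set of items per transaction ID.
baskets = [
    {"A", "B"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A"},
    {"B"},
    {"A", "B", "C"},
    {"C"},
]

def support(items, baskets):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def rule_stats(lhs, rhs, baskets):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    rule_support = support(set(lhs) | set(rhs), baskets)
    confidence = rule_support / support(lhs, baskets)
    lift = confidence / support(rhs, baskets)
    return rule_support, confidence, lift

# A => C: C appears in half of all baskets, but in only 2 of the 5
# baskets that contain A, so the lift comes out below 1.
s, c, l = rule_stats({"A"}, {"C"}, baskets)
```

A lift below 1 here means that C shows up less often among baskets with A than it does overall, which is the kind of "unusual" pairing the question refers to.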
3. Property "Support Percentage" of Association Node: the "Enterprise Miner 15.1: Reference Help" (at page 410) reports the following: "[...] The support percentage figure that you specify refers to the proportion of the largest single item frequency, and not the end support".
Would it be possible to clarify the meaning and practical implications of that sentence?
My Answer:
This property sets the minimum support % that an association rule must reach to be considered useful. The default is 5%, but you can modify it based on your needs.
4. Structure of rules in Sequence Analysis: is it correct to say that rules derived in a Sequence Analysis can only contain single items in the identified chain; i.e. A => B => C and not A => (B and C) => D?
My Answer:
Yes, you are correct. Also, a rule has a left side and a right side. The left side can contain two items that appeared sequentially, but the right side can contain only a single item. Refer to the Rule table in the Sequence analysis output.
5. Calculation of Support in Sequence Analysis: would it be possible to clarify how support is calculated (see statement at the bottom of page 8-68 of course notes)?
First of all, in Sequence Analysis, is it correct to say that the left hand side of association rules must be related to a transaction/visit which precedes, from a temporal point of view, the item(s) on the right hand side?
My Answer: Yes
Secondly, for the calculation of Support, what is the "unit of count"? In other words, what are the numbers used in the denominator and numerator? Are they based on the concept of "sequence" or "individual transaction"?
My Answer: Frequency count
My Answer: (From the course notes)
The percent support is the transaction count divided by the total number of customers, which would be the maximum transaction count. The percent confidence is the transaction count divided by the transaction count for the left side of the sequence.
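The course-notes definition can be sketched in Python as follows. This is only an illustration on made-up customers and items: it counts each customer at most once per rule and allows gaps between the matched items, which is an assumption (Enterprise Miner may also apply time-window constraints that are not modelled here). The helpers `contains_in_order` and `sequence_stats` are hypothetical names.

```python
# Hypothetical ordered item lists, one per customer ID.
sequences = {
    "cust1": ["A", "B", "C"],
    "cust2": ["A", "C"],
    "cust3": ["B", "A"],
    "cust4": ["A", "B"],
}

def contains_in_order(seq, pattern):
    """True if the pattern items occur in seq in this order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def sequence_stats(lhs, rhs, sequences):
    """Return (support, confidence) for the sequence rule lhs => rhs.

    support    = customers showing lhs followed by rhs / all customers
    confidence = customers showing lhs followed by rhs / customers showing lhs
    """
    n_customers = len(sequences)
    full = sum(contains_in_order(s, lhs + rhs) for s in sequences.values())
    left = sum(contains_in_order(s, lhs) for s in sequences.values())
    confidence = full / left if left else 0.0
    return full / n_customers, confidence

# A => B: cust1 and cust4 have A followed by B; cust3 has them in the
# wrong order, so it counts for the left side but not for the rule.
support_ab, confidence_ab = sequence_stats(["A"], ["B"], sequences)
```

In this sketch the denominator for support is the number of customers (distinct IDs), not the number of individual visits, which matches the "total number of customers" wording in the course notes.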
6. "Association Node Rules Selector": I am not sure I understand the purpose of the Rules Selection (accessed via property "Rules"); why and how should it be used?
My Answer:
I am also not sure of the purpose of the rule selection. There is another node (Link Analysis) available in SAS EM which also performs recommendation analysis. Please check the EM Help.