Solved: Re: Clarifications on Associations Rules

pvareschi · Posted 05-12-2020 01:41 AM

Re: Applied Analytics Using SAS Enterprise Miner (chapter 8 of course notes / Lesson 2 online)

1. Data structure for Market Basket Analysis (MBA) and Sequence Analysis: just want to make sure I clearly understand how the data should be prepared for those analyses and the differences between the two.

For MBA, each physical record represents an item and the ID represents a "Transaction". For instance, in the Bank dataset (page 8-54 of course notes), all entries with the same Account Number are considered to be part of the same "Transaction"; therefore, the total number of transactions in the dataset is equal to the number of distinct values in Account Number (which is the number used to calculate Support).
On the other hand, for Sequence Analysis, in general, the ID represents a specific "Customer" and "Transactions" are identified by the sequence number which is derived by using a time variable. In other words, items bought by a given Customer at the same time, are grouped together to form a "visit/transaction". Therefore, Customer and Transaction are two different concepts when it comes to Sequence Analysis. In the specific example of the Bank dataset, for a given Account Number, each transaction is represented by a different value in "Order of Service Addition" (in the example, it is assumed only 1 product can be purchased in a single transaction).
Finally, a "Sequence" is a set of transactions/visits all related to the same Customer, which are temporally related (i.e. they all happened within a certain timeframe in a specific order).

Is the above correct?

2. Rare occurrences: in Market Basket Analysis, is it correct to say that if we were interested in "rare/unusual" combinations of items, then we would look at rules with a low Support and/or possibly a Lift below 1 (see page 8.57 of course notes)?

3. Property "Support Percentage" of Association Node: the "Enterprise Miner 15.1: Reference Help" (at page 410) reports the following: "[...] The support percentage figure that you specify refers to the proportion of the largest single item frequency, and not the end support".

Would it be possible to clarify the meaning and practical implications of that sentence?

4. Structure of rules in Sequence Analysis: is it correct to say that rules derived in a Sequence Analysis can only contain single items in the identified chain; i.e. A => B => C and not A => (B and C) => D?

5. Calculation of Support in Sequence Analysis: would it be possible to clarify how support is calculated (see statement at the bottom of page 8-68 of course notes)?

First of all, in Sequence Analysis, is it correct to say that the left-hand side of association rules must be related to a transaction/visit which precedes, from a temporal point of view, the item(s) on the right-hand side?

Secondly, for the calculation of Support, what is the "unit of count"? In other words, what are the numbers used in the denominator and numerator? Are they based on the concept of "sequence" or "individual transaction"?

6. "Association Node Rules Selector": I am not sure I understand the purpose of the Rules Selection (accessed via property "Rules"); why and how should it be used?

gcjfernandez · Posted 05-13-2020 02:56 AM

1. Data structure for Market Basket Analysis (MBA) and Sequence Analysis: just want to make sure I clearly understand how the data should be prepared for those analyses and the differences between the two.

For MBA, each physical record represents an item and the ID represents a "Transaction". For instance, in the Bank dataset (page 8-54 of course notes), all entries with the same Account Number are considered to be part of the same "Transaction"; therefore, the total number of transactions in the dataset is equal to the number of distinct values in Account Number (which is the number used to calculate Support).
On the other hand, for Sequence Analysis, in general, the ID represents a specific "Customer" and "Transactions" are identified by the sequence number which is derived by using a time variable. In other words, items bought by a given Customer at the same time, are grouped together to form a "visit/transaction". Therefore, Customer and Transaction are two different concepts when it comes to Sequence Analysis. In the specific example of the Bank dataset, for a given Account Number, each transaction is represented by a different value in "Order of Service Addition" (in the example, it is assumed only 1 product can be purchased in a single transaction).
Finally, a "Sequence" is a set of transactions/visits all related to the same Customer, which are temporally related (i.e. they all happened within a certain timeframe in a specific order).

Is the above correct?

My Answer:

Please see the screenshot of the Metadata:

For market basket analysis:

The data role should be Transaction.

For sequence analysis:

The role of the variable Visit should be changed to Sequence.
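To make the two layouts concrete, here is a minimal Python sketch (not SAS Enterprise Miner code). The column names `account`, `visit`, and `service` and all values are hypothetical. It shows the same long table read two ways: as unordered baskets per ID for MBA, and as an ordered list of visits per ID for sequence analysis.

```python
from collections import defaultdict

# Hypothetical rows in the long "transaction" layout: one row per item.
# Columns: (account, visit, service). Names and values are illustrative only.
rows = [
    ("A1", 1, "CHECKING"),
    ("A1", 2, "SAVINGS"),
    ("A1", 3, "ATM"),
    ("A2", 1, "SAVINGS"),
    ("A2", 2, "CHECKING"),
    ("A3", 1, "CHECKING"),
]

def baskets(rows):
    """MBA view: group items by ID and ignore the order of the visits."""
    out = defaultdict(set)
    for account, _visit, service in rows:
        out[account].add(service)
    return dict(out)

def sequences(rows):
    """Sequence view: per ID, an ordered list of visits, each a set of items."""
    visits = defaultdict(lambda: defaultdict(set))
    for account, visit, service in rows:
        visits[account][visit].add(service)
    return {
        account: [by_visit[v] for v in sorted(by_visit)]
        for account, by_visit in visits.items()
    }

mba = baskets(rows)
seq = sequences(rows)
print(len(mba))    # number of distinct IDs
print(seq["A2"])   # visits for one account, in order
```

In the MBA view the sequence column is simply unused; in the sequence view it decides which items belong to which visit and in what order.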

2. Rare occurrences: in Market Basket Analysis, is it correct to say that if we were interested in "rare/unusual" combinations of items, then we would look at rules with a low Support and/or possibly a Lift below 1 (see page 8.57 of course notes)?

My Answer:

Yes, you are correct if you are after rare combinations of items in MBA.

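As an illustration of what "low Support" and "Lift below 1" look like numerically, here is a small Python sketch using the standard definitions of support, confidence, and lift. The baskets and item names are made up, and this is not how Enterprise Miner computes its output internally.

```python
# Hypothetical baskets: transaction ID -> set of items (illustrative data only).
baskets = {
    "T1": {"A", "B"},
    "T2": {"A", "C"},
    "T3": {"A", "B", "C"},
    "T4": {"B"},
    "T5": {"A"},
}

def support(items, baskets):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in baskets.values() if items <= basket)
    return hits / len(baskets)

def rule_metrics(lhs, rhs, baskets):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    s = support(set(lhs) | set(rhs), baskets)
    confidence = s / support(lhs, baskets)
    lift = confidence / support(rhs, baskets)
    return s, confidence, lift

s, confidence, lift = rule_metrics({"A"}, {"B"}, baskets)
print(s, confidence, lift)
```

Here A and B appear together in 2 of 5 baskets, but B appears in 3 of 5 overall, so knowing that A is present makes B less likely than its baseline and the lift falls below 1.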
3. Property "Support Percentage" of Association Node: the "Enterprise Miner 15.1: Reference Help" (at page 410) reports the following: "[...] The support percentage figure that you specify refers to the proportion of the largest single item frequency, and not the end support".

Would it be possible to clarify the meaning and practical implications of that sentence?

My Answer:

This property sets the minimum support percentage that an association rule must reach in order to be considered useful. The default is 5%, but you can modify it based on your needs.
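One literal reading of the quoted help text is that the percentage is applied to the count of the most frequent single item rather than to the total number of transactions. The Python sketch below only illustrates that reading with made-up data; the function name and the interpretation itself are assumptions that this thread does not confirm.

```python
from collections import Counter

def min_support_count(item_rows, support_pct):
    """Threshold under the assumed reading: pct of the largest item frequency.

    item_rows: iterable of (transaction_id, item) pairs.
    Returns (threshold_count, largest_item_frequency).
    """
    distinct = set(item_rows)  # count each item once per transaction
    freq = Counter(item for _tid, item in distinct)
    largest = max(freq.values())
    return largest * support_pct / 100.0, largest

# Hypothetical data: 1000 transactions; item "A" is in 600 of them.
rows = (
    [(i, "A") for i in range(600)]
    + [(i, "B") for i in range(100)]
    + [(i, "C") for i in range(600, 1000)]
)
n_transactions = len({tid for tid, _item in rows})

threshold, largest = min_support_count(rows, support_pct=5)
print(threshold)                 # 5% of the largest item frequency
print(0.05 * n_transactions)     # 5% of all transactions, for comparison
```

Under this reading, the same 5% setting translates into a smaller minimum count than 5% of all transactions, so more rules would pass the filter than the percentage alone suggests.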

4. Structure of rules in Sequence Analysis: is it correct to say that rules derived in a Sequence Analysis can only contain single items in the identified chain; i.e. A => B => C and not A => (B and C) => D?

My Answer:

Yes, you are correct. Also, a rule has a left side and a right side. The left side can contain two items that appeared sequentially, but the right side can contain only a single item. Refer to the Rule table in the Sequence Analysis output.

5. Calculation of Support in Sequence Analysis: would it be possible to clarify how support is calculated (see statement at the bottom of page 8-68 of course notes)?

First of all, in Sequence Analysis, is it correct to say that the left hand side of association rules must be related to a transaction/visit which precedes, from a temporal point of view, the item(s) on the right hand side? Yes

Secondly, for the calculation of Support, what is the "unit of count"? In other words, what are the numbers used in the denominator and numerator? Are they based on the concept of "sequence" or "individual transaction"? Frequency Count

My Answer: (From the course notes)

The percent support is the transaction count divided by the total number of customers, which
would be the maximum transaction count. The percent confidence is the transaction count
divided by the transaction count for the left side of the sequence.
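The two ratios quoted from the course notes can be sketched in Python as follows. This assumes that each customer is counted at most once per sequence, so that the total number of customers is the maximum possible count. The data, item names, and function names are hypothetical.

```python
# Hypothetical data: customer ID -> ordered list of visits (sets of items).
sequences = {
    "C1": [{"CHECKING"}, {"SAVINGS"}, {"ATM"}],
    "C2": [{"SAVINGS"}, {"CHECKING"}],
    "C3": [{"CHECKING"}, {"ATM"}],
    "C4": [{"CHECKING"}, {"SAVINGS"}],
    "C5": [{"SAVINGS"}],
}

def contains_in_order(visits, chain):
    """True if the items in `chain` occur in successively later visits."""
    pos = 0
    for visit in visits:
        # Each visit can match at most one link of the chain.
        if pos < len(chain) and chain[pos] in visit:
            pos += 1
    return pos == len(chain)

def sequence_metrics(lhs, rhs, sequences):
    """Return (support, confidence) for the sequence rule lhs => rhs.

    Support    = customers showing the full chain / total customers.
    Confidence = customers showing the full chain / customers showing lhs.
    """
    chain = list(lhs) + [rhs]
    n_rule = sum(contains_in_order(v, chain) for v in sequences.values())
    n_lhs = sum(contains_in_order(v, list(lhs)) for v in sequences.values())
    return n_rule / len(sequences), n_rule / n_lhs

seq_support, seq_confidence = sequence_metrics(["CHECKING"], "SAVINGS", sequences)
print(seq_support, seq_confidence)
```

Customer C2 holds both services but added SAVINGS before CHECKING, so it counts toward the left side only and not toward the rule.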

6. "Association Node Rules Selector": I am not sure I understand the purpose of the Rules Selection (accessed via property "Rules"); why and how should be used?

My Answer:

I am also not sure of the purpose of rule selection. There is another node (Link Analysis) available in SAS EM which also performs recommendation analysis. Please check the EM help.
