vsharipriya
Fluorite | Level 6

I have data on customer purchase history, and I want to score each customer based on their attributes. To do this, I want to assign weights to the variables (e.g., 10% to v1, 20% to v2, 50% to v3) and then sum the weighted values. The resulting score should tell me how good a customer is. For instance, a score above 500 would mean a good/loyal customer from whom we can expect good sales next time. The threshold can be decided once we have the scores; for now, I want to know how to approach this problem.

 

I decided to run PCA, from which I can get the PCA scores and hence use coefficients as weights. 

For example, if I select the first principal component and take the coefficients,

y1=0.5v1+0.8v2-0.2v3 , 

replacing v1, v2 , v3 with the values of the attributes, I can get a score of each observation. 

 

I am not sure if this is a clever approach. Is there a better way to optimize the weights and calculate the score of each customer? Any thoughts are appreciated.

 

Accepted Solutions
DougWielenga
SAS Employee

There are many approaches as you can see...

 

As @Reeza shared, the VARCLUS procedure can help reduce the number of interval variables and can generate cluster scores from mutually exclusive subsets of your continuous variables.

 

As @Ksharp shared, the PRINQUAL procedure provides approaches to including both interval and categorical data in a clustering solution.  

 

The approach you choose must take into account how concerned you are with assigning meaning to the scores. In general, it is easy to come up with formulas that create scores for individuals from categorical and/or interval inputs. It is harder to determine what those scores actually mean for your business problem, so you must decide how to weight different factors based on your business objectives.

 

Principal component analysis creates a new set of dimensions for the data with two nice properties: the dimensions are uncorrelated with each other, and each subsequent dimension explains less of the variance than the previous one. However, every single variable is included in the equation for each principal component (PC). The equation for each PC is

 

         PC(k) = X*Beta(k) 

 

where 

 

          PC(k) = vector of scores for Principal Component k

                X =  data matrix containing values of each variable for each observation

        Beta(k) = vector of coefficients for principal component k

 

rather than 

 

         PC(k) = X*Beta(k) + error 

 

since there is no error term, just a restructuring of the space into different dimensions.

Variable Clustering is somewhat simpler in that it creates mutually exclusive groups of variables, so each Variable Cluster score depends only on the variables in that cluster rather than on every variable in the data, as PCs do. Variable Clustering creates principal components as part of its algorithm to produce the Variable Cluster scores but limits them to the variables in the cluster. As a result, Variable Cluster scores cannot perfectly represent all of the original data, but they allow the user to choose one or two variables from each cluster (e.g. the variable most highly correlated with the cluster score and the variable least correlated with it) to use as surrogates for all of the variables in the cluster. Using this simpler set of variables allows easier interpretation and avoids redundant inputs.
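
To make the PC(k) = X*Beta(k) equation above concrete, here is a minimal sketch of computing component scores as a plain weighted sum with no error term. It is written in Python rather than SAS purely for illustration. The function name `pc_scores`, the sample data, and the reuse of the 0.5, 0.8, -0.2 coefficients from the original question are all assumptions; none of this is output from a real PCA run.

```python
def pc_scores(X, beta):
    """Return one score per observation: the dot product of each row of X with beta."""
    if any(len(row) != len(beta) for row in X):
        raise ValueError("every row of X must have one value per coefficient")
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Hypothetical data: three customers measured on v1, v2, v3 (assumed already standardized).
X = [
    [1.0, 0.5, -1.0],
    [0.0, 0.0, 0.0],
    [-1.0, 2.0, 1.0],
]
beta_1 = [0.5, 0.8, -0.2]  # illustrative coefficients taken from the question

scores = pc_scores(X, beta_1)
for customer, score in enumerate(scores, start=1):
    print(f"customer {customer}: PC1 score = {score:.2f}")
```

Every variable contributes to every score here, which is the property that Variable Clustering relaxes by restricting each score to the variables in one cluster.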

 

Rather than cramming categorical variables into this mix, it might be better to use Variable Clustering to reduce the interval inputs and retain the categorical inputs assuming the categorical inputs are not poorly distributed (e.g. large numbers of levels, large proportion of levels with trivial amounts of data, etc...).  The fact that you can include these in a reduction method like PRINQUAL does not mean that you should!  

 

Regarding creating a score to rank customers, the fact that your approach is unsupervised means that the results must be interpreted from a business standpoint alone as there is no 'correct' score.  I could create a Farmer Score using 

 

    Farmer Score = 2 * (number of chickens) + 6 * (number of cows) 

 

but what that score means is a business interpretation rather than an analytical interpretation. 
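
A small sketch may help show why the interpretation has to come from the business side. It scores two made-up farmers with the Farmer Score weights above and with a second, equally arbitrary set of weights, and the ranking flips. The farmer data, the alternative weights, and the helper names `weighted_score` and `rank` are hypothetical, and Python is used only for illustration.

```python
def weighted_score(values, weights):
    """Sum of each attribute value times its weight."""
    return sum(values[name] * weight for name, weight in weights.items())


# Hypothetical farmers; the counts are made up for illustration.
farmers = {
    "farmer_a": {"chickens": 10, "cows": 1},
    "farmer_b": {"chickens": 2, "cows": 3},
}

original_weights = {"chickens": 2, "cows": 6}      # weights from the Farmer Score above
alternative_weights = {"chickens": 1, "cows": 10}  # an equally arbitrary alternative


def rank(weights):
    """Return farmer names ordered from highest to lowest score."""
    return sorted(
        farmers,
        key=lambda name: weighted_score(farmers[name], weights),
        reverse=True,
    )


ranking_original = rank(original_weights)
ranking_alternative = rank(alternative_weights)
print(ranking_original)
print(ranking_alternative)
```

Neither ranking is analytically "correct"; which one is useful depends on what the business cares about.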

 

To be fair, assessing customer value is always difficult.  Suppose an airline has two customers on a flight:

 

Customer A:

* seated in first class

* has platinum status with the airline awards

* joined the admirals club

* has high income 

* wearing a suit

 

Customer B:

* flying in coach

* has not joined the awards program

* not an admirals club member

* appears to be young, possibly just out of college

* wearing a t-shirt and jeans

 

If you had to ask one of the two customers to get off the plane, which customer do you choose?   One thing to ask yourself is which customer is more valuable?   The answer to the latter question depends on what time frame you are considering. 

 

If your time frame is...

... past flights - clearly Customer A is more valuable since they have achieved a top status level due to sufficient flights 

... the current flight - you might choose customer A but who knows whether or not customer A is using free points for this flight which would associate more current revenue to Customer B

... future flights - this is even murkier since Customer A might be at the end of his career and/or moving to a non-traveling job while Customer B might be joining a consulting firm and will be sent on trips all over the globe during regular business days when traveling is most expensive

 

Of course, Customer A might cause a lot more of a stink since they have preferred status, but in many scenarios they still might not be the most valuable customer to the company's current and/or future business success.

 

Hope this helps!

 

Cordially,
Doug

11 REPLIES
Ksharp
Super User

That is a clever approach.

But PCA applies only to continuous variables.

You have also left out the second principal component, which may account for a large share of the variance in the data.

Maybe you could include the first two principal components, or three...

Suppose the first PC accounts for 60% of the variance:

y1=0.5v1+0.8v2-0.2v3

Suppose the second PC accounts for 40%:

y2=0.5v1+0.8v2-0.2v3

The final score might then be: Y=0.6*y1+0.4*y2 ?
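
One way to read the suggestion above is: compute each component's score, then weight it by the share of variance that component explains. Below is a minimal Python sketch of that idea, for illustration only. The 60%/40% shares come from the example above, while the second component's coefficients and the customer values are invented for this sketch (in a real PCA the second coefficient vector would differ from the first and be orthogonal to it). The helper names `dot` and `composite_score` are hypothetical.

```python
def dot(row, coefficients):
    """Dot product of one observation with one coefficient vector."""
    return sum(x * c for x, c in zip(row, coefficients))


def composite_score(row, components):
    """Weight each component score by its share of explained variance and add them up."""
    return sum(share * dot(row, coefficients) for share, coefficients in components)


components = [
    (0.6, [0.5, 0.8, -0.2]),  # first PC, coefficients from the thread
    (0.4, [0.8, -0.5, 0.0]),  # second PC, coefficients invented for this sketch
]

customer = [1.0, 2.0, 3.0]  # hypothetical standardized values of v1, v2, v3
y = composite_score(customer, components)
print(round(y, 2))
```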

 

Reeza
Super User

So this is an unsupervised learning problem? 

You have no data to calibrate your model with?

Ksharp
Super User

Or you could use a log-linear model.

Check the PROC CATMOD documentation:

Example 32.4: Log-Linear Model, Three Dependent Variables

 

Note: remove the non-significant variables before applying your model.

Reeza
Super User

Look at PROC VARCLUS.

Also, make sure to standardize the variables. Otherwise, variables measured on larger scales will dominate.
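
As a rough sketch of what standardizing means here, each variable can be rescaled to mean 0 and standard deviation 1 before any weights are applied, so that a variable measured in thousands does not swamp one measured in single digits. The example is in Python for illustration only; the variable names and values are made up, and `standardize` is a hypothetical helper.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1 (z-scores)."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - centre) / spread for v in values]


# Hypothetical raw inputs on very different scales.
annual_spend = [1200.0, 300.0, 4500.0]  # currency units
visits_per_month = [4.0, 1.0, 2.0]

z_spend = standardize(annual_spend)
z_visits = standardize(visits_per_month)
print([round(z, 2) for z in z_spend])
print([round(z, 2) for z in z_visits])
```

After this step both variables are on a comparable scale, so the weights (or PCA coefficients) reflect the variables themselves rather than their units.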

Ksharp
Super User

@Reeza ,

Very good point. That makes a lot of sense.

Or you could check the Poisson model (which can handle both categorical and continuous variables).

http://support.sas.com/kb/24/188.html 

vsharipriya
Fluorite | Level 6
Thanks, @Ksharp.
It looks like the Poisson model is a supervised model. I don't have any target variable in my data that is related to the other variables.

I want each observation to get a score based on the weights of the variables, exactly like in your first answer:

"Suppose for the first PC, which accounts for 60%

y1=0.5v1+0.8v2-0.2v3

Suppose for the second PC, which accounts for 40%

y2=0.5v1+0.8v2-0.2v3

the final score may be: Y=0.6*Y1+0.4*Y2"

But here, Y is the score for each observation, the X's are my variables, and the coefficients are the weights.

The X's are both categorical and continuous.
Ksharp
Super User

Have you taken a look at the PRINQUAL procedure?

vsharipriya
Fluorite | Level 6
@Reeza,
Is there any way we could use PROC VARCLUS for all types of variables?

I looked at the documentation, and it says it uses all the numeric variables by default.

My dataset has both categorical and continuous variables. Also, some of the categorical variables are coded as 1/0.
Ksharp
Super User

Or check this:

 

 

Overview: PRINQUAL Procedure
The PRINQUAL procedure performs principal component analysis (PCA) of qualitative, quantitative, or
mixed data. PROC PRINQUAL is based on the work of Kruskal and Shepard (1974); Young, Takane, and
