I have data on customer purchase history and want to score each customer based on their attributes. I want to calculate the score by assigning weights to the variables (e.g., 10% to v1, 20% to v2, 50% to v3, etc.) and then summing the weighted values. The resulting score should tell me how good a customer is. For instance, a score above 500 would mean they are good/loyal customers and we can expect good sales from them next time. While the threshold can be decided once we have the scores, I want to know how to approach this problem.
I decided to run PCA, from which I can get the PCA scores and hence use coefficients as weights.
For example, if I select the first principal component and take the coefficients,
y1 = 0.5*v1 + 0.8*v2 - 0.2*v3,
replacing v1, v2 , v3 with the values of the attributes, I can get a score of each observation.
I am not sure if this is a clever approach. Is there a better way to optimize the weights and calculate the score of each customer? Any thoughts are appreciated.
There are many approaches as you can see...
As @Reeza shared, the VARCLUS procedure can help reduce the number of interval variables and can generate cluster scores from mutually exclusive subsets of your continuous variables.
As @Ksharp shared, the PRINQUAL procedure provides approaches to including both interval and categorical data in a clustering solution.
The approach you choose must take into account how concerned you are with assigning some meaning to the scores. In general, it is easy to come up with formulas to create scores for individuals based on categorical and/or interval inputs. It is harder to determine what those scores actually mean in the case of your business problem. In general, you must determine how to weight different factors based on your business objectives.
Principal Components creates a new set of dimensions for the data with two nice properties: the dimensions are uncorrelated with each other, and each subsequent dimension explains less of the variance than the previous one. However, every single variable is included in the equation for each principal component (or PC). The equation for each PC is
PC(k) = X*Beta(k)
where
PC(k) = vector of scores for Principal Component k
X = data matrix containing values of each variable for each observation
Beta(k) = vector of coefficients for principal component k
rather than
PC(k) = X*Beta(k) + error
since there is no error term, just a restructuring of the space into different dimensions.
Variable Clustering is somewhat simpler in that it creates mutually exclusive groups of variables, so the Variable Cluster score depends only on the variables in that cluster rather than on every variable in the data, as PCs do. Variable Clustering creates Principal Components as part of its algorithm to create Variable Cluster scores but limits the variables to those in the cluster. As a result, Variable Cluster scores cannot perfectly represent all of the original data, but this allows the user to choose one or two variables from each cluster (e.g. the variable most highly correlated with the cluster score and the variable least correlated with it) to use as surrogates for all of the variables in the cluster. Using the simpler set of variables allows easier interpretation and avoids redundancy of inputs.
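As a rough illustration of the PC(k) = X*Beta(k) relationship described above, here is a minimal Python/NumPy sketch (not SAS, and not the output of any SAS procedure). The data values are made up, and the names `X`, `Z`, `Beta`, and `PC` are just labels chosen for this example. It standardizes the inputs, takes the eigenvectors of the correlation matrix as the coefficient vectors, and computes the scores.

```python
import numpy as np

# Hypothetical data: 6 customers x 3 attributes (v1, v2, v3); values are made up.
X = np.array([
    [12.0, 200.0, 3.0],
    [ 5.0, 150.0, 1.0],
    [20.0, 480.0, 7.0],
    [ 8.0, 120.0, 2.0],
    [15.0, 390.0, 5.0],
    [ 3.0,  90.0, 1.0],
])

# Standardize each column so no variable dominates because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix gives the coefficient vectors.
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)

# eigh returns eigenvalues in ascending order; reorder to descending.
order = np.argsort(eigvals)[::-1]
eigvals, Beta = eigvals[order], eigvecs[:, order]

# PC(k) = X * Beta(k): every variable contributes to every component,
# and there is no error term.
PC = Z @ Beta

# Share of total variance explained by each component.
explained = eigvals / eigvals.sum()

print("coefficients for PC1:", Beta[:, 0])
print("variance explained:", explained)
```

The covariance matrix of `PC` should come out diagonal, which is the "uncorrelated dimensions" property, and `explained` should be non-increasing. Note that the sign of each coefficient vector is arbitrary, so a score based on a single PC may need to be flipped to point in the direction you consider "good".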
Rather than cramming categorical variables into this mix, it might be better to use Variable Clustering to reduce the interval inputs and retain the categorical inputs, assuming the categorical inputs are not poorly distributed (e.g. large numbers of levels, or a large proportion of levels with trivial amounts of data). The fact that you can include these in a reduction method like PRINQUAL does not mean that you should!
Regarding creating a score to rank customers, the fact that your approach is unsupervised means that the results must be interpreted from a business standpoint alone as there is no 'correct' score. I could create a Farmer Score using
Farmer Score = 2 * (number of chickens) + 6 * (number of cows)
but what that score means is a business interpretation rather than an analytical interpretation.
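To make the point concrete, here is a small Python sketch of the Farmer Score. The farmers, their animal counts, and the second set of weights are all hypothetical and exist only for illustration. It suggests that the ranking is driven by whatever weights you pick, which is why the interpretation has to come from the business side.

```python
def weighted_score(counts, weights):
    """Sum of each input multiplied by its weight."""
    return sum(c * w for c, w in zip(counts, weights))

# Hypothetical farmers: (number of chickens, number of cows).
farmers = {
    "farmer_a": (10, 2),
    "farmer_b": (0, 5),
    "farmer_c": (30, 0),
}

def rank(weights):
    """Return farmer names ordered from highest to lowest score."""
    scores = {name: weighted_score(counts, weights) for name, counts in farmers.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Weights from the Farmer Score above: 2 per chicken, 6 per cow.
ranking_original, scores_original = rank((2, 6))

# A different, equally arbitrary choice of weights (an assumption for illustration).
ranking_alternative, scores_alternative = rank((1, 12))

print(ranking_original, scores_original)
print(ranking_alternative, scores_alternative)
```

With the original weights the chicken-heavy farmer ranks first; with the alternative weights the cow-heavy farmer does. Neither ranking is "correct" analytically.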
To be fair, assessing customer value is always difficult. Suppose an airline has two customers on a flight:
Customer A:
* seated in first class
* has platinum status with the airline awards
* joined the admirals club
* has high income
* wearing a suit
Customer B:
* flying in coach
* has not joined the awards program
* not an admirals club member
* appears to be young, possibly just out of college
* wearing a t-shirt and jeans
If you had to ask one of the two customers to get off the plane, which customer do you choose? One thing to ask yourself is which customer is more valuable? The answer to the latter question depends on what time frame you are considering.
If your time frame is...
... past flights - clearly Customer A is more valuable since they have achieved a top status level due to sufficient flights
... the current flight - you might choose customer A but who knows whether or not customer A is using free points for this flight which would associate more current revenue to Customer B
... future flights - this is even murkier since Customer A might be at the end of his career and/or moving to a non-traveling job while Customer B might be joining a consulting firm and will be sent on trips all over the globe during regular business days when traveling is most expensive
Of course, Customer A might cause a lot more stink since they have preferred status, but in many scenarios they still might not be the most valuable to the company's current and/or future business success.
Hope this helps!
Cordially,
Doug
That is a clever approach.
But PCA applies only to continuous variables.
You also left out the second principal component, which may account for a large share of the variance in the data.
Maybe you could include the first two principal components, or three...
Suppose the first PC explains 60% of the variance:
y1 = 0.5*v1 + 0.8*v2 - 0.2*v3,
and suppose the second PC explains 40% of the variance:
y2 = b1*v1 + b2*v2 + b3*v3 (a different set of coefficients from the first PC),
then the final score might be: Y = 0.6*y1 + 0.4*y2?
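A minimal Python sketch of the weighted combination suggested above. The first coefficient vector and the 60%/40% shares come from the example; the second coefficient vector and the customer's attribute values are hypothetical, with the second vector chosen to be orthogonal to the first, as the coefficient vectors of real principal components would be.

```python
# Coefficients of the first PC, taken from the example above.
beta1 = (0.5, 0.8, -0.2)
# Hypothetical coefficients for the second PC, chosen orthogonal to beta1.
beta2 = (0.8, -0.5, 0.0)

# Shares of variance explained, as supposed in the example.
share1, share2 = 0.6, 0.4

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical standardized attribute values (v1, v2, v3) for one customer.
customer = (1.0, 0.5, -1.0)

y1 = dot(beta1, customer)        # score on the first PC
y2 = dot(beta2, customer)        # score on the second PC
Y = share1 * y1 + share2 * y2    # combined score weighted by variance explained

print(y1, y2, Y)
```

One caveat with this kind of combination: the sign of each principal component is arbitrary, so the combined score depends on which direction each component happens to point and may need a sign flip before it ranks customers sensibly.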
So this is an unsupervised learning problem?
You have no data to calibrate your model with?
Or you could use a log-linear model.
Check the documentation of PROC CATMOD
Example 32.4: Log-Linear Model, Three Dependent Variables
Note: remove the non-significant variables before applying your model.
Look at PROC VARCLUS.
Also, make sure to standardize the variables. Otherwise variables measured on larger scales will dominate the score.
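A quick Python sketch of why standardizing matters, using made-up numbers. The variable names `annual_spend` and `visits`, the data, and the equal weights are assumptions for illustration only. It compares an equally weighted score on the raw values with the same score on z-scores.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the sample std dev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def ranking(scores):
    """Customer indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical data for four customers.
annual_spend = [200.0, 1500.0, 800.0, 12000.0]   # dollars, large scale
visits = [2.0, 12.0, 6.0, 1.0]                   # counts, small scale

w_spend, w_visits = 0.5, 0.5                     # equal weights (an assumption)

# Score on raw values: spend swamps visits because of its units.
raw_scores = [w_spend * s + w_visits * v for s, v in zip(annual_spend, visits)]

# Score on standardized values: both variables contribute on a comparable scale.
z_spend, z_visits = standardize(annual_spend), standardize(visits)
std_scores = [w_spend * s + w_visits * v for s, v in zip(z_spend, z_visits)]

print("raw ranking:         ", ranking(raw_scores))
print("standardized ranking:", ranking(std_scores))
```

On the raw values the ranking is simply the ranking by spend, so the 50/50 weights are not really doing what they claim. After standardizing, the number of visits can actually change who ends up on top.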
@Reeza,
Very good point. That makes a lot of sense.
Or you could look at a Poisson model (which can handle both categorical and continuous variables).
Have you taken a look at the PRINQUAL procedure?
Or check this:
Overview: PRINQUAL Procedure
The PRINQUAL procedure performs principal component analysis (PCA) of qualitative, quantitative, or
mixed data. PROC PRINQUAL is based on the work of Kruskal and Shepard (1974); Young, Takane, and