Solved: Re: Assigning weights to variables to calculate rank/score of a custom...

vsharipriya · Posted 10-20-2016 10:38 PM

I have data on customer purchase history, and I want to score each customer based on their attributes. To do this, I want to assign weights to the variables (e.g., 10% to v1, 20% to v2, 50% to v3, etc.) and then sum the weighted values. The resulting score should tell me how good a customer is. For instance, a score above 500 would mean they are good/loyal customers and we can expect good sales from them next time. The threshold can be decided once we have the scores; what I want to know is how to approach this problem.
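As a rough illustration of the weighted-sum idea described above, here is a minimal Python sketch. The customer records and the function name `weighted_score` are made up for illustration; only the example weights (10%, 20%, 50%) and the 500 cutoff come from the post.

```python
# Example weights from the post; the attribute values below are hypothetical.
WEIGHTS = {"v1": 0.10, "v2": 0.20, "v3": 0.50}
THRESHOLD = 500  # placeholder cutoff, to be revisited once scores exist


def weighted_score(customer, weights=WEIGHTS):
    """Return the sum of weight * attribute value over the weighted attributes."""
    return sum(w * customer[name] for name, w in weights.items())


customers = [
    {"id": "A", "v1": 200.0, "v2": 900.0, "v3": 800.0},
    {"id": "B", "v1": 500.0, "v2": 100.0, "v3": 300.0},
]

# One score per customer, then keep the ids that clear the cutoff.
scores = {c["id"]: weighted_score(c) for c in customers}
loyal = [cid for cid, s in scores.items() if s > THRESHOLD]
```

With raw values like these, the score is driven by whichever attributes have the largest numeric range, so the weights only mean what they appear to mean if the attributes are on comparable scales.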

I decided to run PCA, from which I can get the PCA scores and hence use coefficients as weights.

For example, if I select the first principal component and take the coefficients,

y1=0.5v1+0.8v2-0.2v3 ,

replacing v1, v2 , v3 with the values of the attributes, I can get a score of each observation.

I am not sure if this is a clever approach. Is there a better way to optimize the weights and calculate the score of each customer? Any thoughts are appreciated.
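The first-principal-component idea above can be sketched without any statistics package. This is only an illustration under assumptions: the data are invented, the variables are standardized first, and the loadings are found by power iteration on the correlation matrix (a simple stand-in for what a PCA routine does internally).

```python
import math
import statistics

# Hypothetical data: one list per attribute, one entry per customer.
v1 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
v2 = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
v3 = [9.0, 7.0, 8.0, 4.0, 5.0, 2.0]


def zscores(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mu) / sd for x in values]


def correlation_matrix(z_cols):
    """Correlation matrix of already-standardized columns."""
    n = len(z_cols[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_cols]
            for ci in z_cols]


def first_pc_loadings(matrix, iterations=200):
    """Power iteration for the dominant eigenvector of a symmetric matrix.

    Assumes the all-ones start vector is not orthogonal to that eigenvector.
    """
    p = len(matrix)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(p)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


z_cols = [zscores(col) for col in (v1, v2, v3)]
loadings = first_pc_loadings(correlation_matrix(z_cols))

# Score for customer i = sum over variables of loading * standardized value.
scores = [sum(w * col[i] for w, col in zip(loadings, z_cols))
          for i in range(len(v1))]
```

Note that the sign of a principal component is arbitrary, so it is worth checking which direction "higher" points in before treating a larger score as a better customer.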

DougWielenga · Posted 01-05-2018 11:36 AM

There are many approaches as you can see...

As @Reeza shared, the VARCLUS procedure can help reduce the number of interval variables and can generate cluster scores from mutually exclusive subsets of your continuous variables.

As @Ksharp shared, the PRINQUAL procedure provides approaches to including both interval and categorical data in a clustering solution.

The approach you choose must take into account how concerned you are with assigning some meaning to the scores. In general, it is easy to come up with formulas to create scores for individuals based on categorical and/or interval inputs. It is harder to determine what those scores actually mean in the case of your business problem. In general, you must determine how to weight different factors based on your business objectives.

Principal Components creates a new set of dimensions for the data with the nice properties that the dimensions are uncorrelated with each other and that each subsequent dimension explains less of the variation in the data than the previous dimension, but every single variable is included in the equation for each principal component (or PC). The equation for each PC is

PC(k) = X*Beta(k)

where

PC(k) = vector of scores for Principal Component k

X = data matrix containing values of each variable for each observation

Beta(k) = vector of coefficients for principal component k

rather than

PC(k) = X*Beta(k) + error

since there is no error term -- just a restructuring of the space into different dimensions. Variable Clustering is somewhat simpler in that it creates mutually exclusive groups of variables so that the Variable Cluster score depends only on the variables in that cluster rather than on every variable in the data like PCs do. Variable Clustering creates Principal Components as part of its algorithm to create Variable Cluster scores but limits the variables to those in the cluster. As a result, Variable Cluster scores cannot perfectly represent all of the original data, but this approach allows the user to choose one or two variables from each cluster (e.g. the variable most highly correlated with the cluster score and the variable least correlated with the cluster score) to use as surrogates for all of the variables in the cluster. Using the simpler set of variables allows easier interpretation and avoids redundancy of inputs.
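To make the surrogate idea concrete, here is a small Python sketch that picks the variable most correlated with a cluster score. It is an illustration under loud assumptions: the variable names and data are hypothetical, and the cluster score is approximated by the average of the standardized variables rather than by the cluster's first principal component as VARCLUS would compute it.

```python
import math


def zscores(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in values) / (n - 1))
    return [(x - mu) / sd for x in values]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


# Hypothetical cluster of three interval variables for five customers.
cluster = {
    "spend_12m": [1.0, 2.0, 3.0, 4.0, 5.0],
    "orders_12m": [2.0, 1.0, 4.0, 3.0, 5.0],
    "avg_basket": [1.0, 3.0, 2.0, 5.0, 4.0],
}

z = {name: zscores(vals) for name, vals in cluster.items()}
n_obs = len(cluster["spend_12m"])

# Stand-in cluster score: mean of the standardized variables (an assumption,
# not the first-principal-component score that VARCLUS produces).
cluster_score = [sum(col[i] for col in z.values()) / len(z) for i in range(n_obs)]

# Correlate every variable with the cluster score and keep the strongest one.
corr = {name: pearson(vals, cluster_score) for name, vals in cluster.items()}
surrogate = max(corr, key=corr.get)
```

The chosen surrogate depends only on the variables inside the cluster, which is the property that makes cluster scores easier to interpret than full principal components.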

Rather than cramming categorical variables into this mix, it might be better to use Variable Clustering to reduce the interval inputs and retain the categorical inputs assuming the categorical inputs are not poorly distributed (e.g. large numbers of levels, large proportion of levels with trivial amounts of data, etc...). The fact that you can include these in a reduction method like PRINQUAL does not mean that you should!

Regarding creating a score to rank customers, the fact that your approach is unsupervised means that the results must be interpreted from a business standpoint alone as there is no 'correct' score. I could create a Farmer Score using

Farmer Score = 2 * (number of chickens) + 6 * (number of cows)

but what that score means is a business interpretation rather than an analytical interpretation.

To be fair, assessing customer value is always difficult. Suppose an airline has two customers on a flight:

Customer A:

* seated in first class

* has platinum status with the airline awards

* joined the admirals club

* has high income

* wearing a suit

Customer B:

* flying in coach

* has not joined the awards program

* not an admirals club member

* appears to be young, possibly just out of college

* wearing a t-shirt and jeans

If you had to ask one of the two customers to get off the plane, which customer do you choose? One thing to ask yourself is which customer is more valuable? The answer to the latter question depends on what time frame you are considering.

If your time frame is...

... past flights - clearly Customer A is more valuable since they have achieved a top status level due to sufficient flights

... the current flight - you might choose Customer A, but who knows whether or not Customer A is using free points for this flight, which would attribute more current revenue to Customer B

... future flights - this is even murkier since Customer A might be at the end of his career and/or moving to a non-traveling job while Customer B might be joining a consulting firm and will be sent on trips all over the globe during regular business days when traveling is most expensive

Of course, Customer A might cause a lot more stink since they have preferred status, but they still might not actually be the most valuable to the company's current and/or future business success in many scenarios.

Hope this helps!

Cordially,
Doug

Ksharp · Posted 10-20-2016 11:31 PM

That is a clever approach.

But PCA applies only to continuous variables.

You also left out the second principal component, which may account for a large share of the variance in the data.

Maybe you could include the first two or three principal components.

Suppose the first PC accounts for 60% of the variance:

y1=0.5v1+0.8v2-0.2v3 ,

Suppose the second PC accounts for 40%:

y2=0.5v1+0.8v2-0.2v3 ,

Then the final score might be: Y=0.6*y1+0.4*y2 ?
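A minimal Python sketch of this combination step, assuming the per-component scores have already been computed. The function name `combine_components` and the sample score values are hypothetical. In a real PCA, the second component would have its own coefficients, different from those of the first.

```python
def combine_components(pc_scores, explained):
    """Blend component scores using each component's share of explained variance.

    pc_scores: list of score lists, one list per principal component.
    explained: proportion of variance explained by each component.
    """
    total = sum(explained)
    shares = [e / total for e in explained]  # renormalize so the shares sum to 1
    n = len(pc_scores[0])
    return [sum(s * pc[i] for s, pc in zip(shares, pc_scores)) for i in range(n)]


# Hypothetical component scores for three customers.
y1 = [1.2, -0.4, 0.3]
y2 = [-0.5, 0.9, 0.1]

final = combine_components([y1, y2], [0.60, 0.40])
```

Because the sign of each component is arbitrary, a blend like this is only meaningful after deciding which direction of each component counts as "better".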

Reeza · Posted 10-20-2016 11:53 PM

So this is an unsupervised learning problem?

You have no data to calibrate your model with?

vsharipriya · Posted 10-21-2016 12:12 AM

Yes @Reeza. It is unsupervised.

Ksharp · Posted 10-20-2016 11:58 PM

Or you could use a log-linear model.

Check the documentation of PROC CATMOD

Example 32.4: Log-Linear Model, Three Dependent Variables

Note: remove the non-significant variables before applying your model.

Reeza · Posted 10-21-2016 12:31 AM

Look at proc varclus

Also, make sure to standardize the variables. Otherwise, variables measured on larger scales take over.
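A small Python sketch of why this matters, using invented numbers: with equal weights on raw values, income swamps visit counts, while the same weights on z-scores let both variables contribute. The variable names and data are hypothetical.

```python
import statistics


def zscores(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical attributes for three customers, on very different scales.
income = [30000.0, 90000.0, 80000.0]
visits = [20.0, 2.0, 12.0]

# Equal weights on raw values: the ranking simply follows income.
raw = [0.5 * i + 0.5 * v for i, v in zip(income, visits)]

# Equal weights on standardized values: both variables influence the ranking.
std = [0.5 * i + 0.5 * v for i, v in zip(zscores(income), zscores(visits))]

best_raw = raw.index(max(raw))
best_std = std.index(max(std))
```

Here the top-ranked customer changes once the variables are standardized, even though the weights are identical in both cases.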

Ksharp · Posted 10-21-2016 03:22 AM

@Reeza ,

Very good point. That makes a lot of sense.

Or you could check the Poisson model (which can handle both categorical and continuous variables).

http://support.sas.com/kb/24/188.html

vsharipriya · Posted 10-25-2016 03:17 PM

Thanks, @Ksharp.
It looks like the Poisson model is for supervised problems. I don't have any target variable in my data that is related to the other variables.

I want each observation to get a score based on the weights of the variables, exactly like your first answer:

"Suppose for the first PC,which occupy %60

y1=0.5v1+0.8v2-0.2v3 ,

Suppose for the second PC,which occupy %40

y2=0.5v1+0.8v2-0.2v3 ,

the final score maybe : Y=0.6*Y1+0.4*Y2 "

But here, Y is the score for each observation, the X's are my variables, and the coefficients are the weights.

The X's are both categorical and continuous.

Ksharp · Posted 10-25-2016 10:06 PM

Did you take a look at the PRINQUAL procedure?

vsharipriya · Posted 10-25-2016 03:01 PM

@Reeza,
Is there any way we could use proc varclus for all types of variables?

I looked at the documentation, and it says it uses all the numeric variables by default.

My dataset has both categorical and continuous variables. Also, some of the categorical variables are coded as 1/0.

Ksharp · Posted 10-21-2016 03:35 AM

Or Check this:

Overview: PRINQUAL Procedure
The PRINQUAL procedure performs principal component analysis (PCA) of qualitative, quantitative, or
mixed data. PROC PRINQUAL is based on the work of Kruskal and Shepard (1974); Young, Takane, and
