Re: Need help in Data preparation for Cluster Analysis

Fae · Posted 07-20-2018 11:32 AM

I want to do customer behavior segmentation base on shopping pattern but I am having trouble on how to handle the departmental sales data. I will be doing variable standardization and dimension reduction using Factor Analysis/Principal component Analysis prior to clustering.

Should I:

A: The customer total sales and the %breakdown by each departments.

Customer_ID, total sales($), Department1_%sales, Department2_%sales, Department3_%sales.

or B: The sales of each department in dollar figures.

Customer_ID, Department1_sales, Department2_sales, Department3_sales.

PaigeMiller · Posted 07-20-2018 11:36 AM

What do you want your dimension reduction to use, the percent in each department or the actual dollar amount? I don't think we can answer this here, its up to you.

--
Paige Miller

Fae · Posted 07-20-2018 11:42 AM

What is the pro and con on each? Will it create a lot of issues if I include all of them and just feed in to proc factor?

PaigeMiller · Posted 07-20-2018 11:50 AM

If you use percents, then a customer with numbers like this

Dept 1 sales: 500.00

Dept 2 sales: 1000.00

Dept 3 sales: 1500.00

is treated the same as a customer with numbers like this

Dept 1 sales: 5.00

Dept 2 sales: 10.00

Dept 3 sales: 15.00

Is that what you want, or not?

--
Paige Miller

Fae · Posted 07-20-2018 01:09 PM

Is there any other way to capture those information?

PaigeMiller · Posted 07-20-2018 01:37 PM

@Fae wrote:

Is there any other way to capture those information?

I don't know what you mean by this.

And you didn't answer my question from my last post.

But I think it's time to call a pause here and get back to what @ballardw and I have been saying. We cannot (CANNOT) determine the goals of your analysis. You (and the people on your project) have to determine the goals of the analysis. Once you accomplish that, and explain to us what the goals of the analysis are, and we are all clear about where we are going, then we can help you select the proper statistical approach and SAS method.

--
Paige Miller

Fae · Posted 07-23-2018 01:50 PM

let me say thanks for both of you: @PaigeMiller @ballardw

The total sales will be most important, so I will put in actual dollar amount.

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department = 100%. Wouldn't they be inversely correlated with each other?

Also, for the example you given, since I included total sales: wouldn't the customer still be treated different since their total sales is different?

PaigeMiller · Posted 07-23-2018 02:10 PM

@Fae wrote:

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department = 100%. Wouldn't they be inversely correlated with each other?

I'm not really sure what this question means. "Wouldn't they be ... " but the word "they" doesn't have a clear meaning here.

Also, for the example you given, since I included total sales: wouldn't the customer still be treated different since their total sales is different?

Do you mean: are they different if you put the actual dollar amount in? Or do you mean: are they different if you use % of sales in your previous question?

--
Paige Miller

Fae · Posted 07-23-2018 04:58 PM

My question is that since %sales_Dep1 + %sales_Dep2 + %sales_Dep3 = 1, wouldn't they will be linear dependent and will that cause issues? Or will proc factor fix the linear dependency?

And ignoring the possible linear dependency for now, is this a good structure?

Total_Sales, %sales_Dep1, %sales_Dep2, %sales_Dep3

ballardw · Posted 07-23-2018 02:14 PM

@Fae wrote:

let me say thanks for both of you: @PaigeMiller @ballardw

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department = 100%. Wouldn't they be inversely correlated with each other?

I think you might mean "linearly dependent" instead of inverse unless your customers only have a choice of two departments to purchase from. If a customer purchased from 5 departments then any ons is dependent on the values of the other 4. In which case you might want to drop one department (consistently across all customers - same one) if using percentage that way.

Fae · Posted 07-23-2018 06:06 PM

Yes I mean "linearly dependent".

I will say the main objective is to segment customers with total sales the most important and what department they spend in next.

Let say I have 5 department in total, so the variable I have will be:

Total_Sales
Dept1_sales to Dept5_sales

Total_QTY
Dept1_QTY to Dept5_QTY

Total_Avg_price

Dept1_Avg_price to Dept5_Avg_price

Pct_Dept1_sales to Pct_Dept5_sales
Pct_Dept1_QTY to Pct_Dept5_QTY

where

sales = QTY * Avg_price for both Total and Dept1-dept5

Sum(all Pct_Dept#_sales) = 1

Sum(all Pct_Dept#_QTY) = 1

Let say I select the following variables: Dept1_sales, Dept2_sales, Dept3_sales, Dept4_sales, Dept5_sales, Total_QTY

Then i standardize them with proc std(range) and then use proc factor(method=principal rotate=varimax) to eliminate colinearity, will it be a good start?

If the variable list isn't good, how about the list below?

Total_Sales, Pct_Dept1_sales to Pct_Dept4_sales (dropped Dept5), Total_Avg_price

Thanks again. @ballardw @PaigeMiller

ballardw · Posted 07-20-2018 02:12 PM

@Fae wrote:

Is there any other way to capture those information?

You might look at the documentation for Proc STDIZE as it has a large number of different methods to standardize data that have a number of different ways of dealing with ranges or distribution of data.

Fae · Posted 07-23-2018 02:41 PM

Let me say thank again to both of you. @ballardw @PaigeMiller

By documentation, u mean this? https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_stdize_sect...

In the example, is weight cube rooted because weight is a function of (length * height * width). So since sales = price * QTY, should i just square root it and use sqrt(sales) to re-scale other variable?

Also, I understand that length1, length2 and length3 are highly correlated so we don't need all. But why do we only use length1 and logLengthRatio?

Let say if there was 5 length measurement and also have Length4 (length from nose to eye) and length5 (lenght from nose to gill).

Are we still going to use 2 length measurement? Will we still want just 1 re-scaled length and 1 logLengthRatio?

ballardw · Posted 07-20-2018 11:55 AM

The choices of standardization would depend on what the actual analysis question(s) could be.

I suspect if you spend a little time getting that rigorously defined the appropriate choice may be more obvious.