BookmarkSubscribeRSS Feed
Fae
Obsidian | Level 7 Fae
Obsidian | Level 7

I want to do customer behavior segmentation base on shopping pattern but I am having trouble on how to handle the departmental sales data.  I will be doing variable standardization and dimension reduction using Factor Analysis/Principal component Analysis prior to clustering.

 

Should I:

 

A: The customer total sales and the %breakdown by each departments.

 

Customer_ID, total sales($), Department1_%sales, Department2_%sales, Department3_%sales.

 

or B: The sales of each department in dollar figures.

 

Customer_ID, Department1_sales, Department2_sales, Department3_sales.

 

 

 

13 REPLIES 13
PaigeMiller
Diamond | Level 26

What do you want your dimension reduction to use, the percent in each department or the actual dollar amount? I don't think we can answer this here, its up to you.

--
Paige Miller
Fae
Obsidian | Level 7 Fae
Obsidian | Level 7

What is the pro and con on each?  Will it create a lot of issues if I include all of them and just feed in to proc factor?

PaigeMiller
Diamond | Level 26

If you use percents, then a customer with numbers like this 

 

Dept 1 sales: 500.00

Dept 2 sales: 1000.00

Dept 3 sales: 1500.00

 

is treated the same as a customer with numbers like this

 

Dept 1 sales: 5.00

Dept 2 sales: 10.00

Dept 3 sales: 15.00

 

Is that what you want, or not?

--
Paige Miller
Fae
Obsidian | Level 7 Fae
Obsidian | Level 7

Is there any other way to capture those information?

PaigeMiller
Diamond | Level 26

@Fae wrote:

Is there any other way to capture those information?


I don't know what you mean by this.

 

And you didn't answer my question from my last post.

 

But I think it's time to call a pause here and get back to what @ballardw and I have been saying. We cannot (CANNOT) determine the goals of your analysis. You (and the people on your project) have to determine the goals of the analysis. Once you accomplish that, and explain to us what the goals of the analysis are, and we are all clear about where we are going, then we can help you select the proper statistical approach and SAS method.

--
Paige Miller
Fae
Obsidian | Level 7 Fae
Obsidian | Level 7

let me say thanks for both of you: @PaigeMiller @ballardw

 

The total sales will be most important, so I will put in actual dollar amount.

 

 

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department  = 100%.  Wouldn't they be inversely correlated with each other?  

 

Also, for the example you given, since I included total sales: wouldn't the customer still be treated different since their total sales is different?

PaigeMiller
Diamond | Level 26

@Fae wrote:

 

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department  = 100%.  Wouldn't they be inversely correlated with each other?  

 


I'm not really sure what this question means. "Wouldn't they be ... " but the word "they" doesn't have a clear meaning here.

 

Also, for the example you given, since I included total sales: wouldn't the customer still be treated different since their total sales is different?

Do you mean: are they different if you put the actual dollar amount in? Or do you mean: are they different if you use % of sales in your previous question?

 

--
Paige Miller
Fae
Obsidian | Level 7 Fae
Obsidian | Level 7

My question is that since %sales_Dep1 +  %sales_Dep2  + %sales_Dep3 = 1, wouldn't they will be linear dependent and will that cause issues?  Or will proc factor fix the linear dependency?

 

And ignoring the possible linear dependency for now, is this a good structure?

Total_Sales, %sales_Dep1, %sales_Dep2%sales_Dep3

ballardw
Super User

@Fae wrote:

let me say thanks for both of you: @PaigeMiller @ballardw

 

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department  = 100%.  Wouldn't they be inversely correlated with each other?  

 


 

I think you might mean "linearly dependent" instead of inverse unless your customers only have a choice of two departments to purchase from. If a customer purchased from 5 departments then any ons is dependent on the values of the other 4. In which case you might want to drop one department (consistently across all customers - same one) if using percentage that way.

 

Fae
Obsidian | Level 7 Fae
Obsidian | Level 7

Yes I mean "linearly dependent".

 

 

I will say the main objective is to segment customers with total sales the most important and what department they spend in next.

 

Let say I have 5 department in total, so the variable I have will be:

 

Total_Sales
Dept1_sales to Dept5_sales


Total_QTY
Dept1_QTY to Dept5_QTY

 

Total_Avg_price

Dept1_Avg_price to Dept5_Avg_price


Pct_Dept1_sales to Pct_Dept5_sales
Pct_Dept1_QTY to Pct_Dept5_QTY

where

sales = QTY * Avg_price for both Total and Dept1-dept5

Sum(all Pct_Dept#_sales) = 1

Sum(all Pct_Dept#_QTY)  = 1

 

Let say I select the following variables:   Dept1_sales, Dept2_sales, Dept3_sales, Dept4_sales, Dept5_sales, Total_QTY

 

 

Then i standardize them with proc std(range) and then use proc factor(method=principal rotate=varimax) to eliminate colinearity, will it be a good start?

 

If the variable list isn't good, how about the list below?

Total_Sales, Pct_Dept1_sales to Pct_Dept4_sales (dropped Dept5), Total_Avg_price

 

Thanks again.  @ballardw @PaigeMiller

 

ballardw
Super User

@Fae wrote:

Is there any other way to capture those information?


 

You might look at the documentation for Proc STDIZE as it has a large number of different methods to standardize data that have a number of different ways of dealing with ranges or distribution of data.

Fae
Obsidian | Level 7 Fae
Obsidian | Level 7

Let me say thank again to both of you. @ballardw @PaigeMiller

 

By documentation, u mean this?  https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_stdize_sect...

 

 

In the example, is weight cube rooted because weight is a function of (length * height * width).  So since sales = price * QTY, should i just square root it and use sqrt(sales) to re-scale other variable?

 

Also, I understand that length1, length2 and length3 are highly correlated so we don't need all.  But why do we only use length1 and logLengthRatio?

 

Let say if there was 5 length measurement and also have Length4 (length from nose to eye) and length5 (lenght from nose to gill).

Are we still going to use 2 length measurement?   Will we still want just 1 re-scaled length and 1 logLengthRatio?

ballardw
Super User

The choices of standardization would depend on what the actual analysis question(s) could be.

 

I suspect if you spend a little time getting that rigorously defined the appropriate choice may be more obvious.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1788 views
  • 0 likes
  • 3 in conversation