Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Need help in Data preparation for Cluster Analysis

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 07-20-2018 11:32 AM
(1821 views)

I want to do customer behavior segmentation base on shopping pattern but I am having trouble on how to handle the departmental sales data. I will be doing variable standardization and dimension reduction using Factor Analysis/Principal component Analysis prior to clustering.

Should I:

A: The customer total sales and the %breakdown by each departments.

Customer_ID, total sales($), Department1_%sales, Department2_%sales, Department3_%sales.

or B: The sales of each department in dollar figures.

Customer_ID, Department1_sales, Department2_sales, Department3_sales.

13 REPLIES 13

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

What do you want your dimension reduction to use, the percent in each department or the actual dollar amount? I don't think we can answer this here, its up to you.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

If you use percents, then a customer with numbers like this

Dept 1 sales: 500.00

Dept 2 sales: 1000.00

Dept 3 sales: 1500.00

is treated the same as a customer with numbers like this

Dept 1 sales: 5.00

Dept 2 sales: 10.00

Dept 3 sales: 15.00

Is that what you want, or not?

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Is there any other way to capture those information?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Fae wrote:

Is there any other way to capture those information?

I don't know what you mean by this.

And you didn't answer my question from my last post.

But I think it's time to call a pause here and get back to what @ballardw and I have been saying. We cannot (CANNOT) determine the goals of your analysis. You (and the people on your project) have to determine the goals of the analysis. Once you accomplish that, and explain to us what the goals of the analysis are, and we are all clear about where we are going, then we can help you select the proper statistical approach and SAS method.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

let me say thanks for both of you: @PaigeMiller @ballardw

The total sales will be most important, so I will put in actual dollar amount.

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department = 100%. Wouldn't they be inversely correlated with each other?

Also, for the example you given, since I included total sales: wouldn't the customer still be treated different since their total sales is different?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Fae wrote:

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department = 100%. Wouldn't they be inversely correlated with each other?

I'm not really sure what this question means. "Wouldn't they be ... " but the word "they" doesn't have a clear meaning here.

Also, for the example you given, since I included total sales: wouldn't the customer still be treated different since their total sales is different?

Do you mean: are they different if you put the actual dollar amount in? Or do you mean: are they different if you use % of sales in your previous question?

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

My question is that since %sales_Dep1 + %sales_Dep2 + %sales_Dep3 = 1, wouldn't they will be linear dependent and will that cause issues? Or will proc factor fix the linear dependency?

And ignoring the possible linear dependency for now, is this a good structure?

Total_Sales, %sales_Dep1, %sales_Dep2, %sales_Dep3

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Fae wrote:

let me say thanks for both of you: @PaigeMiller @ballardw

A bit off question, if instead I want to use %sales by department, since the sum of %sales of all department = 100%. Wouldn't they be inversely correlated with each other?

I think you might mean "linearly dependent" instead of inverse unless your customers only have a choice of two departments to purchase from. If a customer purchased from 5 departments then any ons is dependent on the values of the other 4. In which case you might want to drop one department (consistently across all customers - same one) if using percentage that way.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Yes I mean "linearly dependent".

I will say the main objective is to segment customers with total sales the most important and what department they spend in next.

Let say I have 5 department in total, so the variable I have will be:

Total_Sales

Dept1_sales to Dept5_sales

Total_QTY

Dept1_QTY to Dept5_QTY

Total_Avg_price

Dept1_Avg_price to Dept5_Avg_price

Pct_Dept1_sales to Pct_Dept5_sales

Pct_Dept1_QTY to Pct_Dept5_QTY

where

sales = QTY * Avg_price for both Total and Dept1-dept5

Sum(all Pct_Dept#_sales) = 1

Sum(all Pct_Dept#_QTY) = 1

Let say I select the following variables: Dept1_sales, Dept2_sales, Dept3_sales, Dept4_sales, Dept5_sales, Total_QTY

Then i standardize them with proc std(range) and then use proc factor(method=principal rotate=varimax) to eliminate colinearity, will it be a good start?

If the variable list isn't good, how about the list below?

Total_Sales, Pct_Dept1_sales to Pct_Dept4_sales (dropped Dept5), Total_Avg_price

Thanks again. @ballardw @PaigeMiller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Fae wrote:

Is there any other way to capture those information?

You might look at the documentation for Proc STDIZE as it has a large number of different methods to standardize data that have a number of different ways of dealing with ranges or distribution of data.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Let me say thank again to both of you. @ballardw @PaigeMiller

By documentation, u mean this? https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_stdize_sect...

In the example, is weight cube rooted because weight is a function of (length * height * width). So since sales = price * QTY, should i just square root it and use sqrt(sales) to re-scale other variable?

Also, I understand that length1, length2 and length3 are highly correlated so we don't need all. But why do we only use length1 and logLengthRatio?

Let say if there was 5 length measurement and also have Length4 (length from nose to eye) and length5 (lenght from nose to gill).

Are we still going to use 2 length measurement? Will we still want just 1 re-scaled length and 1 logLengthRatio?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

The choices of standardization would depend on what the actual analysis question(s) could be.

I suspect if you spend a little time getting that rigorously defined the appropriate choice may be more obvious.

**Don't miss out on SAS Innovate - Register now for the FREE Livestream!**

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.