M_A_C
Calcite | Level 5

Hi,

Here is the situation:

I have a huge dataset with 15 million rows and 500 variables.

I want to build a behavioral segmentation.

First, I need to select the most significant variables so that only the essential ones remain, and then run k-means for the segmentation.

How can I choose the significant variables?

7 REPLIES

HE
Calcite | Level 5

A discriminant analysis on a random sample would be useful for keeping only the relevant variables. Start with PROC STEPDISC.
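A rough, untested sketch of what that suggestion could look like. The dataset name `work.big`, the class variable `segment`, the variable names `x1-x500`, the 1% sampling rate, and the entry/stay significance levels are all hypothetical placeholders. Note that PROC STEPDISC requires a CLASS variable holding known group membership.

```sas
/* Hypothetical names: work.big, segment, x1-x500 */

/* Draw a 1% simple random sample so the selection runs quickly */
proc surveyselect data=work.big out=work.sample
                  method=srs samprate=0.01 seed=12345;
run;

/* Stepwise discriminant analysis: retains the variables that best
   separate the levels of the CLASS variable */
proc stepdisc data=work.sample method=stepwise
              slentry=0.05 slstay=0.05;
   class segment;
   var x1-x500;
run;
```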

M_A_C
Calcite | Level 5

Thank you, I'm testing it. I'll get back to you if I have any further questions ;)

M_A_C
Calcite | Level 5

I ran into another issue:

I do not have a dependent variable. It's just a list of 500 variables.

Any ideas on how to do the selection?

SteveDenham
Jade | Level 19

No code, but some ideas.

  1. Subset the huge dataset.  A 1% random sample would probably do.
  2. Use PROC VARCLUS to see how the 500 variables cluster.
  3. Identify key variables from a business rule perspective within each variable cluster.
  4. Use those variables in PROC FASTCLUS on the full dataset to get your k-means clustering.
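
A minimal sketch of how these four steps might look in code, under stated assumptions. The dataset name `work.big`, the variable names, the seed, and the cluster counts are placeholders, not recommendations. Step 3 is a manual, business-driven choice, so the short VAR list in step 4 only stands in for it. The standardization step is not part of the list above; it is included because k-means distances are sensitive to the scale of the variables.

```sas
/* Hypothetical names throughout: work.big, x1-x500, x3 x47 x112 x260 */

/* Step 1: 1% simple random sample of the full dataset */
proc surveyselect data=work.big out=work.sample
                  method=srs samprate=0.01 seed=12345;
run;

/* Step 2: cluster the variables on the sample.
   MAXCLUSTERS=40 is an arbitrary cap; SHORT suppresses the cluster
   structure, scoring coefficient, and intercluster correlation tables */
proc varclus data=work.sample maxclusters=40 short;
   var x1-x500;
run;

/* Step 3 is manual: choose representative variables from each
   variable cluster based on business rules */

/* Step 4: standardize the chosen variables (an added assumption),
   then run k-means on the full dataset */
proc stdize data=work.big out=work.big_std method=std;
   var x3 x47 x112 x260;
run;

proc fastclus data=work.big_std out=work.segments
              maxclusters=5 maxiter=50;
   var x3 x47 x112 x260;
run;
```

The OUT= dataset from PROC FASTCLUS adds a CLUSTER variable (the assigned segment) and a DISTANCE variable (distance to the cluster seed) to each observation.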

If you have access to Enterprise Miner, then a lot of other techniques become available, most of which have the word "tree" in their name.

Steve Denham

M_A_C
Calcite | Level 5

Thank you very much, I'll get on it.

Varsha
Calcite | Level 5

I hope you have sorted out your problem with the methods described above.

Just wondering what types of variables you have, and whether you also tried factor analysis and MODECLUS?

I had the same problem with the number of significant variables, so I'm curious to know which technique was most useful.

M_A_C
Calcite | Level 5

Varsha,

I am going to use SteveDenham's idea. It's very logical and seems like it will work.

I am still running some other tasks that take up memory as well. I tried it on another laptop and it works just fine.

I ran PROC VARCLUS to see how the variables cluster. Then, from a business perspective, I chose the variables I judged important from each cluster. I also added some others that didn't stand out in the clustering but are necessary for this exercise.

Hope I won't run into any trouble. If I do, I'll be back to bother you guys.

Good day to ye!

