Segmentation of a huge datamart

Occasional Contributor
Posts: 14

Hi,

Here is the situation:

I have a huge dataset of 15 million rows and 500 variables.

I want to build a behavioral segmentation.

First, I have to select the most significant variables so that only the essential ones remain, and then run k-means for the segmentation.

How can I choose the significant variables?

New Contributor HE
New Contributor
Posts: 4

Re: Segmentation of a huge datamart

A discriminant analysis on a random sample would be useful for keeping only the relevant variables. Start with PROC STEPDISC.
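
A rough sketch of what that could look like, assuming a grouping variable is available (PROC STEPDISC requires a CLASS variable). The dataset names `mart.customers` and `work.sample`, the class variable `known_segment`, the variable list `x1-x500`, the 1% sampling rate, and the significance levels are all placeholders for illustration, not values from this thread.

```sas
/* Draw a simple random sample so the selection runs quickly. */
proc surveyselect data=mart.customers out=work.sample
                  method=srs samprate=0.01 seed=20240101;
run;

/* Stepwise discriminant selection: keeps the variables that help
   separate the levels of the (hypothetical) class variable.      */
proc stepdisc data=work.sample method=stepwise
              slentry=0.05 slstay=0.05;
   class known_segment;
   var x1-x500;
run;
```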

Occasional Contributor
Posts: 14

Re: Segmentation of a huge datamart

Thank you, I'm testing it. I'll get back to you if I have any further questions.

Occasional Contributor
Posts: 14

Re: Segmentation of a huge datamart

I have another issue:

I do not have a dependent variable. It's just a list of 500 variables.

Any ideas on how to do the selection?

Respected Advisor
Posts: 2,655

Re: Segmentation of a huge datamart

No code, but some ideas.

  1. Subset the huge dataset.  A 1% random sample would probably do.
  2. Use PROC VARCLUS to see how the 500 variables cluster.
  3. Identify key variables from a business rule perspective within each variable cluster.
  4. Use those variables in PROC FASTCLUS on the full dataset to get your k-means clustering.
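
The reply gives no code, but the four steps above might be sketched roughly as follows. Everything named here is an assumption for illustration: the datasets `mart.customers`, `work.sample`, `work.std`, and `work.segments`, the variable list `x1-x500`, the hand-picked variables in `keepvars`, and the settings `maxeigen=0.7`, `maxclusters=6`, and `maxiter=50`. The PROC STDIZE step is also an addition, included because k-means is sensitive to variable scale. PROC VARCLUS works on numeric variables only.

```sas
/* 1. Take a 1% simple random sample of the full table. */
proc surveyselect data=mart.customers out=work.sample
                  method=srs samprate=0.01 seed=20240101;
run;

/* 2. Cluster the variables (not the rows). SHORT trims the output;
      MAXEIGEN= sets the second-eigenvalue threshold above which a
      cluster is split further.                                     */
proc varclus data=work.sample maxeigen=0.7 short;
   var x1-x500;
run;

/* 3. Pick representatives by hand from the VARCLUS output,
      combining the statistics with business rules.
      These names are placeholders.                          */
%let keepvars = x12 x47 x103 x250 x388;

/* 4. Standardize the chosen variables (an added assumption),
      then run k-means on the full dataset.                   */
proc stdize data=mart.customers out=work.std method=std;
   var &keepvars;
run;

proc fastclus data=work.std out=work.segments
              maxclusters=6 maxiter=50;
   var &keepvars;
run;
```

In the VARCLUS output, a low 1-R**2 ratio marks a variable that is strongly tied to its own cluster and weakly tied to the nearest other cluster, which can help with step 3. The number of segments still has to be chosen, for example by comparing several `maxclusters=` values.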

If you have access to Enterprise Miner, then a lot of other techniques become available, most of which have the word "tree" in their name.

Steve Denham

Occasional Contributor
Posts: 14

Re: Segmentation of a huge datamart

Thank you very much, I'll get on it.

N/A
Posts: 1

Re: Segmentation of a huge datamart

I hope you have solved your problem with the methods described above.

I'm just wondering what types of variables you have, and whether you also tried factor analysis and MODECLUS.

I had the same problem with the number of significant variables, so I'm curious to know which technique was most useful.

Occasional Contributor
Posts: 14

Re: Segmentation of a huge datamart

Varsha,

I am going to use SteveDenham's idea; it's very logical and seems like it will work.

I am still running some other tasks that take up memory as well. I tried it on another laptop and it works just fine.

I used PROC VARCLUS to see how the variables cluster. Then, from a business perspective, I chose the variables I judged important from each cluster, and I added some others that did not stand out in the clustering but are necessary for this exercise.

I hope I won't run into any trouble; if I do, I'll be back to bother you guys.

Good day to ye!
