chemicalab
Fluorite | Level 6

Hi all,

I have performed a clustering analysis on some IDs and their characteristics, and so have been able to assign each of them to a cluster.

I was thinking that I have to validate my model with discriminant analysis to see whether it achieves a good classification rate. After validating, I would like to be able to classify new data into the clusters. I have to say, though, that I am a bit confused by the procedure because I haven't had much experience with it, but I would like to learn how. So here I go:

1) What would be the most efficient and clear SAS syntax for PROC DISCRIM? I mean, do I need all the options it provides?

2) Which parts of the output should I focus on in order to explain the procedure to someone else?

Since I validate my model against already known outcomes (meaning the cluster number in one column), I can understand a bit what the model is comparing to. My big confusion is what happens when I have new data without a known outcome and need to put the new IDs into my existing clusters. How exactly does that work? Sorry if this sounds stupid, but I am stuck on how that happens. Would it be easy to demonstrate with one data set with known outcomes and one without, so I could understand it better (including how to run the procedure)? Do I have to run a procedure similar to the one I use for validation?

Many thanks in advance.

Rick_SAS
SAS Super FREQ

Regarding scoring new data, you might consider using the centroid of each cluster as "classification points." For new data, assign the category for which the distance from the observation to a cluster center is smallest.
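A minimal sketch of that centroid idea follows. The data sets `train` and `newdata`, the variables `x1` and `x2`, and all the values are made up for illustration. PROC MEANS writes one row per cluster holding the variable means, and PROC FASTCLUS with MAXITER=0 keeps those seeds fixed, so each new row is assigned to the nearest one.

```sas
/* Hypothetical training data: two variables plus an existing cluster label */
data train;
   input id x1 x2 cluster;
   datalines;
1 1.0 1.2 1
2 0.8 0.9 1
3 1.1 1.0 1
4 5.0 5.2 2
5 5.3 4.9 2
6 4.8 5.1 2
;
run;

/* Hypothetical new observations without a cluster label */
data newdata;
   input id x1 x2;
   datalines;
101 0.9 1.1
102 5.1 5.0
;
run;

/* One row per cluster: the mean of each variable is the centroid */
proc means data=train noprint nway;
   class cluster;
   var x1 x2;
   output out=centroids mean=;
run;

/* MAXITER=0 means the seeds are not updated, so this step only
   assigns each new observation to the closest centroid */
proc fastclus data=newdata seed=centroids maxclusters=2 maxiter=0
              out=scored noprint;
   var x1 x2;
run;

proc print data=scored;
run;
```

The OUT= data set gains the variables CLUSTER and DISTANCE. FASTCLUS numbers clusters in the order of the seed rows, so the labels only line up with the original ones when the centroid rows are ordered 1, 2, ..., k.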

If you insist on learning about discriminant analysis, I have two suggestions:

1) Always start with the Getting Started example in the doc. You might even try PROC CANDISC, since canonical discriminant analysis is simpler and will seem familiar if you know principal component analysis.

2) The best quick intro to the discriminant procedures is actually in the SAS/IML Studio User's Guide: http://support.sas.com/documentation/cdl/en/imlsug/64254/HTML/default/viewer.htm#ugmultcda_toc.htm

Read the Overview and Getting Started sections. To get a short list of the "most important" options, see the description of the Method, Tables, and Output Variables tabs. The descriptions say "this checkbox corresponds to the XYZ option." The options mentioned in these dialog boxes are the most important.
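For the discriminant route, a short PROC DISCRIM call may be all that is needed. The sketch below is an assumption-laden illustration, not the poster's data: it uses `sashelp.iris`, with `Species` standing in for the cluster label, and holds out every fifth row as "new" data with the label dropped. CROSSVALIDATE asks for leave-one-out classification of the training rows (the validation step), and TESTDATA=/TESTOUT= score the unlabeled rows in the same run.

```sas
/* Species stands in for the cluster label; the held-out rows
   lose the label to mimic new data with unknown outcome */
data train newdata(drop=Species);
   set sashelp.iris;
   if mod(_n_, 5) = 0 then output newdata;
   else output train;
run;

proc discrim data=train
             method=normal pool=yes   /* linear discriminant analysis */
             crossvalidate            /* leave-one-out validation of the training rows */
             testdata=newdata         /* rows to be classified */
             testout=scored;          /* classified rows are written here */
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc print data=scored(obs=5);
run;
```

In the TESTOUT= data set the variable _INTO_ holds the assigned class, next to the posterior probability of each class. For explaining the results to someone else, the cross-validation classification summary and its error-count estimates are probably the first tables to read.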

chemicalab
Fluorite | Level 6

Hi Rick,

Thank you for your reply, it was insightful. Could I ask you something else?

What is the preferred form for the input variables: should they be coded into binary form, or left in their original raw form?

Thanks in advance.

Rick_SAS
SAS Super FREQ

In the original coordinates. 

But to be honest, I'm not sure what you mean by "binary form." The original coordinates are continuous (interval), right? So the clustering is a subdivision of the variable space into k disjoint regions, each region corresponding to a cluster. I don't see any binary variables in this problem.

chemicalab
Fluorite | Level 6

Yes, most of them are continuous, but some of them are in binary form (0/1). I mean the variables that are used for the discriminant analysis; the class variable (the clusters) takes the values 1-9.

Rick_SAS
SAS Super FREQ

Sorry, but I'm confused.  You said in your original post that you did a clustering analysis. A clustering analysis is "unsupervised learning" and does not involve any response variables.

1) If you use the centroids of each cluster to score subsequent obs, then the entire analysis can be done in the original coordinates of the X vars.

2) If you use discriminant analysis to score subsequent obs (which I don't recommend), then you have a single categorical response variable with values 1-9 that you use to classify future observations.

In either case, there are no binary variables that I can see.

chemicalab
Fluorite | Level 6

I did some research but could use a confirmation.

1) Cluster procedure:

proc cluster data=<input_data>      /* WHICH DATA */
             method=ward            /* WHAT LINKAGE */
             outtree=<tree>         /* OUTPUT DATA INCLUDING SOLUTION */
             ccc pseudo print=15;
   id <id_var>;                     /* VAR THAT IDENTIFIES EACH OBSERVATION */
   var <cluster_vars>;              /* VARS TO BE CLUSTERED */
run;

/* Produce tree so as to see the shape of the solution */
ods graphics on;
proc tree nclusters=5               /* HOW MANY CLUSTERS */
          data=<tree>               /* OUTPUT DATA OF STEP 1 IS INPUT DATA IN STEP 2 */
          out=<clusters>;           /* FINAL DATA INCLUDING SOLUTION */
   id <id_var>;
run;
ods graphics off;

proc sort data=<clusters>
          out=<sorted>;
   by Cluster;
run;

proc means data= /* WHICH DATA? */;
   output out=<centroids>;
run;

proc fastclus data=<newdata>
              maxclusters=<nclusters>
              seed=<centroids>
              maxiter=0 out=<scored>;
   var <cluster_vars>;
run;

My question here is what the PROC MEANS syntax should be, because it doesn't make sense to use the output from PROC TREE, which contains only the cluster and the ID.
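One possible way around this, sketched under assumptions: the COPY statement in PROC TREE carries variables from the tree data set into the OUT= data set, so the clustering variables end up next to CLUSTER and PROC MEANS can compute centroids directly from that output. The example uses `sashelp.iris` and three clusters purely for illustration.

```sas
/* Ward clustering; the OUTTREE= data set also holds the VAR variables */
proc cluster data=sashelp.iris method=ward outtree=tree noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* COPY puts the clustering variables into the OUT= data set,
   alongside CLUSTER and CLUSNAME */
proc tree data=tree nclusters=3 out=clus noprint;
   copy SepalLength SepalWidth PetalLength PetalWidth;
run;

/* One row per cluster with the variable means (the centroids) */
proc means data=clus noprint nway;
   class cluster;
   var SepalLength SepalWidth PetalLength PetalWidth;
   output out=centroids mean=;
run;

proc print data=centroids;
run;
```

An alternative with the same result would be to merge the PROC TREE output back onto the original data by the ID variable before running PROC MEANS.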

2) Fastclus

proc fastclus data=<input_data>
              outstat=<stats>
              maxclusters=5;
   var <cluster_vars>;
   id <id_var>;
run;

proc means data= /* WHICH DATA? */;
run;

proc fastclus data=<newdata>
              instat=<stats>
              out=<scored>;
run;

Same question here as above regarding the PROC MEANS syntax: what should my input be?
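With FASTCLUS the PROC MEANS step may not be needed at all, since the OUTSTAT= data set already stores the cluster centers and INSTAT= reads them back for scoring. A minimal sketch, again using `sashelp.iris` and a made-up train/new split as stand-ins for the real data:

```sas
/* Hypothetical split: every fifth row plays the role of new data */
data train newdata;
   set sashelp.iris;
   if mod(_n_, 5) = 0 then output newdata;
   else output train;
run;

/* Cluster the training rows and save the cluster statistics */
proc fastclus data=train maxclusters=3 outstat=clusstat
              out=train_clus noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* INSTAT= reuses the saved centers: no iterations are run,
   the new rows are only assigned to their nearest cluster */
proc fastclus data=newdata instat=clusstat maxclusters=3 out=scored;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc print data=scored(obs=5);
run;
```

MAXCLUSTERS= is repeated in the scoring step as a precaution; whether it is strictly required together with INSTAT= is something to check in the documentation.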

Thanks in advance.
