The know-how of discriminant analysis?

chemicalab · Posted 11-28-2011 04:35 PM

Hi all,

I have performed a clustering analysis on some IDs using their characteristics, and was therefore able to assign each of them to a cluster.

I was thinking that I have to validate my model with discriminant analysis to see whether it achieves a good classification rate. After validating, I would like to be able to classify new data into the clusters. I have to say, though, that I am a bit confused about the procedure because I haven't had much experience with it, but I would like to learn how. So here I go:

1) What would be the most efficient and clear SAS syntax for PROC DISCRIM? I mean, do I need all those options provided?

2) Which part of the output should I focus on in order to better explain the procedure to someone else?

Since I validate my model against already known outcomes (meaning the cluster number in one column), I understand roughly what the model is comparing against. My big confusion is what happens when I have new data without a known outcome and I need to put the new IDs into my existing clusters. How exactly does that work? Sorry if I sound stupid about it, but I am stuck on how that occurs. Would it be easy to demonstrate with one set with known outcomes and one without, so I could understand it better (including how to run the procedure)? Do I have to run a procedure similar to the one I use for validation?

Many thanks in advance.

Rick_SAS · Posted 11-28-2011 05:02 PM

Regarding scoring new data, you might consider using the centroid of each cluster as "classification points." For new data, assign the category for which the distance from the observation to a cluster center is smallest.
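A minimal sketch of that centroid approach, not a definitive implementation. It assumes a data set work.clustered that already holds the inputs x1-x3 together with a CLUSTER number, a data set work.newdata with the same inputs, and five clusters; all of these names and the cluster count are hypothetical.

```sas
/* Hypothetical inputs: work.clustered (x1-x3 plus CLUSTER) and
   work.newdata (x1-x3 only). Replace the names with your own. */

/* One row per cluster holding the mean of each input (the centroid) */
proc means data=work.clustered noprint nway;
   class cluster;
   var x1-x3;
   output out=work.centroids(drop=_type_ _freq_) mean=;
run;

/* MAXITER=0 keeps the seeds fixed, so each new row is only assigned
   to its nearest centroid; the assignment is the CLUSTER variable
   in the OUT= data set */
proc fastclus data=work.newdata seed=work.centroids
              maxclusters=5 maxiter=0 out=work.scored;
   var x1-x3;
run;
```

The CLUSTER numbers written by the scoring step should follow the row order of the SEED= data set, so it is worth checking that they line up with the original cluster labels.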

If you insist on learning about discriminant analysis, I have two suggestions:

1) Always start with the Getting Started example in the doc. You might even try PROC CANDISC, since canonical discriminant analysis is simpler and will seem familiar if you know principal component analysis.

2) The best quick intro to the discriminant procedures is actually in the SAS/IML Studio User's Guide: http://support.sas.com/documentation/cdl/en/imlsug/64254/HTML/default/viewer.htm#ugmultcda_toc.htm

Read the Overview and Getting Started sections. To get a short list of the "most important" options, see the description of the Method, Tables, and Output Variables tabs. The descriptions say "this checkbox corresponds to the XYZ option." The options mentioned in these dialog boxes are the most important.
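For orientation, here is a hedged sketch of a PROC DISCRIM call that uses only a handful of options: one fit that both cross-validates on labeled rows and classifies unlabeled rows. It uses SASHELP.IRIS as stand-in data, with Species playing the role of the cluster number; the data set names work.known, work.new, and work.new_scored are made up for the illustration.

```sas
/* Stand-in data: every 5th row is held out and stripped of its label
   so that it behaves like "new" data without a known outcome */
data work.known work.new(drop=Species);
   set sashelp.iris;
   if mod(_n_, 5) = 0 then output work.new;
   else output work.known;
run;

proc discrim data=work.known          /* rows with known class labels        */
             method=normal pool=yes   /* linear discriminant analysis        */
             crossvalidate            /* leave-one-out error-rate estimate   */
             testdata=work.new        /* rows to classify                    */
             testout=work.new_scored; /* TESTDATA= rows plus assigned class  */
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

In the output, the cross-validation classification summary and the error-count estimates are probably the easiest parts to explain to someone else: they show how often a row is put back into its own class when it is left out of the fit. The TESTOUT= data set holds the posterior probabilities and the assigned class in the variable _INTO_.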

chemicalab · Posted 12-02-2011 10:15 AM

Hi Rick,

Thank you for your reply, it was insightful. Could I ask you something else?

What is the preferable form for the input variables: should they be coded into binary form or kept in their original raw form?

Thanks in advance.

Rick_SAS · Posted 12-02-2011 10:22 AM

In the original coordinates.

But to be honest, I'm not sure what you mean by "binary form." The original coordinates are continuous (interval), right? So the clustering is a subdivision of the variable space into k disjoint regions, each region corresponding to a cluster. I don't see any binary variables in this problem.

chemicalab · Posted 12-02-2011 10:31 AM

Yes, most of them are continuous; some of them are in binary form (0, 1). I mean the variables that are used for the discriminant analysis. The class variable (the clusters) takes the values 1-9.

Rick_SAS · Posted 12-02-2011 11:05 AM

Sorry, but I'm confused. You said in your original post that you did a clustering analysis. A clustering analysis is "unsupervised learning" and does not involve any response variables.

1) If you use the centroids of each cluster to score subsequent obs, then the entire analysis can be done in the original coordinates of the X vars.

2) If you use discriminant analysis to score subsequent obs (which I don't recommend), then you have a single categorical response variable with values 1-9 that you use to classify future observations.

In either case, there are no binary variables that I can see.

chemicalab · Posted 12-04-2011 05:56 AM

I did some research but could use a confirmation.

1) Cluster procedure:

proc cluster data= /* WHICH DATA */
      method=ward /* WHAT LINKAGE */
      outtree= /* OUTPUT DATA INCLUDING SOLUTION */
      ccc pseudo print=15;
   id CHOSEN ID /* VAR THAT WILL BE CLUSTERED */;
   var /* VARS TO BE CLUSTERED */;
run;

/* Produce tree so as to see the shape of solution */
ods graphics on;
proc tree nclusters=5 /* HOW MANY CLUSTERS */
      data= /* OUTPUT DATA OF STEP 1 IS INPUT DATA IN STEP 2 */
      out= /* FINAL DATA INCLUDING SOLUTION */;
   id chosen ID;
run;
ods graphics off;

proc sort data=
      out=;
   by Cluster;
run;

proc means data=;
   output out=;
run;

proc fastclus data=<newdata>
      maxclusters=<nclusters>
      seed=<centroids>
      maxiter=0 out=<scored>;
run;

My question here is what the PROC MEANS syntax should be, because it doesn't make sense to use the output from PROC TREE, which contains only the cluster and the ID.
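One possible way around this, sketched under assumptions rather than confirmed by anyone in the thread: merge the cluster numbers from the PROC TREE output back onto the original rows by ID, and give the merged data set to PROC MEANS. The names work.original, work.treeout, id, and x1-x3 are hypothetical, and the sketch assumes id has the same type and length in both data sets.

```sas
/* Hypothetical inputs:
   work.original = data set passed to PROC CLUSTER (id, x1-x3)
   work.treeout  = OUT= data set from PROC TREE (id, CLUSTER, CLUSNAME) */
proc sort data=work.original; by id; run;
proc sort data=work.treeout;  by id; run;

/* Attach the cluster number to each original row */
data work.clustered;
   merge work.original(in=in_orig)
         work.treeout(keep=id cluster);
   by id;
   if in_orig;   /* keep only rows that exist in the original data */
run;
```

work.clustered then contains both the inputs and CLUSTER, so PROC MEANS can be run on it with a CLASS cluster statement to get one row of means per cluster. An alternative that avoids the merge is the COPY statement in PROC CLUSTER and PROC TREE, which carries the listed variables through to the output data sets.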

2) Fastclus

proc fastclus data=
      outstat=
      maxclusters=5;
   var ;
   id chosen ID;
run;

proc means data=

proc fastclus instat=
      out=score data;
run;

Same question as above regarding the PROC MEANS syntax: what should my input be?

Thanks in advance.
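A hedged sketch of how the FASTCLUS route might look without any PROC MEANS step, on the understanding that the OUTSTAT= data set already stores the cluster centers and that INSTAT= reads it back to assign rows without re-clustering. SASHELP.IRIS is used as stand-in data; the data set names and the choice of three clusters are assumptions. If SAS asks for MAXCLUSTERS= in the scoring step, supply the same value as in the first step.

```sas
/* Stand-in data: every 5th row is held out as "new" data */
data work.train work.newdata;
   set sashelp.iris(drop=Species);
   if mod(_n_, 5) = 0 then output work.newdata;
   else output work.train;
run;

/* Step 1: cluster the training rows and save the cluster statistics */
proc fastclus data=work.train maxclusters=3
              outstat=work.clusstat out=work.clustered noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Step 2: assign the new rows to the existing clusters;
   the assignment is the CLUSTER variable in work.scored */
proc fastclus data=work.newdata instat=work.clusstat out=work.scored;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```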