BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JavierAbalo
Calcite | Level 5

Hello everybody,

I ran a discriminant analysis to classify cases into 8 groups. The solution that best worked was a NPAR with k=4.

proc discrim data=sasuser.seg_analysis_v3

                out=classification_npar

                outstat=stats

                outd=densities

                method=npar pool=yes k=4

                crossvalidate

                anova stdmean manova distance posterr;

class clust;

priors proportional;

var x1 x2 x3 x4 x5 x6 x7 x8 x9 x10;

run;

I want to classify new cases outside SAS and outside any statistical package. That would have been an easy task with a parametric method (linear/quadratic), but it doesn't seem the same for a nearest neighbor method.

Question: How can I classify new cases outside SAS (or any stats package) after a non-paremetric (k-neighbor) discriminant analysis?

Thanks for your time. Any help is appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
JavierAbalo
Calcite | Level 5

I got a clear answer from SAS. Summarizing, it is not possible to score new cases outside SAS with a NPAR discriminant model created in SAS.

I reproduce SAS response:

"Nonparametric methods do not produce parameters, by definition, so you cannot get an equivalent to the OUTSTAT= data set when you use METHOD=NORMAL. The is no closed-form solution for this and the entire DATA= data set is required for classification each time. In other words, there is not way to generate summary output from a nonparametric discrimination analysis that can be used to classify new observations. You have to use PROC DISCRIM to classify the new obs"

View solution in original post

6 REPLIES 6
Reeza
Super User

I don't know the answer, but I'd suggest searching for a methodological solution (math solution) and then someone can help you find out how to implement that in SAS.

Theoretically, if it gave you the centroids you could then calculate the distance and then create some probability? I'd have to reference my texts which aren't with me at the moment.

EDIT: EM 7.1 allows you to score new data based on k-means clustering so its possible methodologically. You need the centroids first off, and then the probability of being what case based on each centroid.  Then you calculate how close each new case is to the centroids.  Now how to get that out of SAS 🙂

JavierAbalo
Calcite | Level 5

Thank you Reeza,

I agree with you in that the methodological solution is the way to go. I have gone through SAS documentation but haven't been lucky with my tests (see the non-pararametric methods section below)

http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_discrim_sect...

To your point with EM, we can also score new cases in STAT with the TESTDATA= option. It is definitely possible, but I need to do it "manually" outside SAS (or even in SAS before translating to SQL).

I will keep looking.

Thanks much for your help.

1zmm
Quartz | Level 8

I don't know the answer either since you seem to be restricted to scoring new cases outside SAS.

Even with nonparametric discriminant analysis, PROC DISCRIM (using your code above) can write to SAS data sets both the original observations, variables, and group to which the procedure assigns the observation and the overall and group-specific statistics (means, standard deviations, and numbers of observartions).   PROC DISCRIM also writes a table with the parametric statistic, R-squared/(1-R-squared), that can help you identify which variables best discriminate among the individual groups.

You might try different methods to score new cases outside SAS:

   1). Create a centroid "observation" for each group based on the group-specific means for all variables (for example, eight or nine

            groups in your example, the ninth group being those not classified into any of the other eight groups).  Then, interleave your

            new observations among these centroid observations by sorting on the variables used.  New observations "closest" to

            a group-specific centroid are classified with that group.

   2) Create a centroid "observation" for each group based on the group-specific means for the selected variables that best discriminate

           among the individual groups.  Interleave as in method #1.

   3)  For methods #1 and #2, create such centroid "observations" for each group based on standardized values (the deviation

           of the group-specific mean from the mean across all groups divided by the group-specific standard deviations).  Standardized

           the new cases in the same way, and interleave as before.

   4)  Sort the observations that PROC DISCRIM assigns to the same group together.  Then interleave the new observations within

           these groups of observations.  New observations always or "mostly" surrounded by old observations in the same group are

           assigned to that group.   New observations split among observations from different groups are assigned to an indeterminate

           group.   You again can use all the original variables used to classify the observations into groups or only the variables that best

           descriminate among these groups.

The first two and the fourth methods only require sorting observations by the variable values.  The third method does require some manipulation (standardization) befor sorting.

JavierAbalo
Calcite | Level 5

Thank you 1zmm and PGStats.

PGStats, I appreciate the suggestion to use of DT instead, but in this case is more of a methodological issue trying to solve this conundrum of scoring new cases after k-neighbors discrimiant.

1zmm, I built a small dataset and ran some tests following your advice. I cannot match my results with those I get in SAS scoring.

I have been researching more on the topic (see link in previous post) and there is one critical issue I must solve: How to estimate group-specific densities for each new case. Quoting SAS, "Nonparametric discriminant methods are based on nonparametric estimates of group-specific probability densities". After densities are estimated "either Mahalanobis or Euclidean distance can be used to determine proximity. When the k-nearest-neighbor method is used, the Mahalanobis distances are based on the pooled covaraince matrix".

More specifically, from SAS manual referenced above

Density estimation.png

When scoring new cases in SAS with the TESTDATA= option, I can see the group-specific calculated densities for each case. I have been unable to match those numbers "manually".

This is going out of scope for me. I need to find a way to do this efficiently with the output provided by SAS in the analysis (i.e., pooled covariance matrix, square distances, group means, etc.). I will be asking SAS support and will post any successful answer I get (if any).

I very much appreciate all your comments. Thank you

JavierAbalo
Calcite | Level 5

I got a clear answer from SAS. Summarizing, it is not possible to score new cases outside SAS with a NPAR discriminant model created in SAS.

I reproduce SAS response:

"Nonparametric methods do not produce parameters, by definition, so you cannot get an equivalent to the OUTSTAT= data set when you use METHOD=NORMAL. The is no closed-form solution for this and the entire DATA= data set is required for classification each time. In other words, there is not way to generate summary output from a nonparametric discrimination analysis that can be used to classify new observations. You have to use PROC DISCRIM to classify the new obs"

PGStats
Opal | Level 21

Hi, you might get a more useful non-parametric classification model by fitting a partition tree to your data. Partition (also called decision) trees are easier to interpret and use than discriminant models. JMP has a nice interface for building simple partition models. If you have access to JMP, you should give it a try.

hth

PG

PG

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 2522 views
  • 3 likes
  • 4 in conversation