05-28-2013 05:12 PM
Regarding PROC DISCRIM, I have several questions about the output. Let's consider SAS/STAT's example of Puranen's fish.
1) Every book on discriminant analysis that I've read states that the number of discriminant functions is equal to the number of classes in the target variable minus one. (e.g. Page 240, "Multivariate Data Analysis", 7th edition, Hair, Black, Babin, and Anderson)
In this fish example, there are 7 species, so there should be 6 discriminant functions. Why, then, are there 7 discriminant functions in Figure 31.4?
2) To use a discriminant function to assign an observation to a class, one must know
- which classes are being distinguished by the discriminant function
- the cut-off value that is used to decide which discriminant scores map to which class
Where can I find these pieces of information in the output?
3) In Figure 31.5, I struggle to determine
- whether the rows or the columns denote the true species, and
- whether the rows or the columns denote the predicted species
By the "Error Count Estimates for Species" table, it seems like 3 Perch fish were misclassified as Smelt, so it seems like the rows (the labels listed from top to bottom on the left) denote the true species, and the columns (the labels listed from left to right on the top) denote the predicted species.
Am I right? Please help.
4) The CROSSVALIDATE option seems to allow only N-fold cross-validation, where N is the number of observations. Is it possible to do K-fold cross-validation with PROC DISCRIM, where K is some integer less than or equal to N?
5) How does the information in Figures 31.2 and 31.3 help to discriminate the observations? I struggle to see how the rank of the covariance matrix, the natural log of the determinant of the covariance matrix, or the squared distances between the classes are of any help in discriminating the observations.
Thanks for your help.
05-29-2013 10:32 AM
1) PROC DISCRIM parameterizes the discriminant functions differently from other programs. You can estimate the posterior probability of any given observation being in a particular "dependent variable" category using the discriminant function for that category and the prior probability of that category.
2) As the documentation states, PROC DISCRIM places each observation in the category from which it has the smallest generalized squared distance. In the fish example, this distance is the squared Mahalanobis distance between the observation and the category mean, computed with the pooled covariance matrix and adjusted for the category's prior probability. The discriminant score for each category is -0.5 times this distance. For each observation, the OUT= data set includes the posterior probabilities by category, and the OUTD= data set includes the category-specific density estimates. You can also specify the SCORES option in the PROC DISCRIM statement to add the discriminant function scores for each category to the OUT= data set.
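To make the rule above concrete, here is a minimal Python sketch (not SAS code) of how generalized squared distances map to posterior probabilities and to a class assignment. It assumes the relation p(j | x) = exp(-0.5 D_j^2) / sum_k exp(-0.5 D_k^2) from the PROC DISCRIM background documentation. The function names and the distance values are made up for illustration and are not taken from the fish output.

```python
import math

def posteriors_from_distances(gen_sq_dist):
    """Convert generalized squared distances D_j^2 into posterior probabilities."""
    scores = {j: -0.5 * d for j, d in gen_sq_dist.items()}  # discriminant scores
    top = max(scores.values())  # subtract the max before exp() for stability
    weights = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def classify(gen_sq_dist):
    """Assign the observation to the class with the smallest distance."""
    return min(gen_sq_dist, key=gen_sq_dist.get)

# Hypothetical distances for a single observation.
example = {"Bream": 12.4, "Perch": 3.1, "Smelt": 5.6}
post = posteriors_from_distances(example)
assigned = classify(example)
```

The class with the smallest distance is always the class with the largest posterior, so no separate cut-off value is needed.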
3) Yes, you are right. The rows in Figure 31.5 are the true species of each observation, and the columns are the species into which the discriminant functions classify it.
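As a small illustration of that layout, the Python sketch below tabulates (true, predicted) pairs with rows as true classes and columns as predicted classes. The labels are hypothetical and are not the fish data. `cross_tab` is just an illustrative helper name.

```python
from collections import Counter

def cross_tab(true_labels, predicted_labels):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    counts = Counter(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    return {t: {p: counts[(t, p)] for p in classes} for t in classes}

# Hypothetical labels for six observations.
true_species = ["Perch", "Perch", "Perch", "Smelt", "Smelt", "Bream"]
pred_species = ["Perch", "Smelt", "Perch", "Smelt", "Smelt", "Bream"]
table = cross_tab(true_species, pred_species)

# table["Perch"]["Smelt"] is the number of true Perch classified as Smelt.
perch_as_smelt = table["Perch"]["Smelt"]
```

Each row sums to the number of observations that truly belong to that class, which is a quick way to check the orientation of any such table.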
4) The CROSSVALIDATE option uses the other N-1 observations to classify any one observation. You can use the K= or KPROP= options of the PROC DISCRIM statement (these apply to the nonparametric nearest-neighbor method) to classify an observation from only its k nearest neighbors, or from a specified proportion of the observations taken as its nearest neighbors. I do not think that you can use a random sample of K observations to classify a specific observation.
5) Figure 31.1 describes the number of observations used, the actual probability of each category, and the default prior probability of each category in the discriminant analysis. The PRIORS statement allows you to change the prior probabilities from their default of being equal (that is, independent of the sample size in the categories). Changing this statement would change the estimated discriminant functions. Figure 31.2 shows the default information for the pooled covariance matrix, which is the basis of the measure of squared distance; you can instead base these distances on the individual within-category covariance matrices (option POOL=NO).
05-30-2013 09:14 PM
Thank you, 1zmm, for your detailed reply. SAS Support gave me some more useful answers.
1) It turns out that G, the number of discriminant functions, equals K, the number of classes in the target variable, but it is possible to reduce G to K - 1.
4) K-fold cross-validation, which involves splitting the training set into K partitions and rotating each partition as the validation set, is NOT available in PROC DISCRIM except for K = N, where N is the number of observations (i.e. only leave-one-out cross-validation is available).
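For readers who want to see the partition-and-rotate idea spelled out, here is a minimal Python sketch of the index bookkeeping only. It does not call SAS. `k_fold_indices` and its `seed` parameter are hypothetical names chosen for this illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for K-fold cross-validation."""
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # random assignment to folds
    folds = [idx[i::k] for i in range(k)]  # k nearly equal partitions
    for i, held_out in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, held_out

splits = list(k_fold_indices(n=10, k=5))
loo = list(k_fold_indices(n=10, k=10))  # K = N reduces to leave-one-out
```

With K = N every validation fold holds exactly one observation, which is the case the CROSSVALIDATE option covers.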
5) Figures 31.2 and 31.3 show statistics about the data set, but do not aid the building of the discriminant functions in any way. (Yes, you are right - changing the priors will change the discriminant functions, but that shows up in Figure 31.1, which I did not ask about.)
I have a new question.
7) How does PROC DISCRIM calculate the discriminant functions? The documentation does not provide the details. (Yes, I see how the discriminant scores are calculated, as 1zmm pointed out, but how are the discriminant functions calculated?) What makes this issue even more confusing is that "Overview: CANDISC PROCEDURE" states in the 4th paragraph that its method of calculating the canonical variables (or discriminant functions) is different from PROC DISCRIM. (I had no idea that there could be more than 1 way to calculate them!)
PROC CANDISC's documentation shows the details on how the canonical variables (discriminant functions) are calculated, but not so for PROC DISCRIM. Thus, once again, I ask: How does PROC DISCRIM calculate the discriminant functions?
Thanks in advance for reading and sharing your advice!
05-30-2013 10:06 PM
Proc DISCRIM output gives these equations for discriminant function J when pool=yes
Linear Discriminant Function
$$\text{Constant}_j = -\tfrac{1}{2}\,\bar{X}_j'\,\mathrm{COV}^{-1}\,\bar{X}_j + \ln \mathrm{PRIOR}_j$$
$$\text{Coefficient Vector}_j = \mathrm{COV}^{-1}\,\bar{X}_j$$
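A rough Python sketch of these two formulas for a two-variable problem follows. It assumes COV is the pooled covariance matrix and the X-bar vector is the class mean, as in the output above. The helper names and the numbers are invented for illustration.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_discriminant(mean_j, pooled_cov, prior_j):
    """Return (constant, coefficient vector) for class j."""
    coef = [dot(row, mean_j) for row in inv2(pooled_cov)]  # COV^{-1} * mean_j
    const = -0.5 * dot(mean_j, coef) + math.log(prior_j)
    return const, coef

def score(x, const, coef):
    """Linear discriminant score of observation x for one class."""
    return const + dot(coef, x)

pooled = [[2.0, 0.0], [0.0, 1.0]]  # made-up pooled covariance matrix
const_a, coef_a = linear_discriminant([2.0, 0.0], pooled, 0.5)
const_b, coef_b = linear_discriminant([0.0, 2.0], pooled, 0.5)
x = [2.0, 0.0]
winner = "A" if score(x, const_a, coef_a) > score(x, const_b, coef_b) else "B"
```

The linear score differs from -0.5 times the generalized squared distance only by the term -0.5 x' COV^{-1} x, which is the same for every class, so both give the same assignment.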
05-31-2013 10:12 AM
I wrote the output in LaTeX for you in the attached file.
PGStats: Thank you.
a) How did you generate that output?
b) I think that I understand how the discriminant functions are calculated now. Allow me to try to explain it.
The discriminant function is merely the generalized squared distance, which takes into account the possibilities that
- the covariance matrices may not be equal
- the prior probabilities may not be equal
See the PROC DISCRIM background page under "Parametric Methods" for more details.
Thus, the discrimination can be done by either
- looking for the class with the highest posterior probability as calculated by Bayes' rule
- looking for the class with the shortest generalized squared distance
Is my explanation correct?
05-31-2013 10:43 AM
a) I ran Example 31.4 here with the Listing output destination open.
b) Yes, except that the discriminant functions are quadratic when Pool=no, i.e. when the covariance matrices are not equal.
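To illustrate the quadratic case, here is a hedged Python sketch for two variables. It assumes, as I read the "Parametric Methods" background page, that with POOL=NO the generalized squared distance uses each class's own covariance matrix and adds ln|COV_j|, with -2 ln(prior_j) added for the priors. The function name and the numbers are made up.

```python
import math

def quadratic_score(x, mean_j, cov_j, prior_j):
    """-0.5 * generalized squared distance for class j (two variables)."""
    (a, b), (c, d) = cov_j
    det = a * d - b * c
    dx, dy = x[0] - mean_j[0], x[1] - mean_j[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (maha + math.log(det)) + math.log(prior_j)

# Two hypothetical classes with the same mean but different spread.
tight = ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5)
wide = ([0.0, 0.0], [[4.0, 0.0], [0.0, 4.0]], 0.5)

near = [0.0, 0.0]
far = [3.0, 0.0]
near_goes_to_tight = quadratic_score(near, *tight) > quadratic_score(near, *wide)
far_goes_to_wide = quadratic_score(far, *wide) > quadratic_score(far, *tight)
```

Because the squared terms in x no longer cancel between classes, the boundary here is a circle around the shared mean, something a linear function cannot produce.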
12-06-2015 03:06 PM
I'm not sure if anyone would still be looking at a post from years ago, but I've found this discussion thread very helpful in figuring out how to best use PROC DISCRIM. I would like to have SAS display the quadratic discriminant functions because the covariance matrices are not equal in my data. Quadratic functions are better at predicting correct groups than the linear functions, but when I specify POOL=NO in the PROC DISCRIM statement, the output no longer displays the discriminant functions. (Whereas POOL=YES, which is the default, gives me the linear functions, like in Table 31.4 in the fish example). I need to have the functions for reporting purposes, so that others can plug new data into the functions to classify individuals in the future.
Any help would be much appreciated. Thank you!
05-31-2013 01:36 PM
Thanks, PGStats, for your help with my Question #7.
I would appreciate anyone's help with a few more questions about model performance.
8) I can't find any way on the PROC DISCRIM syntax page to produce an ROC curve. This is available in JMP, so it seems strange that it's not available in SAS. Does PROC DISCRIM provide ROC curves?
9) What is the best way to assess the predictive accuracy of a discriminant analysis model? I like leave-one-out cross-validation, but I welcome your feedback/comments on other ways to assess its performance.