tebert
Obsidian | Level 7

I want to calculate the canonical variables from the raw data. How do I do this? I can get PROC DISCRIM to print the values, but I cannot take the raw data and reproduce the values that SAS gives me.

 

Thank you for the help.

PGStats
Opal | Level 21

The canonical variable coefficients are written to a data set by the OUTSTAT= option in PROC CANDISC. Look at the observations with _TYPE_ = "SCORE". Note that the raw data must be standardized before these coefficients are applied, as explained in the Output Data Sets documentation of PROC CANDISC.

PG
tebert
Obsidian | Level 7
I use the outstat= option and I get a new data set. I can calculate the means and the standard deviation for each treatment, and the answers I get by hand match what I get from SAS. However, PSTD doesn't match. SAS gives 8.1381977. If I calculate the standard deviation for all the data I get 9.8188. If I calculate the standard deviation for each treatment and average them I get 8.130212. I tried normalizing with (x-mean)/sd but that did not come close. I am doing something wrong, but I just don't see it.
Rick_SAS
SAS Super FREQ

Make sure you are using the same denominator as SAS when you do the hand computations. There is the n vs (n-1) issue, and if you are using a FREQ variable or WEIGHT variable the denominator changes.
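To illustrate the n versus (n-1) point, here is a minimal sketch in Python (standard library only, not SAS). The sample values are made up purely for illustration and are not taken from any data set in this thread.

```python
import statistics

# Hypothetical sample, chosen only to make the arithmetic easy to follow.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Divides the sum of squared deviations by n (population formula).
sd_divide_by_n = statistics.pstdev(sample)

# Divides the sum of squared deviations by n-1 (sample formula).
sd_divide_by_n_minus_1 = statistics.stdev(sample)

print(sd_divide_by_n, sd_divide_by_n_minus_1)
```

The two results differ noticeably for small samples, so a hand computation that uses one denominator will not match software that uses the other.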

 

Why not use a standard SAS data set such as Sashelp.class? Then you can share your code and calculations, and we can follow along and correct any errors in the calculation.

tebert
Obsidian | Level 7

I didn't know about this. Great idea. Here is the program.

Proc discrim data=sashelp.iris crosslisterr distance ANOVA canonical outstat=good;
class species;
var SepalLength SepalWidth PetalLength PetalWidth;
run;

 

I look in the dataset "good" and I find "pstd" which I am assuming is the pooled standard deviation. The value in SAS is 5.14789 for SepalLength. How is this calculated?

Try 1) The overall mean for SepalLength is 58.43333 (row 5). The standard deviation for this mean is 8.280661 (STD in row 66). This isn't correct.

Try 2) The std for Setosa is 3.5248968, for Versicolor is 5.16171147, and for Virginica is 6.3587959 (rows 61-63). The mean of these three values is 5.015135 (calculated in Excel). This too is not correct.

Try 3) ?

Rick_SAS
SAS Super FREQ

Add the PCOV option to the PROC DISCRIM statement. Then the procedure will create the "Pooled Within-Class Covariance Matrix" (and also add that matrix to the OUTSTAT= data set).

The numbers in the PSTD row are the square roots of the diagonal elements of the pooled within-class covariance matrix. The formula for the covariance matrix is available in the SAS documentation for the CANDISC procedure.
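As a worked check of the PSTD value for SepalLength, here is a minimal Python sketch (not SAS) that pools the three class standard deviations quoted earlier in the thread. It assumes 50 observations per species in Sashelp.Iris and no FREQ or WEIGHT variable, so the pooled variance is the sum of (n_i - 1) * s_i^2 over the classes, divided by (N - k), where N is the total count and k is the number of classes. The function name `pooled_std` is introduced here for illustration only.

```python
import math

# Class standard deviations for SepalLength, as quoted in the thread.
class_std = [3.5248968, 5.16171147, 6.3587959]  # Setosa, Versicolor, Virginica

# Assumption: 50 observations per species, no FREQ or WEIGHT variable.
class_n = [50, 50, 50]

def pooled_std(stds, sizes):
    """Square root of the pooled within-class variance."""
    numerator = sum((n - 1) * s ** 2 for s, n in zip(stds, sizes))
    denominator = sum(sizes) - len(sizes)  # N - k degrees of freedom
    return math.sqrt(numerator / denominator)

pstd = pooled_std(class_std, class_n)
print(round(pstd, 4))  # approximately 5.1479
```

Averaging the three standard deviations (Try 2) gives a smaller number because the pooling is done on the variances, not on the standard deviations.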

tebert
Obsidian | Level 7

OK, I have gotten sidetracked.

Here is the revised program.

Proc discrim data=sashelp.iris pcov crosslisterr distance ANOVA canonical outstat=good out=goo;
class species;
var SepalLength SepalWidth PetalLength PetalWidth;
run;

 

In dataset "goo" I find two new variables can1 and can2. Is there part of the output (or some missing output) that I could use to calculate can1 based only on the measures provided in the iris dataset (SepalLength SepalWidth PetalLength PetalWidth)?

Rick_SAS
SAS Super FREQ

Yes, and the same documentation link that I sent tells you how to get it. You use the observations with _TYPE_="RAWSCORE" in the GOOD data set, which contain the raw canonical coefficients.

To get the canonical scores, center the data measurements and multiply them by the transpose of the raw canonical coefficients. If you have PROC IML, it looks like this:

/* display the raw canonical coefficients */
proc print data=good;
where _TYPE_="RAWSCORE";
run;

proc iml;
use Goo;
read all var {SepalLength SepalWidth PetalLength PetalWidth} into X; /* data matrix */
read all var {Can1 Can2} into Can; /* canonical scores computed by PROC DISCRIM */
close;

use Good where( _TYPE_="RAWSCORE" );
read all var {SepalLength SepalWidth PetalLength PetalWidth} into R; /* raw canonical coefficients, one row per canonical variable */
close;

/* center each column by its mean, then multiply by the transposed coefficients */
Score = (X-mean(X))*R`;  /* should be same as [Can1 Can2] */

/* check that the Score equals the values in [Can1 Can2] */
maxDiff = max(abs(Score-Can));
print maxDiff;      /* prints 1.776E-15, which shows that the values are equal */
quit;

 

You can also use PROC SCORE to confirm the computations.

 

 

tebert
Obsidian | Level 7

I guess I asked the question in the wrong way.

 

So let's say that I went out into the woods of Virginia and found a new iris plant. I took four measurements: SepalLength, SepalWidth, PetalLength, and PetalWidth. I read a manuscript on how to classify my new iris, and it gave a table of (???) from which I was able to calculate what can1 and can2 would have been had my new observation been included (and assuming that the new observation fit into the existing data). Since the raw data were not published in the manuscript, I don't have the option of redoing the analysis. I can now plot my new value onto the published graph and decide what plant I have found.

 

I had thought that the raw canonical coefficients were what I needed for "???" but that didn't seem to work. When I used them I did not get the value that SAS gave for Can1. 

Rick_SAS
SAS Super FREQ

When you score a new observation, it does not redo the analysis as if the new observation had been included. It merely evaluates the model on the new observation.  

 

If that is the process you want to carry out, then the raw canonical coefficients are one way to score the new data. To score a new observation:

1. Center the observation by subtracting the mean of the original data (coordinate by coordinate).

2. Multiply the centered observation by the raw canonical coefficients (matrix multiplication).

The documentation link I sent shows the formula for this computation, as well as two other equivalent computations.
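The two steps above can be sketched in Python (standard library only, not SAS). Every number below is a made-up placeholder, not output from PROC DISCRIM. In practice the means would come from the original data (or a published table) and the coefficients from the _TYPE_="RAWSCORE" rows of the OUTSTAT= data set. The function name `score_observation` is hypothetical.

```python
def score_observation(x, training_mean, raw_coefs):
    """Center x by the training mean, then take the dot product with
    each row of raw canonical coefficients (one row per canonical variable)."""
    centered = [xi - mi for xi, mi in zip(x, training_mean)]
    return [sum(c * v for c, v in zip(row, centered)) for row in raw_coefs]

# Placeholder values for illustration only (not the iris results).
training_mean = [50.0, 30.0, 40.0, 10.0]   # mean of each variable in the original data
raw_coefs = [
    [0.1, -0.2, 0.3, 0.4],   # hypothetical coefficients for Can1
    [0.0, 0.5, -0.1, 0.2],   # hypothetical coefficients for Can2
]
new_obs = [52.0, 29.0, 45.0, 12.0]          # measurements of the new plant

can1, can2 = score_observation(new_obs, training_mean, raw_coefs)
print(can1, can2)
```

Note that the mean used for centering is the mean of the original data, not a mean recomputed with the new observation included. An observation equal to the training mean scores zero on every canonical variable.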

 

 
