How do I calculate canonical variables in Proc Di...

01-29-2017 08:36 PM

I want to calculate the canonical variables from the raw data. How do I do this? I can get Proc Discrim to print the values, but I cannot start from the raw data and reproduce the values that SAS gives me.

Thank you for the help.

Posted in reply to tebert

01-29-2017 09:54 PM

The canonical variable coefficients are written to a data set by the outstat= option in proc candisc. Look at the observations with _TYPE_ = "SCORE". Note that the raw data must be standardized before these coefficients are applied, as explained in the Output Data Sets documentation of proc candisc.

PG

Posted in reply to PGStats

01-29-2017 10:55 PM

I use the outstat= option and I get a new data set. I can calculate the mean and the standard deviation for each treatment, and the answers I get by hand match what I get from SAS. However, PSTD doesn't match. SAS gives 8.1381977. If I calculate the standard deviation for all the data, I get 9.8188. If I calculate the standard deviation for each treatment and average them, I get 8.130212. I tried standardizing with (x - mean)/sd, but that did not come close. I am doing something wrong, but I just don't see it.

Posted in reply to tebert

01-30-2017 08:39 AM

Make sure you are using the same denominator as SAS when you do the hand computations. There is the n vs (n-1) issue, and if you are using a FREQ variable or WEIGHT variable the denominator changes.

Why not use a standard SAS data set such as Sashelp.class? Then you can share your code and calculations, and we can follow along and correct any errors in the calculation.

Posted in reply to Rick_SAS

01-30-2017 11:55 AM

I didn't know about this. Great idea. Here is the program.

Proc discrim data=sashelp.iris crosslisterr distance ANOVA canonical outstat=good;

class species;

var SepalLength SepalWidth PetalLength PetalWidth;

run;

I look in the dataset "good" and I find "pstd" which I am assuming is the pooled standard deviation. The value in SAS is 5.14789 for SepalLength. How is this calculated?

Try 1) The overall mean for SepalLength is 58.43333 (row 5). The overall standard deviation is 8.280661 (STD in row 66). This doesn't match PSTD.

Try 2) The std for Setosa is 3.5248968, for Versicolor is 5.16171147, and for Virginica is 6.3587959 (rows 61-63). The mean of these three values is 5.015135 (calculated in Excel). This doesn't match either.

Try 3) ?

Posted in reply to tebert

01-30-2017 01:11 PM

Add the PCOV option to the PROC DISCRIM statement. Then the procedure will create the "Pooled Within-Class Covariance Matrix" (and also add that matrix to the OUTSTAT= data set).

The numbers in the PSTD row are the square roots of the diagonal elements of the pooled within-class covariance matrix. The formula for the covariance matrix is available in the SAS documentation for the CANDISC procedure.
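
For reference, the pooled value can be reproduced by hand. This is only a sketch: it assumes the default $n_i - 1$ denominators, no FREQ or WEIGHT variable, and the 50 observations per species in Sashelp.Iris. The symbols $s_i$, $n_i$, $k$, and $N$ are notation introduced here, not names from the SAS output. The pooled variance is the degrees-of-freedom-weighted average of the class variances, so it averages variances rather than standard deviations:

$$
s_p = \sqrt{\frac{\sum_{i=1}^{k} (n_i - 1)\, s_i^2}{N - k}}
    = \sqrt{\frac{49\,\left(3.5248968^2 + 5.16171147^2 + 6.3587959^2\right)}{147}}
    \approx 5.1479
$$

That agrees with the PSTD value of 5.14789 reported for SepalLength, whereas averaging the three standard deviations directly gives the smaller 5.015135.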

Posted in reply to Rick_SAS

01-30-2017 03:50 PM

OK, I have gotten sidetracked.

Here is the revised program.

Proc discrim data=sashelp.iris pcov crosslisterr distance ANOVA canonical outstat=good out=goo;

class species;

var SepalLength SepalWidth PetalLength PetalWidth;

run;

In dataset "goo" I find two new variables can1 and can2. Is there part of the output (or some missing output) that I could use to calculate can1 based only on the measures provided in the iris dataset (SepalLength SepalWidth PetalLength PetalWidth)?

Posted in reply to tebert

01-30-2017 04:30 PM

Yes, and the same documentation link that I sent tells you how to get it. You use the _TYPE_="RAWSCORE" rows in the GOOD data set, which hold the raw canonical coefficients.

To get the canonical scores, multiply the centered data measurements by the transpose of the raw canonical coefficients. If you have PROC IML, it looks like this:

```
/* show the raw canonical coefficients stored by OUTSTAT= */
proc print data=good;
   where _TYPE_="RAWSCORE";
run;

proc iml;
use Goo;
read all var {SepalLength SepalWidth PetalLength PetalWidth} into X; /* data matrix */
read all var {Can1 Can2} into Can;                                   /* canonical scores */
close;
use Good where( _TYPE_="RAWSCORE" );
read all var {SepalLength SepalWidth PetalLength PetalWidth} into R; /* scoring coefficients */
close;
Score = (X-mean(X))*R`;   /* should be same as [Can1 Can2] */
/* check that the Score equals the values in [Can1 Can2] */
maxDiff = max(abs(Score-Can));
print maxDiff; /* prints 1.776E-15, which shows that the values are equal */
quit;
```

You can also use PROC SCORE to confirm the computations.

Posted in reply to Rick_SAS

01-30-2017 07:17 PM

I guess I asked the question in the wrong way.

So let's say that I went out into the woods of Virginia and found a new iris plant. I took four measurements: SepalLength, SepalWidth, PetalLength, and PetalWidth. I read a manuscript on how to classify my new iris, and it gave a table of (???) from which I was able to calculate what can1 and can2 would have been had my new observation been included (assuming that the new observation fits the existing data). Since the raw data were not published in the manuscript, I don't have the option of redoing the analysis. I can now plot my new value on the published graph and decide what plant I have found.

I had thought that the raw canonical coefficients were what I needed for "???" but that didn't seem to work. When I used them I did not get the value that SAS gave for Can1.

Solution

01-31-2017
02:40 PM

Posted in reply to tebert

01-30-2017 08:36 PM

When you score a new observation, SAS does not redo the analysis as if the new observation had been included. It merely evaluates the fitted model on the new observation.

If that is the process you want to carry out, then the raw canonical coefficients are one way to score the new data. To score a new observation:

1. Center the observation by subtracting the mean of the original data (coordinate by coordinate).

2. Multiply the centered observation by the raw canonical coefficients (a matrix multiplication).

The documentation link I sent shows the formula for this computation, as well as two other equivalent computations.
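
The two steps above might look like this in PROC IML. This is a minimal sketch, not a definitive implementation. It assumes the OUTSTAT= data set is named Good as earlier in the thread, that the overall mean sits in the _TYPE_="MEAN" row whose Species value is blank, and that the new flower is measured in millimeters like Sashelp.Iris. The measurements in xNew are made up for illustration.

```
/* fit the model once and save the summary statistics */
proc discrim data=sashelp.iris canonical outstat=good;
   class species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc iml;
varNames = {SepalLength SepalWidth PetalLength PetalWidth};

/* assumption: the overall mean is the MEAN row with a blank class value */
use Good where( _TYPE_="MEAN" & Species=" " );
read all var varNames into mu;   /* 1 x 4 mean of the original data */
close;

use Good where( _TYPE_="RAWSCORE" );
read all var varNames into R;    /* 2 x 4 raw canonical coefficients */
close;

xNew = {51 35 14 2};             /* hypothetical new iris, in mm */

/* step 1: center by the original mean; step 2: multiply by the coefficients */
score = (xNew - mu) * R`;
print score[colname={"Can1" "Can2"}];
quit;
```

If only a published table is available, the same last two statements apply after typing the reported means and raw coefficients into mu and R by hand.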