Hello all,
I have been trying to produce a Bray-Curtis dissimilarity matrix for cluster analysis, but the output from proc distance does not appear to be giving me accurate values; the dissimilarities are either 0.5 or very close to that. I have calculated the matrix 'by hand' using the formula:
...and the results I get from the hand calculations are the same as produced in the R package vegan.
Can anyone see a mistake with my SAS code or lend any other insight to help me out?
proc distance data=rawdat out=dist method=braycurtis;
    var anominal (var1--var22);
    id location;
run;
The rawdat are log10 transformed count data, ranging from 0 to 3.4 after transformation.
Thanks in advance.
Sounds like a question for Tech Support to me.
I'm guessing, and assuming you've already checked this, but your formula appears to differ from the one in the documentation. It looks more like the Sorensen (Dice) coefficient, which is 1-BRAYCURTIS....
I suspect the problem resides with the log transformation (the BC distances might be calculated with integer-rounded values). Try calculating the distances with untransformed counts.
PG
But wait! B-C requires matches... So it might be the other way around: what you are missing IS the rounding. With VARi = floor(log10(COUNTi)), you would be matching the counts that have the same order of magnitude and not the exact same value. That might be it!
PG
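To make the order-of-magnitude idea concrete, here is a small sketch in Python (not SAS), using made-up counts that are not from the original data. The helper name floor_log10 is hypothetical. It shows that the raw log10 values of two samples never coincide exactly, while their floored values agree whenever the counts share a power of ten.

```python
import math

def floor_log10(counts):
    """Return floor(log10(c)) for each strictly positive count."""
    return [math.floor(math.log10(c)) for c in counts]

# Illustrative counts for two locations (invented for this sketch).
site_a = [120, 1500, 30]
site_b = [180, 1100, 400]

log_a = [math.log10(c) for c in site_a]
log_b = [math.log10(c) for c in site_b]

# Exact matches on the unrounded log10 values: none.
exact_matches = sum(1 for a, b in zip(log_a, log_b) if a == b)

# Matches after flooring: the first two variables share an order of magnitude.
floored_matches = sum(
    1 for a, b in zip(floor_log10(site_a), floor_log10(site_b)) if a == b
)

print(floor_log10(site_a), floor_log10(site_b))
print(exact_matches, floored_matches)
```

Note that counts sitting close to a power of ten (say 97 and 104) land in different floor bins, so flooring can also split values that are numerically very close.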
A simple test:
data rawdat;
    array mu(3) (100 1000 10000);
    array lcount(3);
    length transform $12;
    call streaminit(9876);
    do id = 1 to 5;
        location = put(id,2.);
        transform = "Log10";
        do i = 1 to 3;
            lcount(i) = log10(rand("Poisson",mu(i)));
        end;
        output;
        transform = "Floor(log10)";
        do i = 1 to 3;
            lcount(i) = floor(lcount(i));
        end;
        output;
    end;
run;
proc sort data=rawdat; by transform location; run;
proc distance data=rawdat out=dist method=braycurtis;
    by transform;
    var anominal (lcount:);
    id location;
run;
proc print data=dist; run;
PG
Message was edited to include test.
Thanks for the insights, PG. I ran the test program with both raw and log10-transformed data and ended up with the same result:
Obs  transform     location       _1       _2       _3       _4  _5
  1  Floor(log10)  1         0.00000        .        .        .   .
  2  Floor(log10)  2         0.16667  0.00000        .        .   .
  3  Floor(log10)  3         0.16667  0.33333  0.00000        .   .
  4  Floor(log10)  4         0.00000  0.16667  0.16667  0.00000   .
  5  Floor(log10)  5         0.16667  0.00000  0.33333  0.16667   0
  6  Log10         1         0.00000        .        .        .   .
  7  Log10         2         0.50000  0.00000        .        .   .
  8  Log10         3         0.50000  0.50000  0.00000        .   .
  9  Log10         4         0.50000  0.50000  0.50000  0.00000   .
 10  Log10         5         0.50000  0.50000  0.50000  0.50000   0
This may very well be a question for tech support. I have used the distance procedure for other calculations and it appears to be fine. However, with any method I try on 'anominal' type data, I run into a problem similar to this one, with or without transformation prior to running the procedure.
Cheers,
Ely
I don't think there is a bug. The distances are fine with the rounded values (the first matrix in the output). With anominal variables, the BC distance is a measure of match-mismatch. If your counts are relatively large, they almost never match exactly, hence the 0.5 distance. The rounded log10 counts will match when the counts are of the same order of magnitude, which makes more sense.
PG
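For anyone who wants to check the numbers outside SAS, here is a rough sketch in Python. It rests on two assumptions that are not confirmed anywhere in this thread. First, mismatch_dissimilarity uses mismatches / (2 * number of variables), a formula chosen only because it reproduces the printed values (1/6, 2/6 and 3/6 with three variables); it is not taken from the SAS documentation. Second, bray_curtis uses the usual quantitative form sum|x - y| / sum(x + y), which is what vegan's default "bray" method computes; the hand formula from the original post is not shown in the thread, so it may differ. Both function names and the sample values are invented for illustration.

```python
def mismatch_dissimilarity(x, y):
    """Assumed match-mismatch measure: mismatches / (2 * n variables).

    Inferred from the printed output only, not from SAS documentation.
    """
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / (2 * len(x))

def bray_curtis(x, y):
    """Quantitative Bray-Curtis: sum|x - y| / sum(x + y)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator

# Two made-up locations with nearly identical log10 counts.
x = [2.01, 3.00, 4.01]
y = [1.98, 2.99, 3.99]

# No value matches exactly, so the match-mismatch measure is at its maximum.
print(mismatch_dissimilarity(x, y))

# The quantitative measure sees that the values are almost equal.
print(bray_curtis(x, y))

# Rounded values that differ in one of three variables.
print(mismatch_dissimilarity([2, 3, 4], [2, 2, 4]))
```

Under these assumptions, continuous values treated as nominal categories almost always give 0.5, while the quantitative formula gives a value close to 0 for the same pair. That would account for the gap between the PROC DISTANCE output and the hand or vegan results.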