Mistake with the distance procedure

Occasional Contributor
Posts: 17

Hello all,

I have been trying to produce a Bray-Curtis dissimilarity matrix for cluster analysis, but the output from proc distance does not appear to be giving me accurate values; the dissimilarities are either 0.5 or very close to that.  I have calculated the matrix 'by hand' using the formula:

 BC_{ij} = \frac{2C_{ij}}{S_i + S_j}

...and the results I get from the hand calculations are the same as produced in the R package vegan.

Can anyone see a mistake with my SAS code or lend any other insight to help me out?

proc distance data=rawdat out=dist method=braycurtis;

     var anominal (var1--var22);

     id location;

run;

The rawdat are log10 transformed count data, ranging from 0 to 3.4 after transformation.

Thanks in advance.

Super User
Posts: 17,958

Re: Mistake with the distance procedure

Sounds like a question for Tech Support to me.

I'm guessing, and assuming you've already checked this, but your formula appears different from the one in the documentation. It looks more like the Sørensen (Dice) coefficient, which is 1-BRAYCURTIS.
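
A quick way to check which direction a measure runs is to compute both quantities by hand. The DATA step below is only a sketch: the two count vectors are made up, and it assumes the usual reading of the posted formula, in which C_ij is the sum of the smaller of the two counts for each species and S_i, S_j are the site totals.

```sas
data _null_;
   /* Made-up counts for two hypothetical sites, four species each */
   array x(4) _temporary_ (6 7 4 0);
   array y(4) _temporary_ (10 0 6 3);

   c = 0; s_i = 0; s_j = 0; absdiff = 0;
   do k = 1 to dim(x);
      c       = c + min(x(k), y(k));         /* C_ij: sum of the lesser counts */
      s_i     = s_i + x(k);                  /* S_i: total for site i          */
      s_j     = s_j + y(k);                  /* S_j: total for site j          */
      absdiff = absdiff + abs(x(k) - y(k));  /* sum of absolute differences    */
   end;

   similarity    = 2 * c / (s_i + s_j);      /* the formula as posted          */
   dissimilarity = 1 - similarity;           /* its complement                 */
   check         = absdiff / (s_i + s_j);    /* same value as dissimilarity    */
   put similarity= dissimilarity= check=;
run;
```

With these numbers C = 10 and S_i + S_j = 36, so the posted formula gives 20/36 (about 0.556) and its complement is 16/36 (about 0.444). A value near 1 for two very similar sites would therefore indicate a similarity, not a dissimilarity.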

Respected Advisor
Posts: 4,660

Re: Mistake with the distance procedure

I suspect the problem resides with the log transformation (the BC distances might be calculated with integer-rounded values). Try calculating the distances with untransformed counts.

PG

Respected Advisor
Posts: 4,660

Re: Mistake with the distance procedure

But wait! B-C requires matches... So it might be the other way around: what you are missing IS the rounding. With VARi = floor(log10(COUNTi)), you would be matching the counts that have the same order of magnitude and not the exact same value. That might be it!
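
To see why exact matching and order-of-magnitude matching behave so differently, here is a small sketch that counts matches both ways for a single pair of samples. The counts and the two hypothetical sites are invented for illustration; only the log10 and floor(log10) transformations come from the discussion above.

```sas
data _null_;
   /* Made-up counts for two hypothetical sites, three taxa each */
   array a(3) _temporary_ (120 1500 20);
   array b(3) _temporary_ (340 4200 450);

   exact   = 0;   /* taxa whose log10 values are identical        */
   rounded = 0;   /* taxa whose floor(log10) values are identical */
   do k = 1 to dim(a);
      if log10(a(k)) = log10(b(k)) then exact = exact + 1;
      if floor(log10(a(k))) = floor(log10(b(k))) then rounded = rounded + 1;
   end;

   put exact= rounded=;
run;
```

None of the three taxa match on the raw log10 values, while two of them (120 vs 340 and 1500 vs 4200) share an order of magnitude and match after flooring.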

PG

A simple test:

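data rawdat;
array mu(3) (100 1000 10000);   /* Poisson means for the three simulated counts */
array lcount(3);
length transform $12;
call streaminit(9876);          /* fixed seed so the test is reproducible */
do id = 1 to 5;
  location = put(id,2.);
  /* First copy: log10 of the simulated counts */
  transform = "Log10";
  do i = 1 to 3;
    lcount(i) = log10(rand("Poisson",mu(i)));
  end;
  output;
  /* Second copy: same values rounded down to the order of magnitude */
  transform = "Floor(log10)";
  do i = 1 to 3;
    lcount(i) = floor(lcount(i));
  end;
  output;
end;
run;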

proc sort data=rawdat; by transform location; run;

proc distance data=rawdat out=dist method=braycurtis;
by transform;
     var anominal (lcount:);
     id location;
run;

proc print data=dist; run;

PG

Message was edited to include test.

Occasional Contributor
Posts: 17

Re: Mistake with the distance procedure

Thanks for the insights, PG.  I ran the test program with both raw and log10-transformed data and ended up with the same result:

Obs    transform       location      _1         _2         _3         _4       _5

  1    Floor(log10)       1        0.00000     .          .          .          .
  2    Floor(log10)       2        0.16667    0.00000     .          .          .
  3    Floor(log10)       3        0.16667    0.33333    0.00000     .          .
  4    Floor(log10)       4        0.00000    0.16667    0.16667    0.00000     .
  5    Floor(log10)       5        0.16667    0.00000    0.33333    0.16667     0
  6    Log10              1        0.00000     .          .          .          .
  7    Log10              2        0.50000    0.00000     .          .          .
  8    Log10              3        0.50000    0.50000    0.00000     .          .
  9    Log10              4        0.50000    0.50000    0.50000    0.00000     .
 10    Log10              5        0.50000    0.50000    0.50000    0.50000     0

This may very well be a question for tech support.  I have used the distance procedure for other calculations and it appears to be fine.  However, with any method I try on 'anominal' type data, I run into a problem similar to this one, with or without transforming the data before running the procedure.

Cheers,

Ely

Respected Advisor
Posts: 4,660

Re: Mistake with the distance procedure

I don't think there is a bug. The distances are fine with the rounded values (the upper matrix in your output). The BC distance is a measure of match-mismatch. If your counts are relatively large, they almost never match exactly, hence the 0.5 distance. The rounded log10 counts will match when the counts are of the same order of magnitude, which makes more sense.
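
Applied to the data from the original question, the rounding step could look like the sketch below. It reuses the names from the first post (rawdat, var1--var22, location) and assumes those variables are all numeric, already log10-transformed, and stored next to each other in the data set. The output names rounded and dist_rounded are placeholders.

```sas
/* Round the log10-transformed values down so that counts of the
   same order of magnitude become the same nominal level. */
data rounded;
   set rawdat;
   array v(*) var1--var22;   /* positional list: relies on variable order in rawdat */
   do k = 1 to dim(v);
      v(k) = floor(v(k));
   end;
   drop k;
run;

/* Same call as in the original post, now on the rounded values */
proc distance data=rounded out=dist_rounded method=braycurtis;
   var anominal (var1--var22);
   id location;
run;
```

Whether flooring is the right amount of rounding depends on the data. It treats a count of 100 and a count of 999 as a match, which may or may not be appropriate for the cluster analysis.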

PG
