joneryn
Calcite | Level 5

I have a data set like this:

    M1  M2  M3  M4  M5  M6  M7  M8  M9  M10 M11 M12 M13 M14
F1  1   1   0   0   0   0   0   0   0   0   0   0   0   0
F2  1   1   0   0   0   0   0   0   0   0   0   0   0   0
F3  0   0   1   0   0   0   0   0   0   0   0   0   1   0
F4  0   0   1   1   0   0   0   0   0   0   0   0   1   0
F5  0   0   0   0   1   1   1   0   0   0   0   0   0   0
F6  0   0   0   0   0   1   1   1   0   0   0   0   0   0
F7  0   0   0   0   0   0   0   0   1   1   1   0   0   0
F8  0   0   0   0   0   0   0   0   0   0   1   1   0   0
F9  0   0   0   0   0   0   0   0   0   0   1   1   0   0
F10 0   0   0   0   0   0   0   0   0   0   0   0   1   1

How can I transform this data matrix into a dissimilarity matrix using the Jaccard index?

How do I then calculate the distance between each pair of F1-F10, that is, the distance matrix?

Based on this, I want to run a cluster analysis on F1-F10.

I'm a beginner, so I would really like to know how to program it.

Thank you very much!

2 REPLIES
PGStats
Opal | Level 21

The Jaccard index is a similarity measure. For clustering, you need a dissimilarity measure (a distance) such as DJACCARD or Bray-Curtis. You can check the definitions in the SAS documentation at:

http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_distance_sec...

or in this reference:

Legendre, Pierre & Louis Legendre. 1998. Numerical ecology. 2nd English edition. Elsevier Science BV, Amsterdam. xv + 853 pages.
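For reference, the link between the similarity index and the distance can be written out. This is the standard set-based definition, and it assumes that DJACCARD corresponds to one minus the Jaccard index for presence/absence data (check the SAS documentation for the exact formula). Here A and B are the sets of M columns in which each row has a 1; columns where both rows are 0 do not count.

```latex
J(A,B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A,B) = 1 - J(A,B)

% Worked example with F3 = {M3, M13} and F4 = {M3, M4, M13}:
J(F3,F4) = \frac{2}{3}, \qquad d_J(F3,F4) = 1 - \frac{2}{3} = \frac{1}{3}
```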

Here is how to do it in SAS:

data test;
input id $ M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14;
datalines;
F1  1  1  0  0  0  0  0  0  0  0  0  0  0  0
F2  1  1  0  0  0  0  0  0  0  0  0  0  0  0
F3  0  0  1  0  0  0  0  0  0  0  0  0  1  0
F4  0  0  1  1  0  0  0  0  0  0  0  0  1  0
F5  0  0  0  0  1  1  1  0  0  0  0  0  0  0
F6  0  0  0  0  0  1  1  1  0  0  0  0  0  0
F7  0  0  0  0  0  0  0  0  1  1  1  0  0  0
F8  0  0  0  0  0  0  0  0  0  0  1  1  0  0
F9  0  0  0  0  0  0  0  0  0  0  1  1  0  0
F10 0  0  0  0  0  0  0  0  0  0  0  0  1  1
;

proc distance data=test method= /*BRAYCURTIS*/ DJACCARD out=testDist;
var anominal(M: / absent=0);   /* M: means all variable names starting with M */
id id;
run;

proc cluster method=AVERAGE data=testDist outtree=testTree print=0;
ID id;
run;

The CLUSTER procedure will give you a dendrogram by default and you can use the testTree dataset as input to PROC TREE for further manipulation.
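If you want to sanity-check the distance matrix outside SAS, the short Python sketch below recomputes it by hand. It is not part of the SAS solution; it assumes the usual definition (one minus shared presences over total presences), and the names `rows`, `jaccard_distance` and `dist` are made up for this illustration.

```python
# Presence/absence rows copied from the question (columns M1..M14).
rows = {
    "F1":  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "F2":  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "F3":  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "F4":  [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "F5":  [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "F6":  [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "F7":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "F8":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    "F9":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    "F10": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
}


def jaccard_distance(a, b):
    """Return 1 - |A and B| / |A or B| for two 0/1 lists of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero rows are treated as identical (distance 0).
    return 0.0 if either == 0 else 1.0 - both / either


names = list(rows)
# Full symmetric matrix stored as a dict keyed by (row, column) name pairs.
dist = {(p, q): jaccard_distance(rows[p], rows[q]) for p in names for q in names}

# Print the matrix, one row per F label.
for p in names:
    print(p.ljust(4), " ".join(f"{dist[p, q]:.2f}" for q in names))
```

Rows that share no M column, such as F1 and F10, come out at distance 1, and identical rows, such as F1 and F2, at distance 0.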

PG
joneryn
Calcite | Level 5

Thanks for your help! You have given me the motivation to learn it. Best wishes!
