Calcite | Level 5

## String comparison - Jaccard distance

I have a Jaccard score from R for comparing two strings to check their similarity/dissimilarity. I tried to replicate the same in SAS but couldn't achieve it. Is there a function or some other way to get a Jaccard score in SAS for comparing the two strings "Krishna" and "Krishna Reddy"? I tried PROC DISTANCE, but had no luck.

In R, after `library(stringdist)`, the call `stringdist('krishna', 'krishna reddy', method='jaccard')` returns 0.3636.

Accepted Solutions
SAS Employee

## Re: String comparison - Jaccard distance

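```sas
/* Split a string into overlapping k-character shingles (n-grams), one per row. */
%macro kshingling
(string
,k=5
,out=&sysmacroname.
)
;

data &out.;
  /* Replace each whitespace character with a plain blank, then trim. */
  string = strip(prxchange('s#\s# #',-1,symget('string')));
  do _n_ = 1 to lengthn(string)-&k.+1;
    ngram = substr(string,_n_,&k.);
    output;
  end;
run;

%mend;

%macro jaccard
(string1
,string2
)
;

/* Shingle both strings (k=2 here; change k in both calls to use another size). */
%kshingling(&string1.,k=2,out=s1)
%kshingling(&string2.,k=2,out=s2)

/* Stack the two shingle tables. */
proc append base=s1 data=s2; run;

/* Count each shingle within each string. */
proc freq data=s1 noprint;
  tables string*ngram / out=s2;
run;

/* One row per string, one column per shingle. */
proc transpose data=s2 out=s1(drop=_name_ _label_);
  by string notsorted;
  var count;
  id ngram;
run;

/* Shingles that a string does not contain become 0 instead of missing. */
proc stdize data=s1 out=s2 missing=0 reponly;
  var _numeric_;
run;

/* Pairwise Jaccard coefficient between the two rows. */
proc distance data=s2 method=jaccard absent=0 out=s1;
  var anominal(_numeric_);
  id string;
run;

/* Read the coefficient for the pair; assumes string1 is a valid SAS variable name. */
proc sql;
  select &string1. as jaccard
  into :jaccard
  from s1
  where string="&string2.";
quit;
%mend;

%jaccard(krishna,krishna reddy)
```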

This was put together quickly. It does not match the result from the R package for your example, but it does match most other Jaccard similarity metrics I have used. You can adjust the value of k (set to k=2 in the two %kshingling calls inside %jaccard) to get different values. I believe setting k=5 will give you approximately the result in R (0.333...).
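To make the effect of k concrete, here is a minimal sketch in Python (not SAS) of Jaccard similarity over character k-shingle sets. The helper names `shingles` and `jaccard_similarity` are made up for illustration and are not part of SAS or the stringdist package. It assumes, as the stringdist documentation describes, that `stringdist` defaults to q = 1 and reports a distance (1 minus similarity), which would explain why k=1 reproduces the 0.3636 from the question while k=5 only comes close by giving a similarity of about 0.333.

```python
def shingles(text: str, k: int) -> set:
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int) -> float:
    """Size of the intersection divided by size of the union of the k-shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    if not union:
        # Assumed convention for this sketch: two strings shorter than k count as identical.
        return 1.0
    return len(sa & sb) / len(union)


for k in (1, 2, 5):
    sim = jaccard_similarity("krishna", "krishna reddy", k)
    print(f"k={k}: similarity={sim:.4f}, distance={1 - sim:.4f}")
# k=1: similarity=0.6364, distance=0.3636
# k=2: similarity=0.5000, distance=0.5000
# k=5: similarity=0.3333, distance=0.6667
```

The sketch uses sets, so repeated shingles count once; whether a given SAS or R routine treats repeats the same way should be checked against its documentation.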

Super User

## Re: String comparison - Jaccard distance

I don't know of a quick way to get a Jaccard score, but SAS has two edit-distance functions, COMPGED and COMPLEV, that may work for your purpose.

```sas
data _null_;
  length x y $ 50;
  x = 'krishna';
  y = 'krishna reddy';
  compg = compged(x,y);  /* generalized edit distance */
  compl = complev(x,y);  /* Levenshtein edit distance */
  put compg= compl=;
run;
```

The CALL COMPCOST routine can additionally be used to assign different weights (costs) to the operations used by COMPGED.
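For readers unfamiliar with edit distance, the sketch below shows the plain Levenshtein distance in Python (not SAS), which is the kind of measure COMPLEV is documented to compute. It is an illustration only: the function name `levenshtein` is made up here, every operation costs 1, and it makes no attempt to mimic the weighted operations of COMPGED or SAS's handling of trailing blanks.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


print(levenshtein("krishna", "krishna reddy"))  # 6: insert " reddy"
```

Unlike a Jaccard score, this counts edits and is not scaled to the 0 to 1 range, which is one reason it does not directly answer the original question.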

Calcite | Level 5

## Re: String comparison - Jaccard distance

Thanks! I am aware of these Levenshtein distance functions.

I am specifically looking for a way to reproduce the Jaccard example above in SAS.

Calcite | Level 5

## Re: String comparison - Jaccard distance

Thank you!
