I have a Jaccard score from comparing two strings in R to check their similarity/dissimilarity.
I tried to replicate the same in SAS but couldn't achieve it.
Can you please let me know if there is a function or another way to get a Jaccard score in SAS for
comparing the two strings "Krishna" and "Krishna Reddy"?
I tried to replicate it in SAS with PROC DISTANCE, but had no luck.
In R:
library(stringdist)
stringdist('krishna', 'krishna reddy', method='jaccard')
The result is 0.3636.
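/* Split a string into its k-character shingles (n-grams), one per row. */
%macro kshingling
  (string
  ,k=5
  ,out=&sysmacroname.
  )
  ;
  data &out.;
    /* Replace each whitespace character with a plain blank, then strip. */
    string = strip(prxchange('s#\s# #',-1,symget('string')));
    do _n_ = 1 to lengthn(string)-&k.+1;
      ngram = substr(string,_n_,&k.);
      output;
    end;
  run;
%mend;

/* Compare the shingle sets of two strings with PROC DISTANCE. */
%macro jaccard
  (string1
  ,string2
  )
  ;
  %kshingling(&string1.,k=2,out=s1)
  %kshingling(&string2.,k=2,out=s2)

  /* Stack both shingle lists, then count each n-gram per string. */
  proc append base=s1 data=s2; run;
  proc freq data=s1 noprint;
    tables string*ngram / out=s2;
  run;

  /* One row per string, one column per n-gram. */
  proc transpose data=s2 out=s1(drop=_name_ _label_);
    by string notsorted;
    var count;
    id ngram;
  run;

  /* N-grams missing from a string become 0 (absent). */
  proc stdize data=s1 out=s2 missing=0 reponly;
    var _numeric_;
  run;

  proc distance data=s2 method=jaccard absent=0 out=s1;
    var anominal(_numeric_);
    id string;
  run;

  /* Pull the off-diagonal cell for the pair and write it to the log. */
  proc sql noprint;
    select &string1. as jaccard
      into :jaccard
      from s1
      where string="&string2.";
  quit;
  %put jaccard=&jaccard.;
%mend;

%jaccard(krishna,krishna reddy)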
This was put together quickly. It does not match the result from the R package for your example, but it does match most other Jaccard similarity metrics I have used. You can adjust the value of k (the k=2 in the two %kshingling calls) to get different values. I believe setting k=5 will give approximately the result in R (0.333...).
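To see how the shingle size changes the score, here is a minimal sketch of the same set-based calculation outside SAS. It is written in Python purely as a cross-check; the function names are my own and it is not the stringdist or PROC DISTANCE implementation. As far as I know, stringdist uses single characters (q=1) by default and returns a distance, that is, 1 minus the similarity, which would explain the 0.3636 in the question.

```python
def qgrams(s: str, q: int) -> set[str]:
    """Return the set of distinct character q-grams (shingles) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_similarity(a: str, b: str, q: int = 1) -> float:
    """|X & Y| / |X | Y| over the q-gram sets of a and b."""
    x, y = qgrams(a, q), qgrams(b, q)
    union = x | y
    if not union:
        # Both strings are shorter than q: treat them as identical (assumption).
        return 1.0
    return len(x & y) / len(union)


def jaccard_distance(a: str, b: str, q: int = 1) -> float:
    """Jaccard distance, defined here as 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b, q)


if __name__ == "__main__":
    s1, s2 = "krishna", "krishna reddy"
    for q in (1, 2, 5):
        sim = jaccard_similarity(s1, s2, q)
        print(f"q={q}: similarity={sim:.4f} distance={1 - sim:.4f}")
```

With these two strings the sketch gives a distance of 0.3636 for q=1, a similarity of 0.5 for q=2, and a similarity of 0.3333 for q=5, so whether a tool reports similarity or distance matters as much as the choice of k.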
I did not find a quick way to get a Jaccard score, but SAS has two functions related to edit distance, COMPGED and COMPLEV, that may work for your purpose.
data _null_;
length x y $ 50;
x = 'krishna';
y = 'krishna reddy';
compg = compged(x,y);
compl = complev(x,y);
put compg= compl=;
run;
In addition, the CALL COMPCOST routine can be used to assign different weights to the operations used in COMPGED.
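For readers unfamiliar with edit distance, COMPLEV is documented as returning the Levenshtein distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into the other. The following is only a rough sketch of that textbook calculation in Python, not SAS's implementation; the function name is my own, and COMPLEV's handling of options and trailing blanks may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming Levenshtein distance between a and b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep on a match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("krishna", "krishna reddy"))
```

Unlike a Jaccard score, this is an unbounded count that grows with the length difference, so it usually needs to be scaled (for example by the longer string's length) before it can be compared across pairs.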
Thank you!