I computed a Jaccard score in R to measure the similarity/dissimilarity of two strings.
I tried to replicate the same in SAS but couldn't achieve it.
Can you please let me know if there is a function or another way to get the Jaccard score in SAS when
comparing the two strings "Krishna" and "Krishna Reddy"?
I tried PROC DISTANCE, but had no luck.
In R:
library(stringdist)
stringdist('krishna', 'krishna reddy', method='jaccard')
The result is 0.3636.
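/* Split a string into overlapping k-character shingles (n-grams), one per row. */
%macro kshingling
(string
,k=5
,out=&sysmacroname.
)
;
data &out.;
  /* Replace every whitespace character with a blank, then strip leading/trailing blanks. */
  string = strip(prxchange('s#\s# #',-1,symget('string')));
  do _n_ = 1 to lengthn(string)-&k.+1;
    ngram = substr(string,_n_,&k.);
    output;
  end;
run;
%mend;

/* Compare two strings with PROC DISTANCE METHOD=JACCARD on their 2-character shingles. */
%macro jaccard
(string1
,string2
)
;
%kshingling(&string1.,k=2,out=s1)
%kshingling(&string2.,k=2,out=s2)

/* Stack the shingles of both strings into one data set. */
proc append base=s1 data=s2; run;

/* Count how often each shingle occurs in each string. */
proc freq data=s1 noprint;
  tables string*ngram / out=s2;
run;

/* One row per string, one column per shingle. */
proc transpose data=s2 out=s1(drop=_name_ _label_);
  by string notsorted;
  var count;
  id ngram;
run;

/* Replace missing counts with 0 so that 0 marks a shingle the string lacks. */
proc stdize data=s1 out=s2 missing=0 reponly;
  var _numeric_;
run;

/* Treat each shingle column as asymmetric nominal, with 0 meaning absent. */
proc distance data=s2 method=jaccard absent=0 out=s1;
  var anominal(_numeric_);
  id string;
run;

/* Read the cell for the pair: the column named after string1, in the row for string2. */
/* This relies on string1 being a valid SAS variable name. */
proc sql;
  select &string1. as jaccard
  into :jaccard
  from s1
  where string="&string2.";
quit;
%mend;

%jaccard(krishna,krishna reddy)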
I put this together quickly. It does not match the result from the R package for your example, but it does match most other Jaccard similarity metrics I have used. You can adjust the value of k to get different values. I believe setting k=5 will give approximately the result in R (0.333...).
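One likely reason for the mismatch: stringdist's default is q=1, so it compares the sets of distinct single characters (the embedded blank included) and returns a distance, 1 - |intersection| / |union|. For the example that is 7 shared characters out of 11 distinct ones, i.e. 1 - 7/11 = 0.3636. Below is a minimal DATA step sketch of that calculation. It is not the macro above; the variable and array names are my own, and it assumes single-byte characters and at least one non-empty string.

```sas
data _null_;
  length x y $ 50;
  /* One flag per single-byte character code (names are illustrative) */
  array in_x[0:255] _temporary_ (256*0);
  array in_y[0:255] _temporary_ (256*0);

  x = 'krishna';
  y = 'krishna reddy';

  /* Mark every distinct character that occurs in each string;  */
  /* LENGTHN ignores trailing blanks but keeps embedded ones.    */
  do i = 1 to lengthn(x);
    in_x[rank(char(x, i))] = 1;
  end;
  do i = 1 to lengthn(y);
    in_y[rank(char(y, i))] = 1;
  end;

  /* Count characters present in both strings and in either string */
  n_inter = 0;
  n_union = 0;
  do j = 0 to 255;
    n_inter = n_inter + (in_x[j] and in_y[j]);
    n_union = n_union + (in_x[j] or in_y[j]);
  end;

  /* Jaccard distance on character sets (q=1) */
  jaccard_dist = round(1 - n_inter / n_union, 0.0001);
  put n_inter= n_union= jaccard_dist=;
run;
```

For the two strings in the question this should report 7 shared and 11 distinct characters, which corresponds to the 0.3636 from R.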
I can't find a quick way to get a Jaccard score, but SAS has two edit-distance functions, COMPGED and COMPLEV, that may work for your purpose.
data _null_;
length x y $ 50;
x = 'krishna';
y = 'krishna reddy';
compg = compged(x,y);
compl = complev(x,y);
put compg= compl=;
run;
In addition, the CALL COMPCOST routine can be used to assign different costs to the operations that COMPGED uses.
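As a rough sketch of how that might look: the cost values below are arbitrary and chosen only for illustration, and I am assuming the 'insert=' and 'append=' operation keywords as I remember them from the CALL COMPCOST documentation.

```sas
data _null_;
  length x y $ 50;
  x = 'krishna';
  y = 'krishna reddy';

  /* Illustrative weights only: make insertions and appended */
  /* characters cheaper than whatever the defaults are.      */
  call compcost('insert=', 10, 'append=', 5);

  /* COMPGED now uses the costs set above */
  compg_custom = compged(x, y);
  put compg_custom=;
run;
```

Comparing this value with the COMPGED result from the step above shows how much the chosen weights change the score.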
Thank you!