String comparison-Jaccard distance

Occasional Contributor
Posts: 5

I computed a Jaccard score in R to measure the similarity/dissimilarity of two strings.
I tried to replicate this in SAS but could not.
Is there a function or some other way in SAS to get the Jaccard score for
the two strings "Krishna" and "Krishna Reddy"?

I tried PROC DISTANCE in SAS, but with no luck.

In R:
library(stringdist)
stringdist('krishna', 'krishna reddy', method='jaccard')

The result is 0.3636.

 


Accepted Solutions
Solution
11-09-2015 12:57 AM
Trusted Advisor
Posts: 1,301

Re: String comparison-Jaccard distance

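%macro kshingling
(string
,k=5
,out=&sysmacroname.
)
;

/* split STRING into overlapping k-character shingles, one row per shingle */
data &out.;
   /* replace each whitespace character with a plain blank, then strip */
   string = strip(prxchange('s#\s# #',-1,symget('string')));
   do _n_ = 1 to lengthn(string)-&k.+1;
      ngram = substr(string,_n_,&k.);
      output;
   end;
run;

%mend;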

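%macro jaccard
(string1
,string2
)
;

/* shingle both strings; change k here to use a different shingle length */
%kshingling(&string1.,k=2,out=s1)
%kshingling(&string2.,k=2,out=s2)

/* stack the two shingle lists */
proc append base=s1 data=s2; run;

/* count each shingle within each string */
proc freq data=s1 noprint;
   tables string*ngram / out=s2;
run;

/* one row per string, one column per shingle */
proc transpose data=s2 out=s1(drop=_name_ _label_);
   by string notsorted;
   var count;
   id ngram;
run;

/* shingles absent from a string get a count of 0 */
proc stdize data=s1 out=s2 missing=0 reponly;
   var _numeric_;
run;

/* Jaccard coefficient between the two rows */
proc distance data=s2 method=jaccard absent=0 out=s1;
   var anominal(_numeric_);
   id string;
run;

/* print the coefficient and store it in macro variable JACCARD;
   the column reference requires string1 to be a valid SAS name */
proc sql;
select &string1. as jaccard
  into :jaccard
  from s1
 where string="&string2.";
quit;
%mend;

%jaccard(krishna,krishna reddy)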

This was put together quickly.  It does not match the result from the R package for your example, but it does match most other Jaccard similarity metrics I have used.  You can adjust the value of k (the shingle length passed to %kshingling inside %jaccard) to get different values.  I believe setting k=5 will give you approximately the result in R (0.333...).
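
One likely reason for the mismatch: stringdist defaults to q=1, so the R call compares the sets of single characters and returns a distance (1 minus the similarity). For 'krishna' and 'krishna reddy' the two sets share 7 characters out of 11 distinct ones, which gives 1 - 7/11, about 0.3636. Below is a minimal DATA step sketch of that calculation. It assumes single-byte characters and case-sensitive matching, the variable names are made up for illustration, and it is separate from the macros above.

```sas
/* Sketch only: Jaccard distance on single-character sets (q=1). */
data _null_;
   length a b $ 50 c $ 1;
   a = 'krishna';
   b = 'krishna reddy';
   n_inter = 0;   /* characters found in both strings  */
   n_union = 0;   /* characters found in either string */
   do code = 0 to 255;                 /* single-byte characters only */
      c = byte(code);
      in_a = index(trim(a), c) > 0;    /* trim() keeps padding blanks out of the set */
      in_b = index(trim(b), c) > 0;
      n_inter + (in_a and in_b);
      n_union + (in_a or in_b);
   end;
   jaccard_dist = 1 - n_inter / n_union;
   put n_inter= n_union= jaccard_dist= 6.4;
run;
```

Looping over character codes avoids building and merging two data sets, but it only makes sense for q=1; for longer shingles the macro approach above is the more natural fit.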


All Replies
Super User
Posts: 11,343

Re: String comparison-Jaccard distance

I can't find a quick way to get a Jaccard score, but SAS has two edit-distance functions, COMPGED and COMPLEV, that may work for your purpose.

data _null_;
   length x y $ 50;
   x = 'krishna';
   y = 'krishna reddy';
   compg = compged(x,y); 
   compl = complev(x,y);
   put compg= compl=;
run;

The CALL COMPCOST routine can be used to assign different weights to the operations that COMPGED scores.
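
A rough sketch of how that might look, assuming the weights are supplied as operation-name/cost pairs before COMPGED is called. The two costs below are arbitrary values picked for illustration, not recommendations.

```sas
data _null_;
   length x y $ 50;
   x = 'krishna';
   y = 'krishna reddy';
   default_ged = compged(x, y);                /* cost under the default weights */
   /* hypothetical weights: cheaper appended characters and blanks */
   call compcost('append=', 10, 'blank=', 5);
   custom_ged = compged(x, y);                 /* cost under the adjusted weights */
   put default_ged= custom_ged=;
run;
```

Operations that are not named in the call keep their default cost, so only the weights that matter for the comparison need to be listed.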

 

Occasional Contributor
Posts: 5

Re: String comparison-Jaccard distance

Thanks! I am aware of these Levenshtein distance functions.

I am specifically looking for Jaccard, so that I can reproduce the example above in SAS.

Occasional Contributor
Posts: 5

Re: String comparison-Jaccard distance

Thank you!
