BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Krishnam
Calcite | Level 5
I have Jaccard score in comparing two strings to check the similarity/Dissimlarity using R. 
I tried to replicate the same in SAS but couldn't achieve it.
Can you please let me know if there is function/way to get jaccard score in SAS for
comparing two strings "Krishna" and "Krishna Reddy"

I tried to replicate in SAS with proc distance but no luck.

in R
library(stringdist)
stringdist('krishna', 'krishna reddy', method='jaccard')

result is 0.3636

 

1 ACCEPTED SOLUTION

Accepted Solutions
FriedEgg
SAS Employee
%macro kshingling
(string
,k=5
,out=&sysmacroname.
)
;

data &out.;
   string = strip(prxchange('s#\s# #',-1,symget('string')));
   do _n_ = 1 to lengthn(string)-&k.+1;
      ngram = substr(string,_n_,&k.);
	  output;
   end;
run;

%mend;

%macro jaccard
(string1
,string2
)
;

%kshingling(&string1.,k=2,out=s1)
%kshingling(&string2.,k=2,out=s2)

proc append base=s1 data=s2; run;

proc freq data=s1 noprint;
   tables string*ngram / out=s2;
run;

proc transpose data=s2 out=s1(drop=_name_ _label_); 
by string notsorted;
var count;
id ngram;
run;

proc stdize data=s1 out=s2 missing=0 reponly;
var _numeric_;
run;

proc distance data=s2 method=jaccard absent=0 out=s1; 
var anominal(_numeric_);
id string;
run;

proc sql;
select &string1. as jaccard
  into :jaccard
  from s1
 where string="&string2.";
quit;
%mend;

%jaccard(krishna,krishna reddy);run;

This is put together quickly.  It does not match the results from the R package for your example, but it does match most other Jaccard Simmillarity Metrics I have used.  You can adjust the value of k to get different values.  I beleive setting to k=5 will give you approx the result in R (0.333....)

View solution in original post

4 REPLIES 4
ballardw
Super User

I don't find a quick way to get a Jaccard score but SAS has two functions related to edit distance COMPGED and COMPLEV that may work for your purpose.

data _null_;
   length x y $ 50;
   x = 'krishna';
   y = 'krishna reddy';
   compg = compged(x,y); 
   compl = complev(x,y);
   put compg= compl=;
run;

The additional function Call Compcost can be used to assign different weights to operations used in COMPGED.

 

Krishnam
Calcite | Level 5
Thanks! I am aware of these levenshtein distance functions.

I am specifically looking for Jaccard to achieve the mentioned example through SAS.
FriedEgg
SAS Employee
%macro kshingling
(string
,k=5
,out=&sysmacroname.
)
;

data &out.;
   string = strip(prxchange('s#\s# #',-1,symget('string')));
   do _n_ = 1 to lengthn(string)-&k.+1;
      ngram = substr(string,_n_,&k.);
	  output;
   end;
run;

%mend;

%macro jaccard
(string1
,string2
)
;

%kshingling(&string1.,k=2,out=s1)
%kshingling(&string2.,k=2,out=s2)

proc append base=s1 data=s2; run;

proc freq data=s1 noprint;
   tables string*ngram / out=s2;
run;

proc transpose data=s2 out=s1(drop=_name_ _label_); 
by string notsorted;
var count;
id ngram;
run;

proc stdize data=s1 out=s2 missing=0 reponly;
var _numeric_;
run;

proc distance data=s2 method=jaccard absent=0 out=s1; 
var anominal(_numeric_);
id string;
run;

proc sql;
select &string1. as jaccard
  into :jaccard
  from s1
 where string="&string2.";
quit;
%mend;

%jaccard(krishna,krishna reddy);run;

This is put together quickly.  It does not match the results from the R package for your example, but it does match most other Jaccard Simmillarity Metrics I have used.  You can adjust the value of k to get different values.  I beleive setting to k=5 will give you approx the result in R (0.333....)

Krishnam
Calcite | Level 5

Thank you!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 2344 views
  • 0 likes
  • 3 in conversation