BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
BayleyVos
Calcite | Level 5

I'm calculating the similarity between two text values (a text response and a standard dataset of responses).

 

For example, the text responses could be ('palms','east rand mall','sandton city mall').  The standard dataset is ('Palm Centre', 'East Rand Mall','Sandton City',etc...).

 

So I'm calculating the similarity between the two.  Given a certain similarity criteria, it should output the value from the standard dataset).  Just to clarity the text responses can be anything that someone can text - open text.

 

I'm calculating the similarity factor as follows:

---------------------------------------------------------------------------------------------------------------------

 

data new_branch_list (keep = branch);
set branch_list;

cost1 = min(250,spedis(strip(lowcase(branch)),'east rand mall'));
cost2 = min(250,spedis('east rand mall',strip(lowcase(branch))));
cost3 = max(cost1,cost2);

if cost3 > 0 then similarity = (1 - (min(cost1,cost2)/250));
else similarity = 1;

if similarity > 0.9 then output;

run;

 

---------------------------------------------------------------------------------------------------------------------

 

The branch_list mentioned in the code is the standard set of inputs.

In this example I've used the text input of 'east rand mall'.

 

How do I turn this into a function, for which I can input a data set column (with the text values) and then it calculates the correct standard dataset value, based on the similarity factor criteria?

 

I would ideally use this in a proc sql step, by calling the function.

 

(I'm using SAS version 7.13 HF4 (64-bit))

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
RW9
Diamond | Level 26 RW9
Diamond | Level 26

I see where the confusion arises.  The macro code I presented was in answer to;

"I would ideally use this in a proc sql step"

Hence why I presented it using an example proc sql step:

%macro similarity;
  min(250,spedis(strip(lowcase(branch)),'east rand mall')) as cost1,
  min(250,spedis('east rand mall',strip(lowcase(branch)))) as cost2,
  max(caculated cost1,caculated cost2) as cost3,
  ifn(caculated cost3 > 0,(1 - (min(calcualted cost1,calculated cost2)/250)),1) as similarity,
  case when calculated similarity > 0.9 then "Y" else "" end as flag
%mend similarity;

proc sql;
  select %similarity
  from...;
quit;

I did not present a datastep example, for that you would want something like your original code, bu macrotised.  Maybe:

%macro similarity (out=,text=);
  data &out. (keep = branch);
    set branch_list;
    cost1 = min(250,spedis(strip(lowcase(branch)),"&text."));
    cost2 = min(250,spedis("&text.",strip(lowcase(branch))));
    cost3 = max(cost1,cost2);
    if cost3 > 0 then similarity = (1 - (min(cost1,cost2)/250));
    else similarity = 1;
    if similarity > 0.9 then output;
  run;
%mend similarity;

%similarity (out=test1,text=east rand mall);
%similarity (out=test2,text=marua mall);
...

 Datastep and SQL syntax's are very different, so what works for one doesn't necessarily work for the other, and there are many ways to implement any problem.

View solution in original post

11 REPLIES 11
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Personally I would avoid "functions".  You can macro-tise that code pretty easily.  You could also change it so that it works in SQL, however that is different syntax to the one presented below, your macro call needs to generate valid SQL syntax.  So something like:

 

 

%macro similarity;
  min(250,spedis(strip(lowcase(branch)),'east rand mall')) as cost1,
  min(250,spedis('east rand mall',strip(lowcase(branch)))) as cost2,
  max(caculated cost1,caculated cost2) as cost3,
  ifn(caculated cost3 > 0,(1 - (min(calcualted cost1,calculated cost2)/250)),1) as similarity,
  case when calculated similarity > 0.9 then "Y" else "" end as flag
%mend similarity;

proc sql;
  select %similarity
  from...;
quit;
BayleyVos
Calcite | Level 5

Thank you so much...

 

But how do I work in the data set that is needed for the calculations?  Because I do not see it mentioned in the macro.

That is the 'branch-list' table that I mentioned in my original post?

 

Or do I not need to specify it?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

You would need to supply test data in the form of a datastep:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

What you want out from it, and code you want to use the macro in.  I can't tell you from the post you have made, that is why I just gave you an example of how to use macro to create code which would work in SQL.  

BayleyVos
Calcite | Level 5

Thanks again.

 

I ran the following code to test it out, but I'm getting a few errors:

That I do not understand.  Can you perhaps help? (It doesn't use any data that you would need).

 

I would really appreciate it 🙂

 

Thanks again.

 

%macro similarity(act_branch,inp_branch);
	min(250,spedis(strip(lowcase(act_branch)),inp_branch)) as cost1,
	min(250,spedis(inp_branch,strip(lowcase(act_branch)))) as cost2,
	max(caculated cost1,caculated cost2) as cost3,
	ifn(caculated cost3 > 0,(1 - (min(calcualted cost1,calculated cost2)/250)),1) as similarity,
	case when calculated similarity > 0.9 then true else false end as flag
%mend similarity;

data test;
similarity = %similarity('Maerua Mall','marua mall');
run;
RW9
Diamond | Level 26 RW9
Diamond | Level 26

I see where the confusion arises.  The macro code I presented was in answer to;

"I would ideally use this in a proc sql step"

Hence why I presented it using an example proc sql step:

%macro similarity;
  min(250,spedis(strip(lowcase(branch)),'east rand mall')) as cost1,
  min(250,spedis('east rand mall',strip(lowcase(branch)))) as cost2,
  max(caculated cost1,caculated cost2) as cost3,
  ifn(caculated cost3 > 0,(1 - (min(calcualted cost1,calculated cost2)/250)),1) as similarity,
  case when calculated similarity > 0.9 then "Y" else "" end as flag
%mend similarity;

proc sql;
  select %similarity
  from...;
quit;

I did not present a datastep example, for that you would want something like your original code, bu macrotised.  Maybe:

%macro similarity (out=,text=);
  data &out. (keep = branch);
    set branch_list;
    cost1 = min(250,spedis(strip(lowcase(branch)),"&text."));
    cost2 = min(250,spedis("&text.",strip(lowcase(branch))));
    cost3 = max(cost1,cost2);
    if cost3 > 0 then similarity = (1 - (min(cost1,cost2)/250));
    else similarity = 1;
    if similarity > 0.9 then output;
  run;
%mend similarity;

%similarity (out=test1,text=east rand mall);
%similarity (out=test2,text=marua mall);
...

 Datastep and SQL syntax's are very different, so what works for one doesn't necessarily work for the other, and there are many ways to implement any problem.

BayleyVos
Calcite | Level 5

Yes, sorry.  I think I'm not explaining my problem sufficiently well - I'm very new to SAS (so please bear with me for a while).

 

Thank you for your input! It has greatly helped me!

BayleyVos
Calcite | Level 5

I used this code (you provided - thank you) to construct the macro:

 

%macro similarity (out=,text=);
  data &out. (keep = branch);
    set branch_list;
    cost1 = min(250,spedis(strip(lowcase(branch)),"&text."));
    cost2 = min(250,spedis("&text.",strip(lowcase(branch))));
    cost3 = max(cost1,cost2);
    if cost3 > 0 then similarity = (1 - (min(cost1,cost2)/250));
    else similarity = 1;
    if similarity > 0.9 then output;
  run;
%mend similarity;

Now, I want to use this macro in the following way:

 

proc sql;
	select	*,
		%similarity(test1, a.message) as branch_der
	from	na_pacc_responses as a;
quit;

So I'm passing the message column from the na_pass_responses table through the similarity macro - to obtain the output.  However, it giving me the following error:

 

ERROR:  More positional parameters found than defined.

 

I'm really sorry! But I desperately need help!

What am I doing wrong that it doesn't work?

 

Thanks again for all your help!!!

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Sorry, I am not going over it again, please re-read my posts above.  SQL syntax is not the same as datastep code.  Your macro needs to resolve to syntactically correct code - all macro is is a find replace.  Your code resolves to:

proc sql;
select *,
data test1 (keep = branch);
set branch_list;
cost1 = min(250,spedis(strip(lowcase(branch)),"&text."));
cost2 = min(250,spedis("a.message",strip(lowcase(branch))));
cost3 = max(cost1,cost2);
if cost3 > 0 then similarity = (1 - (min(cost1,cost2)/250));
else similarity = 1;
if similarity > 0.9 then output;
run;
as branch_der
from na_pacc_responses as a;
quit;

 Which is clearly wrong.

BayleyVos
Calcite | Level 5

I am really sorry.  I'm just very new to SAS - so I my understanding is lacking, which I understand must be frustrating to you.  I do apologise.

 

Thank you anyway for your time and effort.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

No worries.  You have the two different examples, one datastep and one sql.  Have a go and come back with (start a new thread) with any questions.

PGStats
Opal | Level 21

Have you considered using function COMPLEV instead? It is faster and symmetric. It also provides a limit to the distance that it will evaluate and useful modifiers.

PG

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 1478 views
  • 0 likes
  • 3 in conversation