DATA Step, Macro, Functions and more

Unusual String Comparison Problem

Accepted Solution Solved
Reply
Trusted Advisor
Posts: 1,934
Accepted Solution

Unusual String Comparison Problem

Suppose I have a character string in SAS variable named COL1, and other character strings in a SAS variable named COL2.

If I wanted to, for example, match the first five characters of COL1 with the first five characters of COL2, this is easy to do in a datastep or SQL. But my problem today is that I want to match the string in COL1 with the string in COL2 that has the most characters matching (and the match must be consecutive, and begin in position 1 of the string)

So, for example

COL1                          COL2

AA999AAB                      AA999

AA999ABB                      AA999A

                              AA999B

                              AA999AB

The string in COL2 that has the longest match to AA999AAB in COL1 is AA999A (i.e. AA999 in COL2 is not the match I want)

The string in COL2 that has the longest match to AA999ABB in COL1 is AA999AB (i.e. AA999 and AA999A is not the match I want)

and so on

How can I do this? I have written a very convoluted looping program to get this done, but I'm hoping there is a smarter and more efficient way, using any of: data step, SQL, regular expressions, IML, whatever.

Thanks!


Accepted Solutions
Solution
‎11-13-2014 12:05 PM
Respected Advisor
Posts: 3,156

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:

data a;

     input COL1  : $20.;

     cards;

AA999AAB 

AA999ABB

AA999A

;

run;

data b;

     input COL2  : $20.;

     cards;

AA999

AA999A

AA999B

AA999AB

;

run;

proc sql;

     create table want as

           select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2

                from a;

quit;

Good luck,

Haikuo

View solution in original post


All Replies
Super User
Super User
Posts: 7,997

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

Well, there is the soundex function which codes a word up.  Not sure how close though.  Am also thinking of arrays but puzzling over a simple technique.

SAS Super FREQ
Posts: 3,756

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

I don't have an immediate answer, but SAS has some "distance between words" functions that might help. I discuss them in this article:

http://blogs.sas.com/content/iml/2011/10/05/distances-between-words/


In the article I used the SPEDIS function, which is apparently not what you want. You want a function that "rewards" exact matches at the beginning of a word.  Nevertheless, here is some code that shows how to use the SPEDIS function. Perhaps it will give you a place to start. If none of the functions in the article perform the match you want, then you will have to write the function "cost function" yourself.


data A;
input COL1 $12.;
datalines;
AA999AAB
AA999ABB
;
data B;
input COL2 $12.;
datalines;
AA999
AA999A
AA999B
AA999AB
;

proc iml;

use A; read all var {col1} into x; close A;
use B; read all var {col2} into y; close B;


cost = j(nrow(x), nrow(y),.);
do i = 1 to nrow(x);
   word = x;
   cost[i,] = spedis(y`, word);
end;
print cost[rowname=x colname=y];

Super User
Posts: 10,046

Re: Unusual String Comparison Problem

Rick,

OP asked the first N character matched , not the minimize spell distance . COMPARE() might be a good choice.

Xia Keshan

Super User
Posts: 10,046

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

How about this one :


data a;
input COL1  : $20.;
cards;    
AA999AAB  
AA999ABB 
;
run; 
                             

data b;
input COL2  : $20.;
cards;  
AA999
AA999A
AA999B
AA999AB
;
run; 
proc sql;
create table want as
 select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il'));
quit;     

Xia Keshan

Trusted Advisor
Posts: 1,934

Re: Unusual String Comparison Problem

DING DING DING DING!!

Looks like has the answer! That's amazing in its simplicity. Thank you!

Super User
Posts: 10,046

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

One thing need to remind : Is the length of COL1 is always greater than the length of COL2 ? otherwise ,you need to add abs() around it .

Trusted Advisor
Posts: 1,934

Re: Unusual String Comparison Problem

Ok, I was just thinking that COL1 length could be the same as COL2 length, but your statement "you need to add abs() around it" is somewhat vague ... add abs() around what?

Specifically, if COL1 also contains AA999A, I would want it to match AA999A in COL2, and because COMPARE returns a 0 when the strings are identical, the SQL needs to be modified for this case, but specifically how?

Super User
Posts: 10,046

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

OK.Check this example . the output is what you need ?

I have to go to sleep now. Bye.

data a;

input COL1  : $20.;

cards;  

AA999AAB

AA999ABB

;

run;

                           

data b;

input COL2  : $20.;

cards;

AA999

AA999A

AA999B

AA999ABASD

;

run;

proc sql;

create table want as

select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il'));

quit;   

Message was edited by: xia keshan

Trusted Advisor
Posts: 1,934

Re: Unusual String Comparison Problem

No that's the wrong result, AA999ABB in COL1 should match AA999A in COL2 in this case. COL2 will be shorter than or the same length as COL1, I don't think in my situation COL2 can be longer than COL1.

But specifically, if the code is

data a;
input COL1  : $20.;
cards;   
AA999AAB 
AA999ABB
AA999A
;
run;
                           

data b;
input COL2  : $20.;
cards; 
AA999
AA999A
AA999B
AA999AB
;
run;

Note: I have added AA999A to COL1, then the result I want is AA999AAB and AA999ABB are matched properly by your first code example, but AA999A in COL1 is matched to AA999 in COL2 using your first code example, and I want it matched to AA999A in COL2.

Trusted Advisor
Posts: 1,934

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

Okay, I think I got it. The condition that must be included into the solution is that the length(COL1)>=length(COL2). This is easy to add in.

What is harder to adjust for is that if the two strings match (e.g. if COL1='AA999A' and COL2='AA999A'), then the COMPARE function gives us a zero as an answer, and so the MAX function in 's solution doesn't handle this case.

So, we create a function of the COL1 and COL2 values that turns the zero into a large number.

proc sql;

create table want as

select * from a,b group by col1 having

((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999) =

max(((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999))

and length(col1)>=length(col2);

quit;

So zeros get turned into 999 while the other results from COMPARE remain unchanged.

Here is the correct data set WANT:

Obs      COL1       COL2

1     AA999A      AA999A
2     AA999AAB    AA999A
3     AA999ABB    AA999AB

Anyone got a simpler solution?

Super User
Posts: 10,046

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

As long as you make the length of col1 always greater or equal than the length of col2, and want include case they are same.

data a;
input COL1  : $20.;
cards;    
AA999AAB  
AA999ABB 
AA999A
;
run; 
                           
data b;
input COL2  : $20.;
cards;  
AA999
AA999A
AA999B
AA999AB
;
run;
proc sql;
create table want as
select *,case when compare(col1,col2,'il') eq 0 then 999999
              when compare(col1,col2,'il') gt 0 then   compare(col1,col2,'il')
                 end as dis
from a,b group by col1 having calculated dis=max(calculated dis) ;
quit;   

Xia Keshan

PROC Star
Posts: 7,492

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

Paige,

I liked 's solution, but it didn't work on the extra test cases I added. The compare function returns a 0 if there is an exact match, otherwise it returns the first character that is different.  As such, here is my proposed solution:

data have;

  input (COL1  COL2) ($);

  cards;

AA999AAB                      AA999

AA999ABB                      AA999A

BB999AAB                      AA999B

.                             AA999AA

.                             AA999AB

.                             AA999ABB

;

data col1;

  set have (keep=col1 where=(not missing(col1)));

run;

data col2;

  set have (keep=col2);

run;

proc sql;

  create table want as

    select *,

      case compare(col1,col2,'il')

        when 0 then length(col1)

        else compare(col1,col2,'il')-1

      end as length

        from col1,col2

          group by col1

            having (calculated length>0) and

                    calculated length=max(calculated length)

  ;

quit;

Solution
‎11-13-2014 12:05 PM
Respected Advisor
Posts: 3,156

Re: Unusual String Comparison Problem

Posted in reply to PaigeMiller

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:

data a;

     input COL1  : $20.;

     cards;

AA999AAB 

AA999ABB

AA999A

;

run;

data b;

     input COL2  : $20.;

     cards;

AA999

AA999A

AA999B

AA999AB

;

run;

proc sql;

     create table want as

           select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2

                from a;

quit;

Good luck,

Haikuo

Trusted Advisor
Posts: 1,934

Re: Unusual String Comparison Problem

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched?

Here are the requirements. The desired output is when COL1 and COL2 have the same characters in positions 1 through n, where n is the maximum value of all comparisons of each COL1 with all COL2; and length(COL1)>=length(COL2)

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 27 replies
  • 855 views
  • 3 likes
  • 8 in conversation