Solved: Re: Unusual String Comparison Problem

PaigeMiller · Posted 11-13-2014 09:29 AM

Suppose I have a character string in SAS variable named COL1, and other character strings in a SAS variable named COL2.

If I wanted to, for example, match the first five characters of COL1 with the first five characters of COL2, this is easy to do in a datastep or SQL. But my problem today is that I want to match the string in COL1 with the string in COL2 that has the most characters matching (and the match must be consecutive, and begin in position 1 of the string)

So, for example

COL1 COL2

AA999AAB AA999

AA999ABB AA999A

AA999B

AA999AB

The string in COL2 that has the longest match to AA999AAB in COL1 is AA999A (i.e. AA999 in COL2 is not the match I want)

The string in COL2 that has the longest match to AA999ABB in COL1 is AA999AB (i.e. AA999 and AA999A is not the match I want)

and so on

How can I do this? I have written a very convoluted looping program to get this done, but I'm hoping there is a smarter and more efficient way, using any of: data step, SQL, regular expressions, IML, whatever.

Thanks!

--
Paige Miller

Haikuo · Posted 11-13-2014 12:05 PM

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:

data a;

input COL1 : $20.;

cards;

AA999AAB

AA999ABB

AA999A

;

run;

data b;

input COL2 : $20.;

cards;

AA999

AA999A

AA999B

AA999AB

;

run;

proc sql;

create table want as

select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2

from a;

quit;

Good luck,

Haikuo

View solution in original post

RW9 · Posted 11-13-2014 09:56 AM

Well, there is the soundex function which codes a word up. Not sure how close though. Am also thinking of arrays but puzzling over a simple technique.

Rick_SAS · Posted 11-13-2014 10:00 AM

I don't have an immediate answer, but SAS has some "distance between words" functions that might help. I discuss them in this article:

http://blogs.sas.com/content/iml/2011/10/05/distances-between-words/

In the article I used the SPEDIS function, which is apparently not what you want. You want a function that "rewards" exact matches at the beginning of a word. Nevertheless, here is some code that shows how to use the SPEDIS function. Perhaps it will give you a place to start. If none of the functions in the article perform the match you want, then you will have to write the function "cost function" yourself.

data A;
input COL1 $12.;
datalines;
AA999AAB
AA999ABB
;
data B;
input COL2 $12.;
datalines;
AA999
AA999A
AA999B
AA999AB
;

proc iml;

use A; read all var {col1} into x; close A;
use B; read all var {col2} into y; close B;

cost = j(nrow(x), nrow(y),.);
do i = 1 to nrow(x);
word = x;
cost[i,] = spedis(y`, word);
end;
print cost[rowname=x colname=y];

Ksharp · Posted 11-13-2014 10:07 AM

Rick,

OP asked the first N character matched , not the minimize spell distance . COMPARE() might be a good choice.

Xia Keshan

Ksharp · Posted 11-13-2014 10:02 AM

How about this one :


data a;
input COL1  : $20.;
cards;    
AA999AAB  
AA999ABB 
;
run; 
                             

data b;
input COL2  : $20.;
cards;  
AA999
AA999A
AA999B
AA999AB
;
run; 
proc sql;
create table want as
 select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il'));
quit;

Xia Keshan

PaigeMiller · Posted 11-13-2014 10:07 AM

DING DING DING DING!!

Looks like has the answer! That's amazing in its simplicity. Thank you!

--
Paige Miller

Ksharp · Posted 11-13-2014 10:17 AM

One thing need to remind : Is the length of COL1 is always greater than the length of COL2 ? otherwise ,you need to add abs() around it .

PaigeMiller · Posted 11-13-2014 10:23 AM

Ok, I was just thinking that COL1 length could be the same as COL2 length, but your statement "you need to add abs() around it" is somewhat vague ... add abs() around what?

Specifically, if COL1 also contains AA999A, I would want it to match AA999A in COL2, and because COMPARE returns a 0 when the strings are identical, the SQL needs to be modified for this case, but specifically how?

--
Paige Miller

Ksharp · Posted 11-13-2014 10:36 AM

OK.Check this example . the output is what you need ?

I have to go to sleep now. Bye.

data a;

input COL1 : $20.;

cards;

AA999AAB

AA999ABB

;

run;

data b;

input COL2 : $20.;

cards;

AA999

AA999A

AA999B

AA999ABASD

;

run;

proc sql;

create table want as

select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il'));

quit;

Message was edited by: xia keshan

PaigeMiller · Posted 11-13-2014 10:44 AM

No that's the wrong result, AA999ABB in COL1 should match AA999A in COL2 in this case. COL2 will be shorter than or the same length as COL1, I don't think in my situation COL2 can be longer than COL1.

But specifically, if the code is

data a;
input COL1 : $20.;
cards;
AA999AAB
AA999ABB
AA999A
;
run;

data b;
input COL2 : $20.;
cards;
AA999
AA999A
AA999B
AA999AB
;
run;

Note: I have added AA999A to COL1, then the result I want is AA999AAB and AA999ABB are matched properly by your first code example, but AA999A in COL1 is matched to AA999 in COL2 using your first code example, and I want it matched to AA999A in COL2.

--
Paige Miller

PaigeMiller · Posted 11-13-2014 11:40 AM

Okay, I think I got it. The condition that must be included into the solution is that the length(COL1)>=length(COL2). This is easy to add in.

What is harder to adjust for is that if the two strings match (e.g. if COL1='AA999A' and COL2='AA999A'), then the COMPARE function gives us a zero as an answer, and so the MAX function in 's solution doesn't handle this case.

So, we create a function of the COL1 and COL2 values that turns the zero into a large number.

proc sql;

create table want as

select * from a,b group by col1 having

((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999) =

max(((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999))

and length(col1)>=length(col2);

quit;

So zeros get turned into 999 while the other results from COMPARE remain unchanged.

Here is the correct data set WANT:

Obs COL1 COL2

1     AA999A      AA999A
2     AA999AAB    AA999A
3     AA999ABB    AA999AB

Anyone got a simpler solution?

--
Paige Miller

Ksharp · Posted 11-14-2014 04:27 AM

As long as you make the length of col1 always greater or equal than the length of col2, and want include case they are same.

data a;
input COL1  : $20.;
cards;    
AA999AAB  
AA999ABB 
AA999A
;
run; 
                           
data b;
input COL2  : $20.;
cards;  
AA999
AA999A
AA999B
AA999AB
;
run;
proc sql;
create table want as
select *,case when compare(col1,col2,'il') eq 0 then 999999
              when compare(col1,col2,'il') gt 0 then   compare(col1,col2,'il')
                 end as dis
from a,b group by col1 having calculated dis=max(calculated dis) ;
quit;

Xia Keshan

art297 · Posted 11-13-2014 12:02 PM

Paige,

I liked 's solution, but it didn't work on the extra test cases I added. The compare function returns a 0 if there is an exact match, otherwise it returns the first character that is different. As such, here is my proposed solution:

data have;

input (COL1 COL2) ($);

cards;

AA999AAB AA999

AA999ABB AA999A

BB999AAB AA999B

. AA999AA

. AA999AB

. AA999ABB

;

data col1;

set have (keep=col1 where=(not missing(col1)));

run;

data col2;

set have (keep=col2);

run;

proc sql;

create table want as

select *,

case compare(col1,col2,'il')

when 0 then length(col1)

else compare(col1,col2,'il')-1

end as length

from col1,col2

group by col1

having (calculated length>0) and

calculated length=max(calculated length)

;

quit;

Haikuo · Posted 11-13-2014 12:05 PM

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:

data a;

input COL1 : $20.;

cards;

AA999AAB

AA999ABB

AA999A

;

run;

data b;

input COL2 : $20.;

cards;

AA999

AA999A

AA999B

AA999AB

;

run;

proc sql;

create table want as

select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2

from a;

quit;

Good luck,

Haikuo

PaigeMiller · Posted 11-13-2014 12:23 PM

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched?

Here are the requirements. The desired output is when COL1 and COL2 have the same characters in positions 1 through n, where n is the maximum value of all comparisons of each COL1 with all COL2; and length(COL1)>=length(COL2)

--
Paige Miller

Registration is open

SAS Training: Just a Click Away