Solved: Unusual String Comparison Problem - Page 2

Haikuo · Posted 11-13-2014 12:42 PM

Paige,

Forgive me if I overlooked your comments, I am aware of that the length of COL2 is less or equal to COL1, but does COL2 have to be inclusive to COL1? For instance, consider the following examples:

for COL1=AA999AAB,

COL2 of AA999 is a better match or AA999AB? notice the latter has the bigger n? if it is the latter, then what is your rule of dealing ties? eg. AA999AB or AA999AC?

Thanks,

Haikuo

PaigeMiller · Posted 11-13-2014 12:51 PM

Hai.kuo wrote:

for COL1=AA999AAB,

COL2 of AA999 is a better match or AA999AB?

For the case above, COL2 of AA999 is the match I want, as the requirement seems to have been incompletely stated. When you match COL1 and COL2, there cannot be any mismatches, positions 1 thru length(COL2) must match, and I want the longest COL2 which has this property. If you match COL1=AA999AAB with COL2=AA999AB, there is a mismatch in postion 7, so it is disqualified.

--
Paige Miller

Haikuo · Posted 11-13-2014 12:54 PM

In that case, will my code suffice your need?

Haikuo

PaigeMiller · Posted 11-13-2014 01:00 PM

Yes , it seems to work as well (although of the 4 solutions that have now been provided, I understand three and I'm not really sure I see why your code works, but perhaps that just a matter of studying it further).

Anyway, thanks to everyone for helping me on this and providing some interesting learnings.

--
Paige Miller

Astounding · Posted 11-13-2014 12:23 PM

Lack of SQL skills would force me to take a totally different approach.

Step 1: Create a format that translates from the value of COL2 to the observation number holding that value.

Step 2: Loop through COL1 (all characters down to 1 character) until you find a match.

For example, hard-coding the format just for readability step 1 would look like this:

proc format;
value $recno

'AA999'='1'

'AA999A'='2'

'AA999B'='3'

'AA999AB'='4'

other='Not Found';

run;

In practice, I'm sure you would want to switch to a CNTLIN= data set.

Then step 2 uses the format:

data want;

set col1_source;

if COL1 > ' ' then do _n_=length(COL1) to 1 by -1 until (result ne 'Not Found');

result = put(substr(COL1, 1, _n_), $recno.);

end;

if result ne 'Not Found' then do;

recno = input(result, 8.);

set col2_dataset point=recno;

end;

run;

It's sophisticated, but not horribly complex.

Good luck.

PaigeMiller · Posted 11-13-2014 12:25 PM

Yes, . I originally had a loop solution, which does work, but in any event, the reason for my question was to see if simpler methods are available (and also hoping that I would learn something, which has happened!)

--
Paige Miller

art297 · Posted 11-13-2014 12:34 PM

Paige,

Did you try the code I had suggested? It gives the same results as your code but, additionally, excludes cases where there is no match.

Art

PaigeMiller · Posted 11-13-2014 12:40 PM

Seem to work, , thank you!

--
Paige Miller

jakarman · Posted 11-14-2014 05:44 AM

If you wanted to go for all possible solutions. I missed:

- hashing

- point= usage with two datasets going around independly

- ?

---->-- ja karman --<-----

Ksharp · Posted 11-15-2014 01:22 AM

Hi Paige,

Sorry . I found a problem with compare() . it return the first different character between two variable,no matter it is a blank or other character. Check the following the example , it is the wrong result.

data a;

input COL1 : $20.;

cards;

AA999AAB

AA999ABB

AA999A

;

run;

data b;

input COL2 : $20.;

cards;

AA999

AA999AA

AA999B

;

run;

proc sql;

create table want as

select *,case when compare(col1,col2,'il') eq 0 then 999999

when compare(col1,col2,'il') gt 0 then compare(col1,col2,'il')

end as dis

from a,b group by col1 having calculated dis=max(calculated dis) ;

quit;

And I think HaiKuo get the right idea. Use him code or Use the following code, they are the almost same.

data a;

input COL1 : $20.;

cards;

AA999AAB

AA999ABB

AA999AA

;

run;

data b;

input COL2 : $20.;

cards;

AA999

AA999AA

AA999B

;

run;

proc sql;

create table want as

select *

from a,b

where compare(strip(col1),strip(col2),'il:')=0

group by col1

having length(col2)=max(length(col2)) ;

quit;

Xia Keshan

Message was edited by: xia keshan

Ksharp · Posted 11-15-2014 10:09 PM

Hi Paige,

I figured out another way to get it ,if your table is big.

data a;
input COL1  : $20.;
cards;   
AA999AAB 
AA999ABB
AA999AA
BA
;
run;
                          
data b;
input COL2  : $20.;
cards; 
AA999
AA999AA
AA999B
;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set b;
  declare hash h(dataset:'b');
  h.definekey('col2');
  h.definedone();
end;
set a;
do i=length(col1) to 1 by -1;
 col2=substr(col1,1,i);
 if h.check()=0 then leave;
 call missing(col2);
end;
run;

Xia Keshan

Message was edited by: xia keshan

Haikuo · Posted 11-15-2014 10:24 PM

very nice. but lurking SAS forum on a Sunday morning, you are sick.:smileysilly: Okay, guess l shouldn't say that, as Saturday night is no better in this regard, so both of us need to get a life. LOL.

Ksharp · Posted 11-15-2014 10:45 PM

It doesn't matter . I just overlooked the complication of problem . It is noon in China now. Anyway, you did a good job, HaiKuo.

Xia Keshan

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

Re: Unusual String Comparison Problem

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away