BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
PaigeMiller
Diamond | Level 26

Suppose I have a character string in SAS variable named COL1, and other character strings in a SAS variable named COL2.

If I wanted to, for example, match the first five characters of COL1 with the first five characters of COL2, this is easy to do in a datastep or SQL. But my problem today is that I want to match the string in COL1 with the string in COL2 that has the most characters matching (and the match must be consecutive, and begin in position 1 of the string)

So, for example

COL1                          COL2

AA999AAB                      AA999

AA999ABB                      AA999A

                              AA999B

                              AA999AB

The string in COL2 that has the longest match to AA999AAB in COL1 is AA999A (i.e. AA999 in COL2 is not the match I want)

The string in COL2 that has the longest match to AA999ABB in COL1 is AA999AB (i.e. AA999 and AA999A is not the match I want)

and so on

How can I do this? I have written a very convoluted looping program to get this done, but I'm hoping there is a smarter and more efficient way, using any of: data step, SQL, regular expressions, IML, whatever.

Thanks!

--
Paige Miller
1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:

data a;

     input COL1  : $20.;

     cards;

AA999AAB 

AA999ABB

AA999A

;

run;

data b;

     input COL2  : $20.;

     cards;

AA999

AA999A

AA999B

AA999AB

;

run;

proc sql;

     create table want as

           select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2

                from a;

quit;

Good luck,

Haikuo

View solution in original post

27 REPLIES 27
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, there is the soundex function which codes a word up.  Not sure how close though.  Am also thinking of arrays but puzzling over a simple technique.

Rick_SAS
SAS Super FREQ

I don't have an immediate answer, but SAS has some "distance between words" functions that might help. I discuss them in this article:

http://blogs.sas.com/content/iml/2011/10/05/distances-between-words/


In the article I used the SPEDIS function, which is apparently not what you want. You want a function that "rewards" exact matches at the beginning of a word.  Nevertheless, here is some code that shows how to use the SPEDIS function. Perhaps it will give you a place to start. If none of the functions in the article perform the match you want, then you will have to write the function "cost function" yourself.


data A;
input COL1 $12.;
datalines;
AA999AAB
AA999ABB
;
data B;
input COL2 $12.;
datalines;
AA999
AA999A
AA999B
AA999AB
;

proc iml;

use A; read all var {col1} into x; close A;
use B; read all var {col2} into y; close B;


cost = j(nrow(x), nrow(y),.);
do i = 1 to nrow(x);
   word = x;
   cost[i,] = spedis(y`, word);
end;
print cost[rowname=x colname=y];

Ksharp
Super User

Rick,

OP asked the first N character matched , not the minimize spell distance . COMPARE() might be a good choice.

Xia Keshan

Ksharp
Super User

How about this one :


data a;
input COL1  : $20.;
cards;    
AA999AAB  
AA999ABB 
;
run; 
                             

data b;
input COL2  : $20.;
cards;  
AA999
AA999A
AA999B
AA999AB
;
run; 
proc sql;
create table want as
 select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il'));
quit;     

Xia Keshan

PaigeMiller
Diamond | Level 26

DING DING DING DING!!

Looks like has the answer! That's amazing in its simplicity. Thank you!

--
Paige Miller
Ksharp
Super User

One thing need to remind : Is the length of COL1 is always greater than the length of COL2 ? otherwise ,you need to add abs() around it .

PaigeMiller
Diamond | Level 26

Ok, I was just thinking that COL1 length could be the same as COL2 length, but your statement "you need to add abs() around it" is somewhat vague ... add abs() around what?

Specifically, if COL1 also contains AA999A, I would want it to match AA999A in COL2, and because COMPARE returns a 0 when the strings are identical, the SQL needs to be modified for this case, but specifically how?

--
Paige Miller
Ksharp
Super User

OK.Check this example . the output is what you need ?

I have to go to sleep now. Bye.

data a;

input COL1  : $20.;

cards;  

AA999AAB

AA999ABB

;

run;

                           

data b;

input COL2  : $20.;

cards;

AA999

AA999A

AA999B

AA999ABASD

;

run;

proc sql;

create table want as

select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il'));

quit;   

Message was edited by: xia keshan

PaigeMiller
Diamond | Level 26

No that's the wrong result, AA999ABB in COL1 should match AA999A in COL2 in this case. COL2 will be shorter than or the same length as COL1, I don't think in my situation COL2 can be longer than COL1.

But specifically, if the code is

data a;
input COL1  : $20.;
cards;   
AA999AAB 
AA999ABB
AA999A
;
run;
                           

data b;
input COL2  : $20.;
cards; 
AA999
AA999A
AA999B
AA999AB
;
run;

Note: I have added AA999A to COL1, then the result I want is AA999AAB and AA999ABB are matched properly by your first code example, but AA999A in COL1 is matched to AA999 in COL2 using your first code example, and I want it matched to AA999A in COL2.

--
Paige Miller
PaigeMiller
Diamond | Level 26

Okay, I think I got it. The condition that must be included into the solution is that the length(COL1)>=length(COL2). This is easy to add in.

What is harder to adjust for is that if the two strings match (e.g. if COL1='AA999A' and COL2='AA999A'), then the COMPARE function gives us a zero as an answer, and so the MAX function in 's solution doesn't handle this case.

So, we create a function of the COL1 and COL2 values that turns the zero into a large number.

proc sql;

create table want as

select * from a,b group by col1 having

((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999) =

max(((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999))

and length(col1)>=length(col2);

quit;

So zeros get turned into 999 while the other results from COMPARE remain unchanged.

Here is the correct data set WANT:

Obs      COL1       COL2

1     AA999A      AA999A
2     AA999AAB    AA999A
3     AA999ABB    AA999AB

Anyone got a simpler solution?

--
Paige Miller
Ksharp
Super User

As long as you make the length of col1 always greater or equal than the length of col2, and want include case they are same.

data a;
input COL1  : $20.;
cards;    
AA999AAB  
AA999ABB 
AA999A
;
run; 
                           
data b;
input COL2  : $20.;
cards;  
AA999
AA999A
AA999B
AA999AB
;
run;
proc sql;
create table want as
select *,case when compare(col1,col2,'il') eq 0 then 999999
              when compare(col1,col2,'il') gt 0 then   compare(col1,col2,'il')
                 end as dis
from a,b group by col1 having calculated dis=max(calculated dis) ;
quit;   

Xia Keshan

art297
Opal | Level 21

Paige,

I liked 's solution, but it didn't work on the extra test cases I added. The compare function returns a 0 if there is an exact match, otherwise it returns the first character that is different.  As such, here is my proposed solution:

data have;

  input (COL1  COL2) ($);

  cards;

AA999AAB                      AA999

AA999ABB                      AA999A

BB999AAB                      AA999B

.                             AA999AA

.                             AA999AB

.                             AA999ABB

;

data col1;

  set have (keep=col1 where=(not missing(col1)));

run;

data col2;

  set have (keep=col2);

run;

proc sql;

  create table want as

    select *,

      case compare(col1,col2,'il')

        when 0 then length(col1)

        else compare(col1,col2,'il')-1

      end as length

        from col1,col2

          group by col1

            having (calculated length>0) and

                    calculated length=max(calculated length)

  ;

quit;

Haikuo
Onyx | Level 15

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:

data a;

     input COL1  : $20.;

     cards;

AA999AAB 

AA999ABB

AA999A

;

run;

data b;

     input COL2  : $20.;

     cards;

AA999

AA999A

AA999B

AA999AB

;

run;

proc sql;

     create table want as

           select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2

                from a;

quit;

Good luck,

Haikuo

PaigeMiller
Diamond | Level 26

Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched?

Here are the requirements. The desired output is when COL1 and COL2 have the same characters in positions 1 through n, where n is the maximum value of all comparisons of each COL1 with all COL2; and length(COL1)>=length(COL2)

--
Paige Miller

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 27 replies
  • 4532 views
  • 3 likes
  • 8 in conversation