Suppose I have a character string in SAS variable named COL1, and other character strings in a SAS variable named COL2.
If I wanted to, for example, match the first five characters of COL1 with the first five characters of COL2, this is easy to do in a datastep or SQL. But my problem today is that I want to match the string in COL1 with the string in COL2 that has the most characters matching (and the match must be consecutive, and begin in position 1 of the string)
So, for example
COL1 COL2
AA999AAB AA999
AA999ABB AA999A
AA999B
AA999AB
The string in COL2 that has the longest match to AA999AAB in COL1 is AA999A (i.e. AA999 in COL2 is not the match I want)
The string in COL2 that has the longest match to AA999ABB in COL1 is AA999AB (i.e. AA999 and AA999A is not the match I want)
and so on
How can I do this? I have written a very convoluted looping program to get this done, but I'm hoping there is a smarter and more efficient way, using any of: data step, SQL, regular expressions, IML, whatever.
Thanks!
Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:
data a;
input COL1 : $20.;
cards;
AA999AAB
AA999ABB
AA999A
;
run;
data b;
input COL2 : $20.;
cards;
AA999
AA999A
AA999B
AA999AB
;
run;
proc sql;
create table want as
select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2
from a;
quit;
Good luck,
Haikuo
Well, there is the soundex function which codes a word up. Not sure how close though. Am also thinking of arrays but puzzling over a simple technique.
I don't have an immediate answer, but SAS has some "distance between words" functions that might help. I discuss them in this article:
http://blogs.sas.com/content/iml/2011/10/05/distances-between-words/
In the article I used the SPEDIS function, which is apparently not what you want. You want a function that "rewards" exact matches at the beginning of a word. Nevertheless, here is some code that shows how to use the SPEDIS function. Perhaps it will give you a place to start. If none of the functions in the article perform the match you want, then you will have to write the function "cost function" yourself.
data A;
input COL1 $12.;
datalines;
AA999AAB
AA999ABB
;
data B;
input COL2 $12.;
datalines;
AA999
AA999A
AA999B
AA999AB
;
proc iml;
use A; read all var {col1} into x; close A;
use B; read all var {col2} into y; close B;
cost = j(nrow(x), nrow(y),.);
do i = 1 to nrow(x);
word = x;
cost[i,] = spedis(y`, word);
end;
print cost[rowname=x colname=y];
Rick,
OP asked the first N character matched , not the minimize spell distance . COMPARE() might be a good choice.
Xia Keshan
How about this one :
data a; input COL1 : $20.; cards; AA999AAB AA999ABB ; run; data b; input COL2 : $20.; cards; AA999 AA999A AA999B AA999AB ; run; proc sql; create table want as select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il')); quit;
Xia Keshan
One thing need to remind : Is the length of COL1 is always greater than the length of COL2 ? otherwise ,you need to add abs() around it .
Ok, I was just thinking that COL1 length could be the same as COL2 length, but your statement "you need to add abs() around it" is somewhat vague ... add abs() around what?
Specifically, if COL1 also contains AA999A, I would want it to match AA999A in COL2, and because COMPARE returns a 0 when the strings are identical, the SQL needs to be modified for this case, but specifically how?
OK.Check this example . the output is what you need ?
I have to go to sleep now. Bye.
data a;
input COL1 : $20.;
cards;
AA999AAB
AA999ABB
;
run;
data b;
input COL2 : $20.;
cards;
AA999
AA999A
AA999B
AA999ABASD
;
run;
proc sql;
create table want as
select * from a,b group by col1 having compare(col1,col2,'il')=max(compare(col1,col2,'il'));
quit;
Message was edited by: xia keshan
No that's the wrong result, AA999ABB in COL1 should match AA999A in COL2 in this case. COL2 will be shorter than or the same length as COL1, I don't think in my situation COL2 can be longer than COL1.
But specifically, if the code is
data a;
input COL1 : $20.;
cards;
AA999AAB
AA999ABB
AA999A
;
run;
data b;
input COL2 : $20.;
cards;
AA999
AA999A
AA999B
AA999AB
;
run;
Note: I have added AA999A to COL1, then the result I want is AA999AAB and AA999ABB are matched properly by your first code example, but AA999A in COL1 is matched to AA999 in COL2 using your first code example, and I want it matched to AA999A in COL2.
Okay, I think I got it. The condition that must be included into the solution is that the length(COL1)>=length(COL2). This is easy to add in.
What is harder to adjust for is that if the two strings match (e.g. if COL1='AA999A' and COL2='AA999A'), then the COMPARE function gives us a zero as an answer, and so the MAX function in 's solution doesn't handle this case.
So, we create a function of the COL1 and COL2 values that turns the zero into a large number.
proc sql;
create table want as
select * from a,b group by col1 having
((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999) =
max(((compare(col1,col2,'il')>0)*compare(col1,col2,'il') + (compare(col1,col2,'il')=0)*999))
and length(col1)>=length(col2);
quit;
So zeros get turned into 999 while the other results from COMPARE remain unchanged.
Here is the correct data set WANT:
Obs COL1 COL2
1 AA999A AA999A
2 AA999AAB AA999A
3 AA999ABB AA999AB
Anyone got a simpler solution?
As long as you make the length of col1 always greater or equal than the length of col2, and want include case they are same.
data a; input COL1 : $20.; cards; AA999AAB AA999ABB AA999A ; run; data b; input COL2 : $20.; cards; AA999 AA999A AA999B AA999AB ; run; proc sql; create table want as select *,case when compare(col1,col2,'il') eq 0 then 999999 when compare(col1,col2,'il') gt 0 then compare(col1,col2,'il') end as dis from a,b group by col1 having calculated dis=max(calculated dis) ; quit;
Xia Keshan
Paige,
I liked 's solution, but it didn't work on the extra test cases I added. The compare function returns a 0 if there is an exact match, otherwise it returns the first character that is different. As such, here is my proposed solution:
data have;
input (COL1 COL2) ($);
cards;
AA999AAB AA999
AA999ABB AA999A
BB999AAB AA999B
. AA999AA
. AA999AB
. AA999ABB
;
data col1;
set have (keep=col1 where=(not missing(col1)));
run;
data col2;
set have (keep=col2);
run;
proc sql;
create table want as
select *,
case compare(col1,col2,'il')
when 0 then length(col1)
else compare(col1,col2,'il')-1
end as length
from col1,col2
group by col1
having (calculated length>0) and
calculated length=max(calculated length)
;
quit;
Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched? Will they always compare from the first char? If both Yes, then try the following:
data a;
input COL1 : $20.;
cards;
AA999AAB
AA999ABB
AA999A
;
run;
data b;
input COL2 : $20.;
cards;
AA999
AA999A
AA999B
AA999AB
;
run;
proc sql;
create table want as
select col1, (select distinct col2 from b where col1 eqt trim(col2) having length(col2) = max(length(col2))) as col2
from a;
quit;
Good luck,
Haikuo
Your requirement still seems a little vague to me, will COL2 have to inclusive to COL1 if matched?
Here are the requirements. The desired output is when COL1 and COL2 have the same characters in positions 1 through n, where n is the maximum value of all comparisons of each COL1 with all COL2; and length(COL1)>=length(COL2)
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.