Dear All
Suppose I have following two datasets, test1 and test2:
data test1;
input key1 $ value1;
cards;
ABC 8
XYZ 9
;
data test2;
input key2 $ value2;
cards;
ABC1 88
XYZ1 99
ABC2 888
XYZ2 999
;
I want to join the two datasets based on SIMILARITY between key1 and key2;
such as:
proc sql;
create table test as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on a. value1 SIMILAR TO b.value2;
quit;
The desired result would be:
KEY KEY_ALIAS VALUE1 VALUE2
ABC ABC1 8 88
ABC ABC2 8 888
XYZ XYZ1 9 99
XYZ XYZ2 9 999
My question is: how should I write the sql statement (the one that is colored red)?
It depends a lot on what you want to call similar and the size of the task. Here are two approaches:
data test1;
input key1 $ value1;
cards;
ABC 8
XYZ 9
;
data test2;
input key2 $ value2;
cards;
ABC1 88
XYZ1 99
ABC2 888
XYZ2 999
0ABC 8880
0XYZ 9990
ABCxxx 8888
XYZyyy 9999
;
proc sql;
create table testLong as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on complev(a.key1, b.key2, 2) < 2;
create table testShort as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on key1 eqt key2;
quit;
The first one is based on edit distance. It is quite flexible. The second approach simply checks equality of the keys up to the length of the shortest. Both methods require every key of your first table to be compared with every key of the second.
PG
Hi,
some one will come with better solution, the sample code will give what you want
proc sql;
create table want as
select a.*,b.*
from test1 as a inner join test2 as b
on a.key1=substr(key2,1,3)
order by a.key1,b.key2;
quit;
Thanks
Sam
It depends a lot on what you want to call similar and the size of the task. Here are two approaches:
data test1;
input key1 $ value1;
cards;
ABC 8
XYZ 9
;
data test2;
input key2 $ value2;
cards;
ABC1 88
XYZ1 99
ABC2 888
XYZ2 999
0ABC 8880
0XYZ 9990
ABCxxx 8888
XYZyyy 9999
;
proc sql;
create table testLong as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on complev(a.key1, b.key2, 2) < 2;
create table testShort as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on key1 eqt key2;
quit;
The first one is based on edit distance. It is quite flexible. The second approach simply checks equality of the keys up to the length of the shortest. Both methods require every key of your first table to be compared with every key of the second.
PG
Thank you very much. Your method is exactly what I am looking for.
In addition, may I ask: can your method be realized using data step instead of proc sql? Thank you very much again for help.
Yes, I can think of datastep methods involving arrays, random access or hash objects, but unless proven wrong, none is straightforward or clearly advantageous. - PG
Thank you very much. Really appreciate it.
There is an example in SQL documentation. You can check it out.
data test1; input key1 $ value1; cards; ABC 8 XYZ 9 ; data test2; input key2 $ value2; cards; ABC1 88 XYZ1 99 ABC2 888 XYZ2 999 ; run; proc sql; create table want as select * from test1,test2 where key2 contains strip(key1); quit;
Ksharp
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.