Solved: Re: how to merge data based on similarity

abcd123 · Posted 07-21-2013 12:04 PM

Dear All

Suppose I have following two datasets, test1 and test2:

data test1;

input key1 $ value1;

cards;

ABC 8

XYZ 9

;

data test2;

input key2 $ value2;

cards;

ABC1 88

XYZ1 99

ABC2 888

XYZ2 999

;

I want to join the two datasets based on SIMILARITY between key1 and key2;

such as:

proc sql;

create table test as

select a.key1 as key,

b.key2 as key_alias,

a.value1 as value1,

b.value2 as value2

from test1 a

inner join

test2 b

on a. value1 SIMILAR TO b.value2;

quit;

The desired result would be:

KEY KEY_ALIAS VALUE1 VALUE2

ABC ABC1 8 88

ABC ABC2 8 888

XYZ XYZ1 9 99

XYZ XYZ2 9 999

My question is: how should I write the sql statement (the one that is colored red)?

PGStats · Posted 07-21-2013 01:15 PM

It depends a lot on what you want to call similar and the size of the task. Here are two approaches:

data test1;
input key1 $ value1;
cards;
ABC 8
XYZ 9
;

data test2;
input key2 $ value2;
cards;
ABC1 88
XYZ1 99
ABC2 888
XYZ2 999

0ABC 8880

0XYZ 9990
ABCxxx 8888
XYZyyy 9999
;

proc sql;
create table testLong as
select a.key1 as key,
         b.key2 as key_alias,
         a.value1 as value1,
         b.value2 as value2
from test1 a
       inner join
       test2 b
on complev(a.key1, b.key2, 2) < 2;

create table testShort as
select a.key1 as key,
         b.key2 as key_alias,
         a.value1 as value1,
         b.value2 as value2
from test1 a
       inner join
       test2 b
on key1 eqt key2;
quit;

The first one is based on edit distance. It is quite flexible. The second approach simply checks equality of the keys up to the length of the shortest. Both methods require every key of your first table to be compared with every key of the second.

PG

View solution in original post

sam369 · Posted 07-21-2013 12:33 PM

Hi,

some one will come with better solution, the sample code will give what you want

proc sql;

create table want as

select a.*,b.*

from test1 as a inner join test2 as b

on a.key1=substr(key2,1,3)

order by a.key1,b.key2;

quit;

Thanks

Sam

PGStats · Posted 07-21-2013 01:15 PM

It depends a lot on what you want to call similar and the size of the task. Here are two approaches:

data test1;
input key1 $ value1;
cards;
ABC 8
XYZ 9
;

data test2;
input key2 $ value2;
cards;
ABC1 88
XYZ1 99
ABC2 888
XYZ2 999

0ABC 8880

0XYZ 9990
ABCxxx 8888
XYZyyy 9999
;

proc sql;
create table testLong as
select a.key1 as key,
         b.key2 as key_alias,
         a.value1 as value1,
         b.value2 as value2
from test1 a
       inner join
       test2 b
on complev(a.key1, b.key2, 2) < 2;

create table testShort as
select a.key1 as key,
         b.key2 as key_alias,
         a.value1 as value1,
         b.value2 as value2
from test1 a
       inner join
       test2 b
on key1 eqt key2;
quit;

The first one is based on edit distance. It is quite flexible. The second approach simply checks equality of the keys up to the length of the shortest. Both methods require every key of your first table to be compared with every key of the second.

PG

abcd123 · Posted 07-21-2013 03:58 PM

Thank you very much. Your method is exactly what I am looking for.

In addition, may I ask: can your method be realized using data step instead of proc sql? Thank you very much again for help.

PGStats · Posted 07-21-2013 05:14 PM

Yes, I can think of datastep methods involving arrays, random access or hash objects, but unless proven wrong, none is straightforward or clearly advantageous. - PG

PG

abcd123 · Posted 07-21-2013 05:53 PM

Thank you very much. Really appreciate it.

Ksharp · Posted 07-21-2013 09:47 PM

There is an example in SQL documentation. You can check it out.

data test1;
input key1 $ value1;
cards;
ABC 8
XYZ  9
;

data test2;
input key2 $ value2;
cards;
ABC1 88
XYZ1  99
ABC2 888
XYZ2  999
;
run;
proc sql;
create table want as
 select * 
  from test1,test2
   where key2 contains strip(key1);
quit;

Ksharp

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away