Solved
Contributor
Posts: 51

# how to merge data based on similarity

Dear All

Suppose I have following two datasets, test1 and test2:

data test1;

input key1 \$ value1;

cards;

ABC 8

XYZ  9

;

data test2;

input key2 \$ value2;

cards;

ABC1 88

XYZ1  99

ABC2 888

XYZ2  999

;

I want to join the two datasets based on SIMILARITY between key1 and key2;

such as:

proc sql;

create table test as

select a.key1 as key,

b.key2 as key_alias,

a.value1 as value1,

b.value2 as value2

from test1 a

inner join

test2 b

on a. value1 SIMILAR TO b.value2;

quit;

The desired result would be:

KEY   KEY_ALIAS VALUE1  VALUE2

ABC    ABC1          8             88

ABC    ABC2          8             888

XYZ     XYZ1           9             99

XYZ     XYZ2           9             999

My question is: how should I write the sql statement (the one that is colored red)?

Accepted Solutions
Solution
‎07-21-2013 01:15 PM
Posts: 5,521

## Re: how to merge data based on similarity

It depends a lot on what you want to call similar and the size of the task. Here are two approaches:

data test1;
input key1 \$ value1;
cards;
ABC 8
XYZ 9
;

data test2;
input key2 \$ value2;
cards;
ABC1 88
XYZ1 99
ABC2 888
XYZ2 999

0ABC 8880

0XYZ 9990
ABCxxx 8888
XYZyyy 9999
;

proc sql;
create table testLong as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on complev(a.key1, b.key2, 2) < 2;

create table testShort as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on key1 eqt key2;
quit;

The first one is based on edit distance. It is quite flexible. The second approach simply checks equality of the keys up to the length of the shortest. Both methods require every key of your first table to be compared with every key of the second.

PG

PG

All Replies
Regular Contributor
Posts: 171

## Re: how to merge data based on similarity

Hi,

some one will come with better solution, the sample code will give what you want

proc sql;

create table want as

select a.*,b.*

from test1 as a inner join test2 as b

on a.key1=substr(key2,1,3)

order by a.key1,b.key2;

quit;

Thanks

Sam

Solution
‎07-21-2013 01:15 PM
Posts: 5,521

## Re: how to merge data based on similarity

It depends a lot on what you want to call similar and the size of the task. Here are two approaches:

data test1;
input key1 \$ value1;
cards;
ABC 8
XYZ 9
;

data test2;
input key2 \$ value2;
cards;
ABC1 88
XYZ1 99
ABC2 888
XYZ2 999

0ABC 8880

0XYZ 9990
ABCxxx 8888
XYZyyy 9999
;

proc sql;
create table testLong as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on complev(a.key1, b.key2, 2) < 2;

create table testShort as
select a.key1 as key,
b.key2 as key_alias,
a.value1 as value1,
b.value2 as value2
from test1 a
inner join
test2 b
on key1 eqt key2;
quit;

The first one is based on edit distance. It is quite flexible. The second approach simply checks equality of the keys up to the length of the shortest. Both methods require every key of your first table to be compared with every key of the second.

PG

PG
Contributor
Posts: 51

## Re: how to merge data based on similarity

Thank you very much. Your method is exactly what I am looking for.

In addition, may I ask: can your method be realized using data step instead of proc sql? Thank you very much again for help.

Posts: 5,521

## Re: how to merge data based on similarity

Yes, I can think of datastep methods involving arrays, random access or hash objects, but unless proven wrong, none is straightforward or clearly advantageous. - PG

PG
Contributor
Posts: 51

## Re: how to merge data based on similarity

Thank you very much. Really appreciate it.

Super User
Posts: 10,766

## Re: how to merge data based on similarity

There is an example in SQL documentation. You can check it out.

```data test1;
input key1 \$ value1;
cards;
ABC 8
XYZ  9
;

data test2;
input key2 \$ value2;
cards;
ABC1 88
XYZ1  99
ABC2 888
XYZ2  999
;
run;
proc sql;
create table want as
select *
from test1,test2
where key2 contains strip(key1);
quit;
```

Ksharp

🔒 This topic is solved and locked.