## Merging based on a large number of conditions

Solved
Occasional Contributor
Posts: 5

# Merging based on a large number of conditions

Good morning,

I have to merge 2 datasets (one-to-many) on a long list of condition, more than 20 factors. So if I have to use Proc SQL, the section of "ON a.var=b.var" will be very long.

I wonder if there is any better way to do the merge both in terms of the code and the speed.

Many thanks,

JHN

data condition;
input time one two three ;
datalines;
1 3 4 5
2 . 6 8
3 2 . .
4 9 6 .
;;;;

data source;
input number_client one two three ;
datalines;
7 3 4 5
8 3 6 8
9 2 5 6
10 2 6 9
11 3 6 8
12 3 6 9
;;
run;

proc sql;
create table want
as select * from condition as a left join source as b
on (a.one = b.one or a.one is null)
and (a.two = b.two or a.two is null)
and (a.three = b.three or a.three is null)
; quit;

Accepted Solutions
Solution
‎02-08-2014 02:30 PM
Posts: 5,528

## Re: Merging based on a large number of conditions

Joining on OR conditions is always problematic. You could try going from wide to long before the query :

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condList(keep=time i item);

set condition;

array items{*} one -- three;

do i = 1 to dim(items);

item = items{i};

if not missing(item) then output;

end;

run;

data sourceList(keep=client i item);

set source;

array items{*} one -- three;

do i = 1 to dim(items);

item = items{i};

output;

end;

run;

proc sql;

create table timeClient as

select time, client

from

condList as a inner join

sourceList as b on a.i=b.i

group by time, client

having min(a.item=b.item)=1;

create table want as

select c.*, client

from condition as c natural left join timeClient;

quit;

PG

PG

All Replies
Solution
‎02-08-2014 02:30 PM
Posts: 5,528

## Re: Merging based on a large number of conditions

Joining on OR conditions is always problematic. You could try going from wide to long before the query :

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condList(keep=time i item);

set condition;

array items{*} one -- three;

do i = 1 to dim(items);

item = items{i};

if not missing(item) then output;

end;

run;

data sourceList(keep=client i item);

set source;

array items{*} one -- three;

do i = 1 to dim(items);

item = items{i};

output;

end;

run;

proc sql;

create table timeClient as

select time, client

from

condList as a inner join

sourceList as b on a.i=b.i

group by time, client

having min(a.item=b.item)=1;

create table want as

select c.*, client

from condition as c natural left join timeClient;

quit;

PG

PG
Posts: 4,736

## Re: Merging based on a large number of conditions

I've run a few coding variations (inclusive Pierre's one) using fullstimer. The results were not conclusive and you will have to test with your real data.

What I don't understand: You say the relationships between your tables are "one to many" and this is also the case in the test data provided. In your real data: Couldn't it be that over time a patient has more than one condition? If so then the relationship would be many to many.

Eg. you give us this row in the condition data set: "3 2 . .". Does this mean that the "2" won't be used anywhere else, eg. for something like "3 2 . 6" ?

You're having "time" in your condition data set. In case it's many to many and you would only want to pick the latest matching condition then a data step approach would beat any SQL as you could stop comparisons as soon as there is a match (having the condition dataset sorted by descending time).

About the 20 factors and typing: You could always write a little macro which generates the repetitive code.

And here the code variations. On my laptop there is not really a clear "winner" - but as said: This might look very different when running stuff with real data.

options fullstimer;

data condition;
input time one two three;
datalines;
1 3 4 5
2 . 6 8
3 2 . .
4 9 6 .
;;;;

data source;
do number_client=1 to 1000000;
one=ceil(ranuni(1)*9);
two=ceil(ranuni(1)*9);
three=ceil(ranuni(1)*9);
output;
end;
stop;
run;

/* variant 1  */
proc sql feedback;
create table want1 as
select
a.time,
a.one as a_one, b.one as b_one,
a.two as a_two, b.two as b_two,
a.three as a_three, b.three as b_three,
b.number_client
from condition as a left join source as b
on (a.one = b.one or a.one is null)
and (a.two = b.two or a.two is null)
and (a.three = b.three or a.three is null)
;
quit;

/* variant 2  */
proc sql feedback;
create table want2 as
select
a.time,
a.one as a_one, b.one as b_one,
a.two as a_two, b.two as b_two,
a.three as a_three, b.three as b_three,
b.number_client
from condition as a left join source as b
on (
b.one =coalesce(a.one,b.one)
and b.two =coalesce(a.two,b.two)
and b.three =coalesce(a.three,b.three)
)
;
quit;

/* variant 3  */

data want3(drop=_;
set source
(rename=(one=_one two=_two three=_three))
;

do _i=1 to nobs;
set condition nobs=nobs point=_i;
if
_one =coalesce(one,_one)
and _two =coalesce(two,_two)
and _three =coalesce(three,_three) then
output;
end;
run;

sasfile work.condition close;

/* variant 4  */
data condList(keep=time i item);
set condition;
array items{*} one -- three;

do i = 1 to dim(items);
item = items{i};

if not missing(item) then
output;
end;
run;

data sourceList(keep=number_client i item);
set source;
array items{*} one -- three;

do i = 1 to dim(items);
item = items{i};
output;
end;
run;

proc sql;
create table timeClient as

select time, number_client

from

condList as a inner join

sourceList as b on a.i=b.i

group by time, number_client

having min(a.item=b.item)=1;
create table want as

select c.*, number_client

from condition as c natural left join timeClient;
quit;

Occasional Contributor
Posts: 5

## Re: Merging based on a large number of conditions

Hi PR and Patric.

Thank you for you help.

They work perfectly.

JHN

Posts: 4,736

## Re: Merging based on a large number of conditions

If there are not too many different codes then you could also expand your condition table so that it contains all the combinations (and no more missings). This would then allow to use a hash lookup.

Posts: 5,528

## Re: Merging based on a large number of conditions

Theoretically, the join could also be based on the LIKE operator:

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condChar;

set condition;

length str \$32;

array items{*} one -- three;

do i = 1 to dim(items);

if missing(items{i}) then

substr(str,i,1) = "_";

else

substr(str,i,1) = put(items{i}, 1.);

end;

run;

data sourceChar;

set source;

length str \$32;

array items{*} one -- three;

do i = 1 to dim(items);

substr(str,i,1) = put(items{i}, 1.);

end;

run;

proc sql;

create table want2 as

select a.*, b.client

from

condChar as a left join

sourceChar as b on b.str like a.str;

quit;

But proc SQL warns that it cannot optimize this join.

PG

PG
🔒 This topic is solved and locked.