Merging based on a large number of conditions

Accepted Solution Solved
Reply
Occasional Contributor JMD
Occasional Contributor
Posts: 5
Accepted Solution

Merging based on a large number of conditions

Good morning,

I have to merge 2 datasets (one-to-many) on a long list of condition, more than 20 factors. So if I have to use Proc SQL, the section of "ON a.var=b.var" will be very long.

I wonder if there is any better way to do the merge both in terms of the code and the speed.

Many thanks,

JHN

data condition;
input time one two three ;
datalines;
1 3 4 5
2 . 6 8
3 2 . .
4 9 6 .
;;;;

data source;
input number_client one two three ;
datalines;
7 3 4 5
8 3 6 8
9 2 5 6
10 2 6 9
11 3 6 8
12 3 6 9
;;
run;

proc sql;
create table want
as select * from condition as a left join source as b
on (a.one = b.one or a.one is null)
and (a.two = b.two or a.two is null)
and (a.three = b.three or a.three is null)
; quit;


Accepted Solutions
Solution
‎02-08-2014 02:30 PM
Respected Advisor
Posts: 4,663

Re: Merging based on a large number of conditions

Joining on OR conditions is always problematic. You could try going from wide to long before the query :

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condList(keep=time i item);

set condition;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  if not missing(item) then output;

  end;

run;

data sourceList(keep=client i item);

set source;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  output;

  end;

run;

proc sql;

create table timeClient as

select time, client

from

  condList as a inner join

  sourceList as b on a.i=b.i

group by time, client

having min(a.item=b.item)=1;

create table want as

select c.*, client

from condition as c natural left join timeClient;

quit;

PG

PG

View solution in original post


All Replies
Solution
‎02-08-2014 02:30 PM
Respected Advisor
Posts: 4,663

Re: Merging based on a large number of conditions

Joining on OR conditions is always problematic. You could try going from wide to long before the query :

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condList(keep=time i item);

set condition;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  if not missing(item) then output;

  end;

run;

data sourceList(keep=client i item);

set source;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  output;

  end;

run;

proc sql;

create table timeClient as

select time, client

from

  condList as a inner join

  sourceList as b on a.i=b.i

group by time, client

having min(a.item=b.item)=1;

create table want as

select c.*, client

from condition as c natural left join timeClient;

quit;

PG

PG
Respected Advisor
Posts: 3,908

Re: Merging based on a large number of conditions

I've run a few coding variations (inclusive Pierre's one) using fullstimer. The results were not conclusive and you will have to test with your real data.

What I don't understand: You say the relationships between your tables are "one to many" and this is also the case in the test data provided. In your real data: Couldn't it be that over time a patient has more than one condition? If so then the relationship would be many to many.

Eg. you give us this row in the condition data set: "3 2 . .". Does this mean that the "2" won't be used anywhere else, eg. for something like "3 2 . 6" ?

You're having "time" in your condition data set. In case it's many to many and you would only want to pick the latest matching condition then a data step approach would beat any SQL as you could stop comparisons as soon as there is a match (having the condition dataset sorted by descending time).

About the 20 factors and typing: You could always write a little macro which generates the repetitive code.

And here the code variations. On my laptop there is not really a clear "winner" - but as said: This might look very different when running stuff with real data.

options fullstimer;

data condition;
  input time one two three;
  datalines;
1 3 4 5
2 . 6 8
3 2 . .
4 9 6 .
;;;;

data source;
  do number_client=1 to 1000000;
    one=ceil(ranuni(1)*9);
    two=ceil(ranuni(1)*9);
    three=ceil(ranuni(1)*9);
    output;
  end;
  stop;
run;

/* variant 1  */
proc sql feedback;
  create table want1 as
    select
      a.time,
      a.one as a_one, b.one as b_one,
      a.two as a_two, b.two as b_two,
      a.three as a_three, b.three as b_three,
      b.number_client
    from condition as a left join source as b
      on (a.one = b.one or a.one is null)
      and (a.two = b.two or a.two is null)
      and (a.three = b.three or a.three is null)
  ;
quit;

/* variant 2  */
proc sql feedback;
  create table want2 as
    select
      a.time,
      a.one as a_one, b.one as b_one,
      a.two as a_two, b.two as b_two,
      a.three as a_three, b.three as b_three,
      b.number_client
    from condition as a left join source as b
      on (
      b.one =coalesce(a.one,b.one)
      and b.two =coalesce(a.two,b.two)
      and b.three =coalesce(a.three,b.three)
      )
  ;
quit;

/* variant 3  */
sasfile work.condition load;

data want3(drop=_Smiley Happy;
  set source
    (rename=(one=_one two=_two three=_three))
  ;

  do _i=1 to nobs;
    set condition nobs=nobs point=_i;
      if
      _one =coalesce(one,_one)
      and _two =coalesce(two,_two)
      and _three =coalesce(three,_three) then
    output;
  end;
run;

sasfile work.condition close;

/* variant 4  */
data condList(keep=time i item);
  set condition;
  array items{*} one -- three;

  do i = 1 to dim(items);
    item = items{i};

    if not missing(item) then
      output;
  end;
run;

data sourceList(keep=number_client i item);
  set source;
  array items{*} one -- three;

  do i = 1 to dim(items);
    item = items{i};
    output;
  end;
run;

proc sql;
  create table timeClient as

  select time, number_client

  from

    condList as a inner join

    sourceList as b on a.i=b.i

  group by time, number_client

  having min(a.item=b.item)=1;
  create table want as

  select c.*, number_client

  from condition as c natural left join timeClient;
quit;

Occasional Contributor JMD
Occasional Contributor
Posts: 5

Re: Merging based on a large number of conditions

Hi PR and Patric.

Thank you for you help.

They work perfectly.

JHN

Respected Advisor
Posts: 3,908

Re: Merging based on a large number of conditions

If there are not too many different codes then you could also expand your condition table so that it contains all the combinations (and no more missings). This would then allow to use a hash lookup.

Respected Advisor
Posts: 4,663

Re: Merging based on a large number of conditions

Theoretically, the join could also be based on the LIKE operator:

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condChar;

set condition;

length str $32;

array items{*} one -- three;

do i = 1 to dim(items);

  if missing(items{i}) then

  substr(str,i,1) = "_";

  else

  substr(str,i,1) = put(items{i}, 1.);

  end;

run;

data sourceChar;

set source;

length str $32;

array items{*} one -- three;

do i = 1 to dim(items);

  substr(str,i,1) = put(items{i}, 1.);

  end;

run;

proc sql;

create table want2 as

select a.*, b.client

from

  condChar as a left join

  sourceChar as b on b.str like a.str;

quit;

But proc SQL warns that it cannot optimize this join.

PG

PG
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 276 views
  • 3 likes
  • 3 in conversation