BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JMD
Calcite | Level 5 JMD
Calcite | Level 5

Good morning,

I have to merge 2 datasets (one-to-many) on a long list of condition, more than 20 factors. So if I have to use Proc SQL, the section of "ON a.var=b.var" will be very long.

I wonder if there is any better way to do the merge both in terms of the code and the speed.

Many thanks,

JHN

data condition;
input time one two three ;
datalines;
1 3 4 5
2 . 6 8
3 2 . .
4 9 6 .
;;;;

data source;
input number_client one two three ;
datalines;
7 3 4 5
8 3 6 8
9 2 5 6
10 2 6 9
11 3 6 8
12 3 6 9
;;
run;

proc sql;
create table want
as select * from condition as a left join source as b
on (a.one = b.one or a.one is null)
and (a.two = b.two or a.two is null)
and (a.three = b.three or a.three is null)
; quit;

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Joining on OR conditions is always problematic. You could try going from wide to long before the query :

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condList(keep=time i item);

set condition;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  if not missing(item) then output;

  end;

run;

data sourceList(keep=client i item);

set source;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  output;

  end;

run;

proc sql;

create table timeClient as

select time, client

from

  condList as a inner join

  sourceList as b on a.i=b.i

group by time, client

having min(a.item=b.item)=1;

create table want as

select c.*, client

from condition as c natural left join timeClient;

quit;

PG

PG

View solution in original post

5 REPLIES 5
PGStats
Opal | Level 21

Joining on OR conditions is always problematic. You could try going from wide to long before the query :

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condList(keep=time i item);

set condition;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  if not missing(item) then output;

  end;

run;

data sourceList(keep=client i item);

set source;

array items{*} one -- three;

do i = 1 to dim(items);

  item = items{i};

  output;

  end;

run;

proc sql;

create table timeClient as

select time, client

from

  condList as a inner join

  sourceList as b on a.i=b.i

group by time, client

having min(a.item=b.item)=1;

create table want as

select c.*, client

from condition as c natural left join timeClient;

quit;

PG

PG
Patrick
Opal | Level 21

I've run a few coding variations (inclusive Pierre's one) using fullstimer. The results were not conclusive and you will have to test with your real data.

What I don't understand: You say the relationships between your tables are "one to many" and this is also the case in the test data provided. In your real data: Couldn't it be that over time a patient has more than one condition? If so then the relationship would be many to many.

Eg. you give us this row in the condition data set: "3 2 . .". Does this mean that the "2" won't be used anywhere else, eg. for something like "3 2 . 6" ?

You're having "time" in your condition data set. In case it's many to many and you would only want to pick the latest matching condition then a data step approach would beat any SQL as you could stop comparisons as soon as there is a match (having the condition dataset sorted by descending time).

About the 20 factors and typing: You could always write a little macro which generates the repetitive code.

And here the code variations. On my laptop there is not really a clear "winner" - but as said: This might look very different when running stuff with real data.

options fullstimer;

data condition;
  input time one two three;
  datalines;
1 3 4 5
2 . 6 8
3 2 . .
4 9 6 .
;;;;

data source;
  do number_client=1 to 1000000;
    one=ceil(ranuni(1)*9);
    two=ceil(ranuni(1)*9);
    three=ceil(ranuni(1)*9);
    output;
  end;
  stop;
run;

/* variant 1  */
proc sql feedback;
  create table want1 as
    select
      a.time,
      a.one as a_one, b.one as b_one,
      a.two as a_two, b.two as b_two,
      a.three as a_three, b.three as b_three,
      b.number_client
    from condition as a left join source as b
      on (a.one = b.one or a.one is null)
      and (a.two = b.two or a.two is null)
      and (a.three = b.three or a.three is null)
  ;
quit;

/* variant 2  */
proc sql feedback;
  create table want2 as
    select
      a.time,
      a.one as a_one, b.one as b_one,
      a.two as a_two, b.two as b_two,
      a.three as a_three, b.three as b_three,
      b.number_client
    from condition as a left join source as b
      on (
      b.one =coalesce(a.one,b.one)
      and b.two =coalesce(a.two,b.two)
      and b.three =coalesce(a.three,b.three)
      )
  ;
quit;

/* variant 3  */
sasfile work.condition load;

data want3(drop=_:);
  set source
    (rename=(one=_one two=_two three=_three))
  ;

  do _i=1 to nobs;
    set condition nobs=nobs point=_i;
      if
      _one =coalesce(one,_one)
      and _two =coalesce(two,_two)
      and _three =coalesce(three,_three) then
    output;
  end;
run;

sasfile work.condition close;

/* variant 4  */
data condList(keep=time i item);
  set condition;
  array items{*} one -- three;

  do i = 1 to dim(items);
    item = items{i};

    if not missing(item) then
      output;
  end;
run;

data sourceList(keep=number_client i item);
  set source;
  array items{*} one -- three;

  do i = 1 to dim(items);
    item = items{i};
    output;
  end;
run;

proc sql;
  create table timeClient as

  select time, number_client

  from

    condList as a inner join

    sourceList as b on a.i=b.i

  group by time, number_client

  having min(a.item=b.item)=1;
  create table want as

  select c.*, number_client

  from condition as c natural left join timeClient;
quit;

JMD
Calcite | Level 5 JMD
Calcite | Level 5

Hi PR and Patric.

Thank you for you help.

They work perfectly.

JHN

Patrick
Opal | Level 21

If there are not too many different codes then you could also expand your condition table so that it contains all the combinations (and no more missings). This would then allow to use a hash lookup.

PGStats
Opal | Level 21

Theoretically, the join could also be based on the LIKE operator:

data condition;

input time one two three ;

datalines;

1 3 4 5

2 . 6 8

3 2 . .

4 9 6 .

;

data source;

input client one two three ;

datalines;

7 3 4 5

8 3 6 8

9 2 5 6

10 2 6 9

11 3 6 8

12 3 6 9

;

data condChar;

set condition;

length str $32;

array items{*} one -- three;

do i = 1 to dim(items);

  if missing(items{i}) then

  substr(str,i,1) = "_";

  else

  substr(str,i,1) = put(items{i}, 1.);

  end;

run;

data sourceChar;

set source;

length str $32;

array items{*} one -- three;

do i = 1 to dim(items);

  substr(str,i,1) = put(items{i}, 1.);

  end;

run;

proc sql;

create table want2 as

select a.*, b.client

from

  condChar as a left join

  sourceChar as b on b.str like a.str;

quit;

But proc SQL warns that it cannot optimize this join.

PG

PG

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 669 views
  • 3 likes
  • 3 in conversation