topic Re: How to optimise two SQL inner joins and a union? in SAS Programming

How to optimise two SQL inner joins and a union?

al15 — Mon, 30 Oct 2023 18:07:05 GMT

Hi,

I have 3 datasets (with millions of observations). The masterfile contains a variable "record" that links it with the other datasets. The other 2 datasets contain the same variables.

I can get my rough code to work, but ideally the merge would occur in one SQL step, if possible. Even in the individual SQL steps I receive a warning "WARNING: Variable record already exists on file WORK.WANT1." & "WARNING: Variable record already exists on file WORK.WANT2."

I would really appreciate help optimising this code as I am new to SQL.

Thank you!

data a;
input record type;
datalines;
1 5
2 8
3 2
4 1
5 100
6 21
;
run;

data b;
input record treat;
datalines;
1 8
5 9
7 12
8 200
;
run;

data c;
input record treat;
datalines;
2 10
3 8
9 11
10 200
;
run;

data want;
input record type treat;
datalines;
1 5 8
2 8 10
3 2 8
5 100 9
;
run;

/*This rough code generates the dataset that I want*/
proc sql;
create table want1 as
select * from a as A
inner join b
on a.record=b.record
;
quit;

proc sql;
create table want2 as
select * from a as A
inner join c
on a.record=c.record
;
quit;

proc sql;
    create table want_test as
    select * from want1
    union
    select * from want2;
quit;

Re: How to optimise two SQL inner joins and a union?

Kurt_Bremser — Mon, 30 Oct 2023 18:29:31 GMT

DO NOT USE THE ASTERISK!

When doing joins, always use an explicit list of variables, and use table aliases to make sure from which dataset a variable is taken.

Re: How to optimise two SQL inner joins and a union?

mkeintz — Mon, 30 Oct 2023 23:36:28 GMT

If each dataset is sorted by RECORD, and each dataset has no more than 1 record per RECORD, then you have a much simpler solution available in the DATA step:

data want;
  merge a (in=ina)  b (in=inb)  c (in=inc);
  by record;
  if ina=1  and (inb=1 or inc=1);
run;

Re: How to optimise two SQL inner joins and a union?

al15 — Tue, 31 Oct 2023 08:26:31 GMT

Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.

Re: How to optimise two SQL inner joins and a union?

ballardw — Tue, 31 Oct 2023 14:56:25 GMT

@al15 wrote:

Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.

SQL still has to order the data for any join on criteria. Using Proc SQL when the data isn't sorted just means that SQL runs slower.

You might want to investigate use of INDEXes with your variables used for matching if used frequently as the indexes may speed up things in SQL. But the step adding/updating the index might take some time to run.

Re: How to optimise two SQL inner joins and a union?

mkeintz — Wed, 01 Nov 2023 02:00:27 GMT

You can use DATA step with hash objects to avoid sorts.

If A, B, and C each have only one obs per record value (and B and C have no common RECORD values), then you can build a hash object (i.e. a "lookup table") keyed on RECORD, and containing TYPE, from dataset A. Then read B and C and keep if they match any RECORD key in the hash object:

data want;
  set b c;
  if _n_=1 then do;
    if 0 then set a;
    declare hash ha (dataset:'a');
      ha.definekey('record');
      ha.definedata(all:'Y');
      ha.definedone();
  end;
  if ha.find()=0;
run;

Re: How to optimise two SQL inner joins and a union?

Patrick — Tue, 31 Oct 2023 23:40:54 GMT

A data step with hash table lookups would likely perform best as long as you've got enough memory for the hash tables.

Below code that combines everything into a single SQL. This will still require implicit sorting for the joins. The main difference to your code is the union ALL and combining the data prior to the inner join - which avoids multiple sorts of table a.

Without the ALL keyword the union will dedup the result set which requires sorting. If you can be sure that you don't have overlaps in your tables b and c then such a dedup is not required.

proc sql;
  create table want as
  select 
    l.record,
    l.type,
    r.treat
  from
    a l
    inner join
    (
      select record, treat
      from b
      union all
      select record, treat
      from c
    ) r
    on l.record=r.record
  ;
quit;