Re: How to optimise two SQL inner joins and a union?

al15 · Posted 10-30-2023 02:07 PM

Hi,

I have 3 datasets (with millions of observations). The masterfile contains a variable "record" that links it with the other datasets. The other 2 datasets contain the same variables.

I can get my rough code to work, but ideally the merge would occur in one SQL step, if possible. Even in the individual SQL steps I receive a warning "WARNING: Variable record already exists on file WORK.WANT1." & "WARNING: Variable record already exists on file WORK.WANT2."

I would really appreciate help optimising this code as I am new to SQL.

Thank you!

data a;
input record type;
datalines;
1 5
2 8
3 2
4 1
5 100
6 21
;
run;

data b;
input record treat;
datalines;
1 8
5 9
7 12
8 200
;
run;

data c;
input record treat;
datalines;
2 10
3 8
9 11
10 200
;
run;

data want;
input record type treat;
datalines;
1 5 8
2 8 10
3 2 8
5 100 9
;
run;

/*This rough code generates the dataset that I want*/
proc sql;
create table want1 as
select * from a as A
inner join b
on a.record=b.record
;
quit;

proc sql;
create table want2 as
select * from a as A
inner join c
on a.record=c.record
;
quit;

proc sql;
    create table want_test as
    select * from want1
    union
    select * from want2;
quit;

Kurt_Bremser · Posted 10-30-2023 02:29 PM

DO NOT USE THE ASTERISK!

When doing joins, always use an explicit list of variables, and use table aliases to make sure from which dataset a variable is taken.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 10-30-2023 07:36 PM

If each dataset is sorted by RECORD, and each dataset has no more than 1 record per RECORD, then you have a much simpler solution available in the DATA step:

data want;
  merge a (in=ina)  b (in=inb)  c (in=inc);
  by record;
  if ina=1  and (inb=1 or inc=1);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

al15 · Posted 10-31-2023 04:26 AM

Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.

ballardw · Posted 10-31-2023 10:56 AM

@al15 wrote:

Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.

SQL still has to order the data for any join on criteria. Using Proc SQL when the data isn't sorted just means that SQL runs slower.

You might want to investigate use of INDEXes with your variables used for matching if used frequently as the indexes may speed up things in SQL. But the step adding/updating the index might take some time to run.

mkeintz · Posted 10-31-2023 01:19 PM

You can use DATA step with hash objects to avoid sorts.

If A, B, and C each have only one obs per record value (and B and C have no common RECORD values), then you can build a hash object (i.e. a "lookup table") keyed on RECORD, and containing TYPE, from dataset A. Then read B and C and keep if they match any RECORD key in the hash object:

data want;
  set b c;
  if _n_=1 then do;
    if 0 then set a;
    declare hash ha (dataset:'a');
      ha.definekey('record');
      ha.definedata(all:'Y');
      ha.definedone();
  end;
  if ha.find()=0;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 10-31-2023 07:37 PM

A data step with hash table lookups would likely perform best as long as you've got enough memory for the hash tables.

Below code that combines everything into a single SQL. This will still require implicit sorting for the joins. The main difference to your code is the union ALL and combining the data prior to the inner join - which avoids multiple sorts of table a.

Without the ALL keyword the union will dedup the result set which requires sorting. If you can be sure that you don't have overlaps in your tables b and c then such a dedup is not required.

proc sql;
  create table want as
  select 
    l.record,
    l.type,
    r.treat
  from
    a l
    inner join
    (
      select record, treat
      from b
      union all
      select record, treat
      from c
    ) r
    on l.record=r.record
  ;
quit;

How to optimise two SQL inner joins and a union?