Solved: Re: faster processing time for sql join

jenim514 · Posted 02-07-2017 02:45 PM

Hi all! Is there a more efficient way to write this sql join? The tables I am querrying are not that large, but the processing time is taking forever. Basically, I have unique ids in table A (patid_copy). I want to pull all records from b,c,d,e,f - that have matching id (patid_copy=patuniq). Thanks!

proc sql;
create table asthma.missing_sidr as
select a.*,b.*, c.*, d.*,e.*,f.*
from patid as a
,sidr.mdrsidrfy10 as b
,sidr.mdrsidrfy11 as c
,sidr.mdrsidrfy12 as d
,sidr.mdrsidrfy13 as e
,sidr.mdrsidrfy14 as f
where a.patid_copy= b.patuniq or
a.patid_copy= c.patuniq or
a.patid_copy= d.patuniq or
a.patid_copy= e.patuniq or
a.patid_copy= f.patuniq;
quit;

mkeintz · Posted 02-07-2017 09:36 PM

You apparently wnat all the records from b,c,d, or e whose PATID appears in A. And since you say they all have the same vars, I guess you don't need any data from A other than PATID. If so, then

data want;

if _n_=1 then do;

if 0 then set A;
declare hash h (dataset:'A (keep=patid)');

h.definekey('patid');

h.definedone();

end;

set b c d e;

rc=h.find();

Or perhaps you want the intersection:

data want;

if _n_=1 then do;

if 0 then set a b c d e;

declare hash ha (dataset:'a (keep=patid)'); ha.definekey('patid'); ha.definedone();

declare hash hb (dataset:'b (keep=patid)'); hb.definekey('patid'); hb.definedone();

declare hash hc (dataset:'c (keep=patid)'); hc.definekey('patid'); hc.definedone();

declare hash hd (dataset:'d (keep=patid)'); hd.definekey('patid'); hd.definedone();

declare hash he (dataset:'e (keep=patid)'); he.definekey('patid'); he.definedone();

end;

set b c d e;

rc=ha.find();

if rc=0 then rc=hb.find();

if rc=0 then rc=hc.find();

if rc=0 then rc=hd.find();

if rc=0 then rc=he.find();

if rc=0;

run;

Note this does not "join" B,C,D, and E. I think you intended an OUTE UNION of them, filtered for PATID found in A.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Kurt_Bremser · Posted 02-07-2017 02:53 PM

What is the relationship for the key variables? Are those unique in all datasets, or do you have m:n relationships?

Depending on that, I'd sort and use a data step merge.

It is also important to know which variables are present in the datasets. Variables common in several datasets might cause lost data.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

jenim514 · Posted 02-07-2017 03:00 PM

all the variables in datasets are the same. so I should stack the tables in a data step and then pull out the observations needed using sql?

ballardw · Posted 02-07-2017 03:34 PM

AS written such as:

from patid as a
,sidr.mdrsidrfy10 as b

This does a cartesian join or every record in Patid with every record in sidr.mdrsidrfy10 .

Each added data set is crossed with every single other record of the others.

So you have a*b*c*d*e*f numbers of records, if each of those data sets has 100 records that mean you have 1E12 generated records. Then each is compared to the where criteria to see if the result is kept.

So combining the mdrsidrfy sets with a data step Set statement and then maybe:

Proc sql;

create table want as

select a.*, b.*

from patid as a left join combined as b

on a.a.patid_copy=b.a.patid_copy;

quit;

mkeintz · Posted 02-07-2017 09:36 PM

You apparently wnat all the records from b,c,d, or e whose PATID appears in A. And since you say they all have the same vars, I guess you don't need any data from A other than PATID. If so, then

data want;

if _n_=1 then do;

if 0 then set A;
declare hash h (dataset:'A (keep=patid)');

h.definekey('patid');

h.definedone();

end;

set b c d e;

rc=h.find();

Or perhaps you want the intersection:

data want;

if _n_=1 then do;

if 0 then set a b c d e;

declare hash ha (dataset:'a (keep=patid)'); ha.definekey('patid'); ha.definedone();

declare hash hb (dataset:'b (keep=patid)'); hb.definekey('patid'); hb.definedone();

declare hash hc (dataset:'c (keep=patid)'); hc.definekey('patid'); hc.definedone();

declare hash hd (dataset:'d (keep=patid)'); hd.definekey('patid'); hd.definedone();

declare hash he (dataset:'e (keep=patid)'); he.definekey('patid'); he.definedone();

end;

set b c d e;

rc=ha.find();

if rc=0 then rc=hb.find();

if rc=0 then rc=hc.find();

if rc=0 then rc=hd.find();

if rc=0 then rc=he.find();

if rc=0;

run;

Note this does not "join" B,C,D, and E. I think you intended an OUTE UNION of them, filtered for PATID found in A.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jenim514 · Posted 02-09-2017 03:10 PM

@mkeintz The first hash work great and quick!! Thank you!

anoopmohandas7 · Posted 02-07-2017 03:39 PM

An inner join should process it faster.

proc sql;
create table  asthma.missing_sidr as
select a.*,b.*, c.*, d.*,e.*,f.*
from  patid as a 
inner join  sidr.mdrsidrfy10 as b on a.patid_copy= b.patuniq
inner join sidr.mdrsidrfy11 as c on a.patid_copy= c.patuniq
inner join sidr.mdrsidrfy12 as d on a.patid_copy= d.patuniq
inner join sidr.mdrsidrfy13 as e on a.patid_copy= e.patuniq
inner join sidr.mdrsidrfy14 as f on .patid_copy= f.patuniq;
quit ;

Astounding · Posted 02-07-2017 04:44 PM

The SQL experts out there will need to fix this, but I think if you rely on no patients appearing in multiple data sets you can use something like this.

proc sql;

create table want as

select * from sidr.mdrsidrfy10

union

select * from sidr.mdrsidrfy11

union

select * from sidr.mdrsidrfy12

union

select * from sidr.mdrsidrfy13

where patuniq in (select patid_copy from patid);

quit;

Classroom Training Available!