Hi all! Is there a more efficient way to write this sql join? The tables I am querrying are not that large, but the processing time is taking forever. Basically, I have unique ids in table A (patid_copy). I want to pull all records from b,c,d,e,f - that have matching id (patid_copy=patuniq). Thanks!
proc sql;
create table asthma.missing_sidr as
select a.*,b.*, c.*, d.*,e.*,f.*
from patid as a
,sidr.mdrsidrfy10 as b
,sidr.mdrsidrfy11 as c
,sidr.mdrsidrfy12 as d
,sidr.mdrsidrfy13 as e
,sidr.mdrsidrfy14 as f
where a.patid_copy= b.patuniq or
a.patid_copy= c.patuniq or
a.patid_copy= d.patuniq or
a.patid_copy= e.patuniq or
a.patid_copy= f.patuniq;
quit;
You apparently wnat all the records from b,c,d, or e whose PATID appears in A. And since you say they all have the same vars, I guess you don't need any data from A other than PATID. If so, then
data want;
if _n_=1 then do;
if 0 then set A;
declare hash h (dataset:'A (keep=patid)');
h.definekey('patid');
h.definedone();
end;
set b c d e;
rc=h.find();
Or perhaps you want the intersection:
data want;
if _n_=1 then do;
if 0 then set a b c d e;
declare hash ha (dataset:'a (keep=patid)'); ha.definekey('patid'); ha.definedone();
declare hash hb (dataset:'b (keep=patid)'); hb.definekey('patid'); hb.definedone();
declare hash hc (dataset:'c (keep=patid)'); hc.definekey('patid'); hc.definedone();
declare hash hd (dataset:'d (keep=patid)'); hd.definekey('patid'); hd.definedone();
declare hash he (dataset:'e (keep=patid)'); he.definekey('patid'); he.definedone();
end;
set b c d e;
rc=ha.find();
if rc=0 then rc=hb.find();
if rc=0 then rc=hc.find();
if rc=0 then rc=hd.find();
if rc=0 then rc=he.find();
if rc=0;
run;
Note this does not "join" B,C,D, and E. I think you intended an OUTE UNION of them, filtered for PATID found in A.
What is the relationship for the key variables? Are those unique in all datasets, or do you have m:n relationships?
Depending on that, I'd sort and use a data step merge.
It is also important to know which variables are present in the datasets. Variables common in several datasets might cause lost data.
AS written such as:
from patid as a
,sidr.mdrsidrfy10 as b
This does a cartesian join or every record in Patid with every record in sidr.mdrsidrfy10 .
Each added data set is crossed with every single other record of the others.
So you have a*b*c*d*e*f numbers of records, if each of those data sets has 100 records that mean you have 1E12 generated records. Then each is compared to the where criteria to see if the result is kept.
So combining the mdrsidrfy sets with a data step Set statement and then maybe:
Proc sql;
create table want as
select a.*, b.*
from patid as a left join combined as b
on a.a.patid_copy=b.a.patid_copy;
quit;
You apparently wnat all the records from b,c,d, or e whose PATID appears in A. And since you say they all have the same vars, I guess you don't need any data from A other than PATID. If so, then
data want;
if _n_=1 then do;
if 0 then set A;
declare hash h (dataset:'A (keep=patid)');
h.definekey('patid');
h.definedone();
end;
set b c d e;
rc=h.find();
Or perhaps you want the intersection:
data want;
if _n_=1 then do;
if 0 then set a b c d e;
declare hash ha (dataset:'a (keep=patid)'); ha.definekey('patid'); ha.definedone();
declare hash hb (dataset:'b (keep=patid)'); hb.definekey('patid'); hb.definedone();
declare hash hc (dataset:'c (keep=patid)'); hc.definekey('patid'); hc.definedone();
declare hash hd (dataset:'d (keep=patid)'); hd.definekey('patid'); hd.definedone();
declare hash he (dataset:'e (keep=patid)'); he.definekey('patid'); he.definedone();
end;
set b c d e;
rc=ha.find();
if rc=0 then rc=hb.find();
if rc=0 then rc=hc.find();
if rc=0 then rc=hd.find();
if rc=0 then rc=he.find();
if rc=0;
run;
Note this does not "join" B,C,D, and E. I think you intended an OUTE UNION of them, filtered for PATID found in A.
An inner join should process it faster.
proc sql;
create table asthma.missing_sidr as
select a.*,b.*, c.*, d.*,e.*,f.*
from patid as a
inner join sidr.mdrsidrfy10 as b on a.patid_copy= b.patuniq
inner join sidr.mdrsidrfy11 as c on a.patid_copy= c.patuniq
inner join sidr.mdrsidrfy12 as d on a.patid_copy= d.patuniq
inner join sidr.mdrsidrfy13 as e on a.patid_copy= e.patuniq
inner join sidr.mdrsidrfy14 as f on .patid_copy= f.patuniq;
quit ;
The SQL experts out there will need to fix this, but I think if you rely on no patients appearing in multiple data sets you can use something like this.
proc sql;
create table want as
select * from sidr.mdrsidrfy10
union
select * from sidr.mdrsidrfy11
union
select * from sidr.mdrsidrfy12
union
select * from sidr.mdrsidrfy13
where patuniq in (select patid_copy from patid);
quit;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.