BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jenim514
Pyrite | Level 9

Hi!  Looking for some help for fixing my proc sql syntax... 

I have five tables representing a year (FY10-FY14) of health encounters for all patients during that specific year.

I need to identify the cohort of patients that recieved care in each of those five years.  So first health encounter should be in FY10 and last encounter should be in FY14 but the patient should also have encounters in FY12 and FY13.   This is my syntax used, but when I check the output at the end, the first encounter for many patients were not in FY10.  Any help is appreciated!!

proc sql;
create table CRG.ASTHMA_FY10to14_1 as
select a.*, b.*,c.*,d.*,e.*
from  CRG.ASTHMA_fy10_crg1 as a, CRG.ASTHMA_fy11_crg1  as b, CRG.ASTHMA_fy12_crg1 as c, CRG.ASTHMA_fy13_crg1 as d, CRG.ASTHMA_fy14_crg1 as e
where a.patientid=b.patientid=c.patientid=d.patientid=e.patientid;
quit;
*Identify patients that are in all fiscal years. 43152 rows and 22 columns;

proc sql;
create table CRG.ASTHMA_FY10to14_All_Cohort as
select a.*, b.*
from CRG.ASTHMA_FY10to14_1 as a, asthma.FY10to14_Asthma_IPOP_PDTS_1 as b
where a.patientid=b.patientid;
quit;
*Join the patientid with the full medical record.  NOTE: Table CRG.ASTHMA_FY10TO14_ALL_COHORT created, with 4567056 rows and 113 columns;

data first_enc_check;
set CRG.ASTHMA_FY10to14_All_Cohort;
by patientid;
if first.patientid then output;
run;
*checked to make sure first.patientid was in FY10...and not all  were.
1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

I would prefer to use set operations:

 

proc sql;

create table everyYear as
select patientId from CRG.ASTHMA_fy10_crg1
intersect
select patientId from CRG.ASTHMA_fy11_crg1
intersect
select patientId from CRG.ASTHMA_fy12_crg1
intersect
select patientId from CRG.ASTHMA_fy13_crg1
intersect
select patientId from CRG.ASTHMA_fy14_crg1;

create table CRG.ASTHMA_FY10to14_1 as
select * from CRG.ASTHMA_fy10_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy11_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy12_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy13_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy14_crg1 where patientId in (select patientId from everyYear)
order by patientId, myTimeVariable;

drop table everyYear;

quit;
PG

View solution in original post

10 REPLIES 10
Reeza
Super User

Use an inner join. 

 

I don't think you can list your where condition that way either you need a bunch of AND in between your equal signs. 

Here's a sketch of how it should look:

 

From table1 as a
Inner join table2 as b
On a.id=b.id
Inner join table3 as c
On a.id=c.id
....
Where *any other condition *;
Quit;

There's a post from a SAS trainer that explains why this is actually faster than your query which is a cross join that is then filtered. 

 

If if all this is gibberish try reading up on join types 🤓

ChrisNZ
Tourmaline | Level 20

@Reeza Any link for the SAS trainer post? I can't find it.

While it seems intuitive, my own quick benchmark can't confirm this.

 

data TAB1;

do I=1 to 100e6;

output;

end;

run;

data TAB2;

do I=1 to 200e6 by 2;

output;

end;

run;

create table OUT as

select tab1.*

from TAB1

     , TAB2

where tab1.I=tab2.I

   and tab1.I < 50e6;

real time           20.02 seconds

user cpu time       26.50 seconds

system cpu time     2.99 seconds

memory             1060985.07k

OS Memory           1084668.00k

Step Count      576   Switch Count 82

create table OUT as

select tab1.*

from TAB1

       inner join

     TAB2

       on tab1.I=tab2.I

       and tab1.I < 50e6;

real time           20.49 seconds

user cpu time       27.08 seconds

system cpu time     2.60 seconds

memory             1060985.95k

OS Memory           1084668.00k

Step Count       578 Switch Count 92

create table OUT as

select tab1.*

from TAB1

       inner join

     TAB2

       on tab1.I=tab2.I

where tab1.I < 50e6;

real time           20.24 seconds

user cpu time       27.42 seconds

system cpu time     2.76 seconds

memory             1060986.07k

OS Memory           1084668.00k

Step Count      580   Switch Count 92

 

Note that the optimizer could do a much better job; faster time and much less memory when using a dataset option:

 

create table OUT as

select tab1.*

from TAB1(where=(I < 50e6))

   , TAB2

where tab1.I=tab2.I

 

real time           20:10.43

user cpu time       16:54.70

system cpu time     3:15.51

memory             12688.53k

OS Memory           37428.00k

Step Count       577 Switch Count 82

create table OUT as

select tab1.*

from TAB1(where=(I < 50e6))

       inner join

     TAB2

       on tab1.I=tab2.I;

real time           20:00.49

user cpu time       16:47.42

system cpu time     3:12.92

memory             12688.53k

OS Memory           37428.00k

Step Count        579 Switch Count 90

 

The intersect and the subquery syntaxes are slower:

 

create table OUT as

   select tab1.*

   from TAB1

   where tab1.I < 50e6

intersect

   select tab2.*

   from TAB2 ;

real time           28.42 seconds

user cpu time       49.32 seconds

system cpu time     6.20 seconds

memory             1060660.10k

OS Memory           1083768.00k

Step Count      583 Switch Count 125

create table OUT as

select tab1.*

from TAB1

where I in(select I from TAB2)

and tab1.I < 50e6; 

real time           47:44.37

user cpu time       11:18.49

system cpu time     36:24.32

memory             5444475.43k

OS Memory           6238204.00k

Step Count      581 Switch Count 3794

 

More performance and benchmark fun in 

  

Reeza
Super User

Maybe I misunderstood the post. 

 

I prefer an explicit join for clarity when reviewing code 

 

Here's the post

http://blogs.sas.com/content/sastraining/2013/02/04/a-database-professionals-best-friend-2/

Reeza
Super User

If you did just want to fix your where condition it would be :

 

Where a.id=b.id and a.id=c.id and a.id=d.id and a.id=e.id;
PGStats
Opal | Level 21

I would prefer to use set operations:

 

proc sql;

create table everyYear as
select patientId from CRG.ASTHMA_fy10_crg1
intersect
select patientId from CRG.ASTHMA_fy11_crg1
intersect
select patientId from CRG.ASTHMA_fy12_crg1
intersect
select patientId from CRG.ASTHMA_fy13_crg1
intersect
select patientId from CRG.ASTHMA_fy14_crg1;

create table CRG.ASTHMA_FY10to14_1 as
select * from CRG.ASTHMA_fy10_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy11_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy12_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy13_crg1 where patientId in (select patientId from everyYear)
union all corresponding
select * from CRG.ASTHMA_fy14_crg1 where patientId in (select patientId from everyYear)
order by patientId, myTimeVariable;

drop table everyYear;

quit;
PG
jenim514
Pyrite | Level 9

@PGStatsWhat type of union or join would allow me to keep all the variables found in each CRG.asthma table (in your second step)? Each table has unique variables assigned to that Fiscal Year. It seems the Union All just keeps those variables that are the same in each table. Thanks!!

PGStats
Opal | Level 21

Replace union all corresponding with outer union corresponding

PG
jenim514
Pyrite | Level 9

@PGStats one more thing!  The output is long, and I need to make wide...i.e.
 one Patientid and all corresponding FY variables on one line.  

my current output looks like
datalines;


ID 10_CRG  11_CRG 12_CRG 13_CRG 14_CRG
1 2
1 . 4
1 . . 3
1 . . . 2
1 . . . . 3

 

want to look like

 

1 2 4 3 2 3

 

So, a transpose I think, but i have never used transpose and it is transpoing weird output.

 

 

PGStats
Opal | Level 21

Why wide? Why not

 

ID YEAR CRG
1 2010 2
1 2011 4
1 2012 3
1 2013 2
1 2014 3

 

you would save yourself so much trouble.

 

PG
jenim514
Pyrite | Level 9

@PGStats  Can you gelp me get to the result you recomended?  I can work with a table that is formatted the way you suggested.

 

My transpose attempt:

 

proc transpose data=CRG.ASTHMA_FY10to14_F out=long prefix=CRG;
by patientid;
var concurrent_fy10-concurrent_fy14;
run;

 

resulted in this (patientid hidden by me):

 

Obs _NAME_ CRG1 CRG2 CRG3 CRG4 CRG5
1 CONCURRENT_FY10 10000 10000      
2 CONCURRENT_FY11     51381 51381  
3 CONCURRENT_FY12         51381
4 CONCURRENT_FY13          
5 CONCURRENT_FY14          
             
             
             
             
             

 

 

How can I get to this type of table:

 

ID YEAR CRG
1 2010 10000

1 2010 10000
1 2011  51381
1 2011  51381
1 2012  51381
1 2013  .

1 2014  .

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1415 views
  • 0 likes
  • 4 in conversation