DATA Step, Macro, Functions and more

Is there an alternative to in( ) in Proc SQL?

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 110
Accepted Solution

Is there an alternative to in( ) in Proc SQL?

[ Edited ]

Disclaimer: Thanks to @Ksharp for providing extensive help on this issue, the speed of the code increased dramatically (saved several hours). @mkeintz has also provided an alternative solution using PROC SQL. If one wants to use PROC SQL, please see @mkeintz's solution (with the having clause) or @Ksharp's solution. Given that @Ksharp answered first, I had to accept his first. Both thanks so much to both of you!


Question starts here:

 

I am trying to find an alternative way to write the following code by not relying on in:

 

 

 

proc sql;
	create table work.sample1 as
	select a.*, b.*
	from WORK.IMPORT2 as a 
	right join work.portfolio as b
	on a.mgrno=b.mgrno
	where a.fdate in(
		select max(fdate) as last_fdate
		from WORK.IMPORT2
		where a.mgrno=b.mgrno and fdate < Announcement);
quit;

 

The dataset "work.portfolio" contains entries as follows:

 

mgrnoannouncement
93859/16/09
628909/14/08


The dataset "WORK.IMPORT2" contains entries as follows:

 

mgrnofdatecusip
938512/31/0712345
938512/31/0812345
938512/31/0912345
938512/31/0723456
938512/31/0823456
938512/31/0734567
938512/31/0778911
938512/31/0878912
628909/13/0834567
628909/13/0712345
628909/13/1122345

 

 

I have the following two questions:

Q1: Is there an alternative (and more efficient) way of writing the code stated above without using IN( )? In particular, I am trying to select the most recent fdates that are before announcement.

Q2: How can I select the most recent fdates by cusip and mgrno such that fdate < announcement? In Q1, I am only concerned about the most recent last fdates by MGRNO, but here I also want to see how to break down this specification into MGRNO and CUSIP. 

Thanks in advance for any help / suggestions. 


Accepted Solutions
Solution
‎06-01-2017 03:18 PM
Super User
Posts: 10,048

Re: Is there an alternative to in( ) in Proc SQL?

If you have multiple date for the same mgrno. 

Try this one .

 

 

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
9385	9/16/08
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
9385	12/31/06	78912
9385	12/31/06	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno descending fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
data want;
 set temp(where=(want is not missing));
 by mgrno want fdate notsorted;
 if first.want then n=0;
 n+first.fdate;
 if n=1;
run;

View solution in original post


All Replies
Super User
Super User
Posts: 7,997

Re: Is there an alternative to in( ) in Proc SQL?

Well, I have no test data to try this on, but maybe something like:

proc sql;
	create table work.sample1 as
	select a.*, b.*
	from WORK.IMPORT2 as a 
	right join work.portfolio as b
	on a.mgrno=b.mgrno
group by a.mgrno
having b.fdate=max(b.fdate); quit;
Frequent Contributor
Posts: 110

Re: Is there an alternative to in( ) in Proc SQL?

Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?

Super User
Super User
Posts: 7,997

Re: Is there an alternative to in( ) in Proc SQL?

Post some test data in the form of a datastep, and what the output should look like.  

Trusted Advisor
Posts: 1,022

Re: Is there an alternative to in( ) in Proc SQL?


Yegen wrote:

Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?


 

Include that date constraint prior to the "group by  ... having" clause:

 

proc sql;
  create table work.sample1 as
    select a.*, b.* from 
    WORK.IMPORT2 as a 
      right join
    work.portfolio as b
    on a.mgrno=b.mgrno
    where a.fdate<=b.announcement
    group by a.mgrno        having a.fdate=max(a.fdate);
quit;
Frequent Contributor
Posts: 110

Re: Is there an alternative to in( ) in Proc SQL?

@mkeintz, thanks for your suggestion. Smiley HappyI somehow missed to see your suggestion and just saw it. I had a very similar code before but it did not work. I gave your suggestion a try and it seems to work very well. Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right? 
  

    group by a.mgrno        having a.fdate=max(a.fdate);

I run the code on a subset of the dataset and the output looked correct. Before applying it to the larger dataset, I wanted to double check that the "having" clause is always run on the dataset that is created after the "where" clause (i.e., on the subset of the original data). 

 

Trusted Advisor
Posts: 1,022

Re: Is there an alternative to in( ) in Proc SQL?


Yegen wrote:

Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right? 


 

I think so.  But, in the experimental dataset, move the where clause to after the "group by ... having".  I suspect you will get fewer obserations, or maybe zero observations (if each mgr had at least one fdate>announcement).

Super User
Posts: 10,048

Re: Is there an alternative to in( ) in Proc SQL?

You did not post the output yet .

 

 

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno announcement;
run;
proc sort data=import2;
by mgrno fdate;
run;
data want;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
 
Frequent Contributor
Posts: 110

Re: Is there an alternative to in( ) in Proc SQL?

Thanks for posting the data code, @Ksharp.

The output data would need to look like:

mgrnofdatecusipannouncement
938512/31/08123459/16/09
938512/31/08234569/16/09
938512/31/08789129/16/09
628909/13/08345679/14/08


These are the rows that have a max(fdate) that is smaller than the announcement. 

Super User
Posts: 10,048

Re: Is there an alternative to in( ) in Proc SQL?

OK. How about this one .

 

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno descending fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
data want;
 set temp(where=(want is not missing));
 by mgrno fdate notsorted;
 if first.mgrno then n=0;
 n+first.fdate;
 if n=1;
run;
Super User
Posts: 10,048

Re: Is there an alternative to in( ) in Proc SQL?

if max(fdate) = announcement. 

 

You can reverse the position of these two tables.

set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);

-->

set  import2(in=inb) portfolio(rename=(announcement=fdate) in=ina);

 

Frequent Contributor
Posts: 110

Re: Is there an alternative to in( ) in Proc SQL?

@Ksharp, you just blow my mind with the two codes. They work magically well! Before testing it on the actual dataset (with close to a billion observations), I have two quick questions.

Q1: Could you please elaborate what you are doing in the following step (especially the set statement):

data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
by mgrno descending fdate; retain want; if first.mgrno then call missing(want); if ina then want=fdate; if inb; format want mmddyy10.; run;

By looking at the output, I can infer that you are adding the announcement dates to the fdates where the fdate < announcement date. Then in the last part of the code (not pasted in this reply, but is in your post), you are keeping only the observations that are the most recent fdates. 

Q2: What's the difference between reversing the position of the two tables? Which one would it be more efficient if there is a lot of redundant / unwanted data in the table that has the fdate (i.e., import2). 

Thanks again, I will give it a try on the actual dataset. 

Frequent Contributor
Posts: 110

Re: Is there an alternative to in( ) in Proc SQL?

[ Edited ]

@Ksharp would it also be possible to skip the following sorting part of the code: 

proc sort data=import2;
by mgrno descending fdate;
run;

The import2 dataset is the enormously large one and I am retrieving it from another server (so it's not on my SAS Studio or local disk). In fact, when I try to do so, I get the following error message:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 61         
 62         proc sort data=mbp.t21 force;
 63         by mgrno descending fdate;
 64         run;
 
 ERROR: User does not have appropriate authorization level for library MBP.
 NOTE: The SAS System stopped processing this step because of errors.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds
       
 65         
 66         
 67         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

 

Super User
Posts: 10,048

Re: Is there an alternative to in( ) in Proc SQL?

NO. You can't . But if your data has been sorted by these two variables. You could try to skip these proc sort and see what happened.

 

Alternative way is making a copy and proc sort .

 

data copy ;

 set mbp.t21 ;

run;

proc sort data=copy............

Super User
Posts: 10,048

Re: Is there an alternative to in( ) in Proc SQL?

1) SET + BY 

 is crossing insert obs according BY varibale.

if you delete "if inb;" you will see what happen.

 

E.X.

TableA

date

2010-01-02

2012-02-02

 

TableB

date

2010-01-04

2012-02-01

 

After SET+BY 

data want;

 set A B;

 by date;

run;

 

You will get:

2010-01-02  <--From A

2010-01-04

2012-02-01

2012-02-02 <--From B

 

2)fdate = announcement ,and you don't want this fdate as the max(most recent) fdate,

you could reverse the position of these two table.

E.X.

After SET+BY , you get this

2010-01-04

2012-02-02

2012-02-02 <--From Portfolio

 

if you don't want 2012-02-02 as the most recent fdate, change  position of these two table ,You will got:

2010-01-04

2012-02-02 <--From Portfolio

2012-02-02 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 23 replies
  • 236 views
  • 10 likes
  • 5 in conversation