Solved: Re: Is there an alternative to in( ) in Proc SQL?

Yegen · Posted 05-31-2017 06:13 AM

Disclaimer: Thanks to @Ksharp for providing extensive help on this issue, the speed of the code increased dramatically (saved several hours). @mkeintz has also provided an alternative solution using PROC SQL. If one wants to use PROC SQL, please see @mkeintz's solution (with the having clause) or @Ksharp's solution. Given that @Ksharp answered first, I had to accept his first. Both thanks so much to both of you!

Question starts here:

I am trying to find an alternative way to write the following code by not relying on in:

proc sql;
	create table work.sample1 as
	select a.*, b.*
	from WORK.IMPORT2 as a 
	right join work.portfolio as b
	on a.mgrno=b.mgrno
	where a.fdate in(
		select max(fdate) as last_fdate
		from WORK.IMPORT2
		where a.mgrno=b.mgrno and fdate < Announcement);
quit;

The dataset "work.portfolio" contains entries as follows:

mgrno	announcement
9385	9/16/09
62890	9/14/08

The dataset "WORK.IMPORT2" contains entries as follows:

mgrno	fdate	cusip
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345

I have the following two questions:

Q1: Is there an alternative (and more efficient) way of writing the code stated above without using IN( )? In particular, I am trying to select the most recent fdates that are before announcement.

Q2: How can I select the most recent fdates by cusip and mgrno such that fdate < announcement? In Q1, I am only concerned about the most recent last fdates by MGRNO, but here I also want to see how to break down this specification into MGRNO and CUSIP.

Thanks in advance for any help / suggestions.

Ksharp · Posted 06-01-2017 08:37 AM

If you have multiple date for the same mgrno.

Try this one .

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
9385	9/16/08
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
9385	12/31/06	78912
9385	12/31/06	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno descending fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
data want;
 set temp(where=(want is not missing));
 by mgrno want fdate notsorted;
 if first.want then n=0;
 n+first.fdate;
 if n=1;
run;

View solution in original post

RW9 · Posted 05-31-2017 06:45 AM

Well, I have no test data to try this on, but maybe something like:

proc sql;
	create table work.sample1 as
	select a.*, b.*
	from WORK.IMPORT2 as a 
	right join work.portfolio as b
	on a.mgrno=b.mgrno
        group by a.mgrno
        having b.fdate=max(b.fdate);
quit;

Yegen · Posted 05-31-2017 06:58 AM

Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?

RW9 · Posted 05-31-2017 07:58 AM

Post some test data in the form of a datastep, and what the output should look like.

mkeintz · Posted 05-31-2017 09:54 AM

@Yegen wrote:

Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?

Include that date constraint prior to the "group by ... having" clause:

proc sql;
  create table work.sample1 as
    select a.*, b.* from 
    WORK.IMPORT2 as a 
      right join
    work.portfolio as b
    on a.mgrno=b.mgrno
    where a.fdate<=b.announcement
    group by a.mgrno        having a.fdate=max(a.fdate);
quit;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Yegen · Posted 05-31-2017 11:17 AM

@mkeintz, thanks for your suggestion. I somehow missed to see your suggestion and just saw it. I had a very similar code before but it did not work. I gave your suggestion a try and it seems to work very well. Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right?

    group by a.mgrno        having a.fdate=max(a.fdate);

I run the code on a subset of the dataset and the output looked correct. Before applying it to the larger dataset, I wanted to double check that the "having" clause is always run on the dataset that is created after the "where" clause (i.e., on the subset of the original data).

mkeintz · Posted 05-31-2017 11:29 AM

@Yegen wrote:

Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right?

I think so. But, in the experimental dataset, move the where clause to after the "group by ... having". I suspect you will get fewer obserations, or maybe zero observations (if each mgr had at least one fdate>announcement).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 05-31-2017 08:28 AM

You did not post the output yet .

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno announcement;
run;
proc sort data=import2;
by mgrno fdate;
run;
data want;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;

Yegen · Posted 05-31-2017 08:40 AM

Thanks for posting the data code, @Ksharp.

The output data would need to look like:

mgrno	fdate	cusip	announcement
9385	12/31/08	12345	9/16/09
9385	12/31/08	23456	9/16/09
9385	12/31/08	78912	9/16/09
62890	9/13/08	34567	9/14/08

These are the rows that have a max(fdate) that is smaller than the announcement.

Ksharp · Posted 05-31-2017 09:23 AM

OK. How about this one .

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno descending fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
data want;
 set temp(where=(want is not missing));
 by mgrno fdate notsorted;
 if first.mgrno then n=0;
 n+first.fdate;
 if n=1;
run;

Ksharp · Posted 05-31-2017 09:39 AM

if max(fdate) = announcement.

You can reverse the position of these two tables.

set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);

-->

set  import2(in=inb) portfolio(rename=(announcement=fdate) in=ina);

Yegen · Posted 05-31-2017 10:53 AM

@Ksharp, you just blow my mind with the two codes. They work magically well! Before testing it on the actual dataset (with close to a billion observations), I have two quick questions.

Q1: Could you please elaborate what you are doing in the following step (especially the set statement):

data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno descending fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;

By looking at the output, I can infer that you are adding the announcement dates to the fdates where the fdate < announcement date. Then in the last part of the code (not pasted in this reply, but is in your post), you are keeping only the observations that are the most recent fdates.

Q2: What's the difference between reversing the position of the two tables? Which one would it be more efficient if there is a lot of redundant / unwanted data in the table that has the fdate (i.e., import2).

Thanks again, I will give it a try on the actual dataset.

Yegen · Posted 05-31-2017 11:12 AM

@Ksharp would it also be possible to skip the following sorting part of the code:

proc sort data=import2;
by mgrno descending fdate;
run;

The import2 dataset is the enormously large one and I am retrieving it from another server (so it's not on my SAS Studio or local disk). In fact, when I try to do so, I get the following error message:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 61         
 62         proc sort data=mbp.t21 force;
 63         by mgrno descending fdate;
 64         run;
 
 ERROR: User does not have appropriate authorization level for library MBP.
 NOTE: The SAS System stopped processing this step because of errors.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds
       
 65         
 66         
 67         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

Ksharp · Posted 06-01-2017 08:29 AM

NO. You can't . But if your data has been sorted by these two variables. You could try to skip these proc sort and see what happened.

Alternative way is making a copy and proc sort .

data copy ;

set mbp.t21 ;

run;

proc sort data=copy............

Ksharp · Posted 06-01-2017 08:26 AM

1) SET + BY

is crossing insert obs according BY varibale.

if you delete "if inb;" you will see what happen.

E.X.

TableA

date

2010-01-02

2012-02-02

TableB

date

2010-01-04

2012-02-01

After SET+BY

data want;

set A B;

by date;

run;

You will get:

2010-01-02 <--From A

2010-01-04

2012-02-01

2012-02-02 <--From B

2)fdate = announcement ,and you don't want this fdate as the max(most recent) fdate,

you could reverse the position of these two table.

E.X.

After SET+BY , you get this

2010-01-04

2012-02-02

2012-02-02 <--From Portfolio

if you don't want 2012-02-02 as the most recent fdate, change position of these two table ,You will got:

2010-01-04

2012-02-02 <--From Portfolio

2012-02-02

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away