BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Yegen
Pyrite | Level 9

Disclaimer: Thanks to @Ksharp for providing extensive help on this issue, the speed of the code increased dramatically (saved several hours). @mkeintz has also provided an alternative solution using PROC SQL. If one wants to use PROC SQL, please see @mkeintz's solution (with the having clause) or @Ksharp's solution. Given that @Ksharp answered first, I had to accept his first. Both thanks so much to both of you!


Question starts here:

 

I am trying to find an alternative way to write the following code by not relying on in:

 

 

 

proc sql;
	create table work.sample1 as
	select a.*, b.*
	from WORK.IMPORT2 as a 
	right join work.portfolio as b
	on a.mgrno=b.mgrno
	where a.fdate in(
		select max(fdate) as last_fdate
		from WORK.IMPORT2
		where a.mgrno=b.mgrno and fdate < Announcement);
quit;

 

The dataset "work.portfolio" contains entries as follows:

 

mgrnoannouncement
93859/16/09
628909/14/08


The dataset "WORK.IMPORT2" contains entries as follows:

 

mgrnofdatecusip
938512/31/0712345
938512/31/0812345
938512/31/0912345
938512/31/0723456
938512/31/0823456
938512/31/0734567
938512/31/0778911
938512/31/0878912
628909/13/0834567
628909/13/0712345
628909/13/1122345

 

 

I have the following two questions:

Q1: Is there an alternative (and more efficient) way of writing the code stated above without using IN( )? In particular, I am trying to select the most recent fdates that are before announcement.

Q2: How can I select the most recent fdates by cusip and mgrno such that fdate < announcement? In Q1, I am only concerned about the most recent last fdates by MGRNO, but here I also want to see how to break down this specification into MGRNO and CUSIP. 

Thanks in advance for any help / suggestions. 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If you have multiple date for the same mgrno. 

Try this one .

 

 

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
9385	9/16/08
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
9385	12/31/06	78912
9385	12/31/06	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno descending fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
data want;
 set temp(where=(want is not missing));
 by mgrno want fdate notsorted;
 if first.want then n=0;
 n+first.fdate;
 if n=1;
run;

View solution in original post

23 REPLIES 23
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, I have no test data to try this on, but maybe something like:

proc sql;
	create table work.sample1 as
	select a.*, b.*
	from WORK.IMPORT2 as a 
	right join work.portfolio as b
	on a.mgrno=b.mgrno
group by a.mgrno
having b.fdate=max(b.fdate); quit;
Yegen
Pyrite | Level 9

Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Post some test data in the form of a datastep, and what the output should look like.  

mkeintz
PROC Star

@Yegen wrote:

Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?


 

Include that date constraint prior to the "group by  ... having" clause:

 

proc sql;
  create table work.sample1 as
    select a.*, b.* from 
    WORK.IMPORT2 as a 
      right join
    work.portfolio as b
    on a.mgrno=b.mgrno
    where a.fdate<=b.announcement
    group by a.mgrno        having a.fdate=max(a.fdate);
quit;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Yegen
Pyrite | Level 9

@mkeintz, thanks for your suggestion. Smiley HappyI somehow missed to see your suggestion and just saw it. I had a very similar code before but it did not work. I gave your suggestion a try and it seems to work very well. Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right? 
  

    group by a.mgrno        having a.fdate=max(a.fdate);

I run the code on a subset of the dataset and the output looked correct. Before applying it to the larger dataset, I wanted to double check that the "having" clause is always run on the dataset that is created after the "where" clause (i.e., on the subset of the original data). 

 

mkeintz
PROC Star

@Yegen wrote:

Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right? 


 

I think so.  But, in the experimental dataset, move the where clause to after the "group by ... having".  I suspect you will get fewer obserations, or maybe zero observations (if each mgr had at least one fdate>announcement).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User

You did not post the output yet .

 

 

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno announcement;
run;
proc sort data=import2;
by mgrno fdate;
run;
data want;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
 
Yegen
Pyrite | Level 9

Thanks for posting the data code, @Ksharp.

The output data would need to look like:

mgrnofdatecusipannouncement
938512/31/08123459/16/09
938512/31/08234569/16/09
938512/31/08789129/16/09
628909/13/08345679/14/08


These are the rows that have a max(fdate) that is smaller than the announcement. 

Ksharp
Super User

OK. How about this one .

 

data portfolio;
infile cards expandtabs truncover;
input mgrno	announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385	9/16/09
62890	9/14/08
;
run;
data IMPORT2;
 infile cards expandtabs truncover;
input mgrno	fdate  : mmddyy10.	cusip;
format fdate mmddyy10.;
cards;
9385	12/31/07	12345
9385	12/31/08	12345
9385	12/31/09	12345
9385	12/31/07	23456
9385	12/31/08	23456
9385	12/31/07	34567
9385	12/31/07	78911
9385	12/31/08	78912
62890	9/13/08	34567
62890	9/13/07	12345
62890	9/13/11	22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
 by mgrno descending fdate;
 retain want;
 if first.mgrno then call missing(want);
 if ina then want=fdate;
 if inb;
format want mmddyy10.;
run;
data want;
 set temp(where=(want is not missing));
 by mgrno fdate notsorted;
 if first.mgrno then n=0;
 n+first.fdate;
 if n=1;
run;
Ksharp
Super User

if max(fdate) = announcement. 

 

You can reverse the position of these two tables.

set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);

-->

set  import2(in=inb) portfolio(rename=(announcement=fdate) in=ina);

 

Yegen
Pyrite | Level 9

@Ksharp, you just blow my mind with the two codes. They work magically well! Before testing it on the actual dataset (with close to a billion observations), I have two quick questions.

Q1: Could you please elaborate what you are doing in the following step (especially the set statement):

data temp;
 set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
by mgrno descending fdate; retain want; if first.mgrno then call missing(want); if ina then want=fdate; if inb; format want mmddyy10.; run;

By looking at the output, I can infer that you are adding the announcement dates to the fdates where the fdate < announcement date. Then in the last part of the code (not pasted in this reply, but is in your post), you are keeping only the observations that are the most recent fdates. 

Q2: What's the difference between reversing the position of the two tables? Which one would it be more efficient if there is a lot of redundant / unwanted data in the table that has the fdate (i.e., import2). 

Thanks again, I will give it a try on the actual dataset. 

Yegen
Pyrite | Level 9

@Ksharp would it also be possible to skip the following sorting part of the code: 

proc sort data=import2;
by mgrno descending fdate;
run;

The import2 dataset is the enormously large one and I am retrieving it from another server (so it's not on my SAS Studio or local disk). In fact, when I try to do so, I get the following error message:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 61         
 62         proc sort data=mbp.t21 force;
 63         by mgrno descending fdate;
 64         run;
 
 ERROR: User does not have appropriate authorization level for library MBP.
 NOTE: The SAS System stopped processing this step because of errors.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds
       
 65         
 66         
 67         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

 

Ksharp
Super User

NO. You can't . But if your data has been sorted by these two variables. You could try to skip these proc sort and see what happened.

 

Alternative way is making a copy and proc sort .

 

data copy ;

 set mbp.t21 ;

run;

proc sort data=copy............

Ksharp
Super User

1) SET + BY 

 is crossing insert obs according BY varibale.

if you delete "if inb;" you will see what happen.

 

E.X.

TableA

date

2010-01-02

2012-02-02

 

TableB

date

2010-01-04

2012-02-01

 

After SET+BY 

data want;

 set A B;

 by date;

run;

 

You will get:

2010-01-02  <--From A

2010-01-04

2012-02-01

2012-02-02 <--From B

 

2)fdate = announcement ,and you don't want this fdate as the max(most recent) fdate,

you could reverse the position of these two table.

E.X.

After SET+BY , you get this

2010-01-04

2012-02-02

2012-02-02 <--From Portfolio

 

if you don't want 2012-02-02 as the most recent fdate, change  position of these two table ,You will got:

2010-01-04

2012-02-02 <--From Portfolio

2012-02-02 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 23 replies
  • 3412 views
  • 10 likes
  • 5 in conversation