Disclaimer: Thanks to @Ksharp for providing extensive help on this issue, the speed of the code increased dramatically (saved several hours). @mkeintz has also provided an alternative solution using PROC SQL. If one wants to use PROC SQL, please see @mkeintz's solution (with the having clause) or @Ksharp's solution. Given that @Ksharp answered first, I had to accept his first. Both thanks so much to both of you!
Question starts here:
I am trying to find an alternative way to write the following code by not relying on in:
proc sql;
create table work.sample1 as
select a.*, b.*
from WORK.IMPORT2 as a
right join work.portfolio as b
on a.mgrno=b.mgrno
where a.fdate in(
select max(fdate) as last_fdate
from WORK.IMPORT2
where a.mgrno=b.mgrno and fdate < Announcement);
quit;
The dataset "work.portfolio" contains entries as follows:
mgrno | announcement |
9385 | 9/16/09 |
62890 | 9/14/08 |
The dataset "WORK.IMPORT2" contains entries as follows:
mgrno | fdate | cusip |
9385 | 12/31/07 | 12345 |
9385 | 12/31/08 | 12345 |
9385 | 12/31/09 | 12345 |
9385 | 12/31/07 | 23456 |
9385 | 12/31/08 | 23456 |
9385 | 12/31/07 | 34567 |
9385 | 12/31/07 | 78911 |
9385 | 12/31/08 | 78912 |
62890 | 9/13/08 | 34567 |
62890 | 9/13/07 | 12345 |
62890 | 9/13/11 | 22345 |
I have the following two questions:
Q1: Is there an alternative (and more efficient) way of writing the code stated above without using IN( )? In particular, I am trying to select the most recent fdates that are before announcement.
Q2: How can I select the most recent fdates by cusip and mgrno such that fdate < announcement? In Q1, I am only concerned about the most recent last fdates by MGRNO, but here I also want to see how to break down this specification into MGRNO and CUSIP.
Thanks in advance for any help / suggestions.
If you have multiple date for the same mgrno.
Try this one .
data portfolio;
infile cards expandtabs truncover;
input mgrno announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385 9/16/09
9385 9/16/08
62890 9/14/08
;
run;
data IMPORT2;
infile cards expandtabs truncover;
input mgrno fdate : mmddyy10. cusip;
format fdate mmddyy10.;
cards;
9385 12/31/07 12345
9385 12/31/08 12345
9385 12/31/09 12345
9385 12/31/07 23456
9385 12/31/08 23456
9385 12/31/07 34567
9385 12/31/07 78911
9385 12/31/08 78912
9385 12/31/06 78912
9385 12/31/06 78912
62890 9/13/08 34567
62890 9/13/07 12345
62890 9/13/11 22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
by mgrno descending fdate;
retain want;
if first.mgrno then call missing(want);
if ina then want=fdate;
if inb;
format want mmddyy10.;
run;
data want;
set temp(where=(want is not missing));
by mgrno want fdate notsorted;
if first.want then n=0;
n+first.fdate;
if n=1;
run;
Well, I have no test data to try this on, but maybe something like:
proc sql;
create table work.sample1 as
select a.*, b.*
from WORK.IMPORT2 as a
right join work.portfolio as b
on a.mgrno=b.mgrno
group by a.mgrno
having b.fdate=max(b.fdate);
quit;
Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?
Post some test data in the form of a datastep, and what the output should look like.
@Yegen wrote:
Thanks, @RW9. With your code, however, I cannot compare the max(fdate) with the announcement. I was trying to use "having" too, but I got stuck when I tried to select the max(fdate) such that fdate<announcement. In other words, how would it be possible to use "having" so that I can select the most recent fdate [i.e., max(fdate)] from the fdates that are less than announcement?
Include that date constraint prior to the "group by ... having" clause:
proc sql;
create table work.sample1 as
select a.*, b.* from
WORK.IMPORT2 as a
right join
work.portfolio as b
on a.mgrno=b.mgrno
where a.fdate<=b.announcement
group by a.mgrno having a.fdate=max(a.fdate);
quit;
@mkeintz, thanks for your suggestion. I somehow missed to see your suggestion and just saw it. I had a very similar code before but it did not work. I gave your suggestion a try and it seems to work very well. Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right?
group by a.mgrno having a.fdate=max(a.fdate);
I run the code on a subset of the dataset and the output looked correct. Before applying it to the larger dataset, I wanted to double check that the "having" clause is always run on the dataset that is created after the "where" clause (i.e., on the subset of the original data).
@Yegen wrote:
Just to ask a clarification question, the following part of the code is executed after removing all rows that have fdate < announcement, right?
I think so. But, in the experimental dataset, move the where clause to after the "group by ... having". I suspect you will get fewer obserations, or maybe zero observations (if each mgr had at least one fdate>announcement).
You did not post the output yet .
data portfolio;
infile cards expandtabs truncover;
input mgrno announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385 9/16/09
62890 9/14/08
;
run;
data IMPORT2;
infile cards expandtabs truncover;
input mgrno fdate : mmddyy10. cusip;
format fdate mmddyy10.;
cards;
9385 12/31/07 12345
9385 12/31/08 12345
9385 12/31/09 12345
9385 12/31/07 23456
9385 12/31/08 23456
9385 12/31/07 34567
9385 12/31/07 78911
9385 12/31/08 78912
62890 9/13/08 34567
62890 9/13/07 12345
62890 9/13/11 22345
;
run;
proc sort data=portfolio;
by mgrno announcement;
run;
proc sort data=import2;
by mgrno fdate;
run;
data want;
set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
by mgrno fdate;
retain want;
if first.mgrno then call missing(want);
if ina then want=fdate;
if inb;
format want mmddyy10.;
run;
Thanks for posting the data code, @Ksharp.
The output data would need to look like:
mgrno | fdate | cusip | announcement |
9385 | 12/31/08 | 12345 | 9/16/09 |
9385 | 12/31/08 | 23456 | 9/16/09 |
9385 | 12/31/08 | 78912 | 9/16/09 |
62890 | 9/13/08 | 34567 | 9/14/08 |
These are the rows that have a max(fdate) that is smaller than the announcement.
OK. How about this one .
data portfolio;
infile cards expandtabs truncover;
input mgrno announcement : mmddyy10.;
format announcement mmddyy10.;
cards;
9385 9/16/09
62890 9/14/08
;
run;
data IMPORT2;
infile cards expandtabs truncover;
input mgrno fdate : mmddyy10. cusip;
format fdate mmddyy10.;
cards;
9385 12/31/07 12345
9385 12/31/08 12345
9385 12/31/09 12345
9385 12/31/07 23456
9385 12/31/08 23456
9385 12/31/07 34567
9385 12/31/07 78911
9385 12/31/08 78912
62890 9/13/08 34567
62890 9/13/07 12345
62890 9/13/11 22345
;
run;
proc sort data=portfolio;
by mgrno descending announcement;
run;
proc sort data=import2;
by mgrno descending fdate;
run;
data temp;
set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
by mgrno descending fdate;
retain want;
if first.mgrno then call missing(want);
if ina then want=fdate;
if inb;
format want mmddyy10.;
run;
data want;
set temp(where=(want is not missing));
by mgrno fdate notsorted;
if first.mgrno then n=0;
n+first.fdate;
if n=1;
run;
if max(fdate) = announcement.
You can reverse the position of these two tables.
set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
-->
set import2(in=inb) portfolio(rename=(announcement=fdate) in=ina);
@Ksharp, you just blow my mind with the two codes. They work magically well! Before testing it on the actual dataset (with close to a billion observations), I have two quick questions.
Q1: Could you please elaborate what you are doing in the following step (especially the set statement):
data temp;
set portfolio(rename=(announcement=fdate) in=ina) import2(in=inb);
by mgrno descending fdate;
retain want;
if first.mgrno then call missing(want);
if ina then want=fdate;
if inb;
format want mmddyy10.;
run;
By looking at the output, I can infer that you are adding the announcement dates to the fdates where the fdate < announcement date. Then in the last part of the code (not pasted in this reply, but is in your post), you are keeping only the observations that are the most recent fdates.
Q2: What's the difference between reversing the position of the two tables? Which one would it be more efficient if there is a lot of redundant / unwanted data in the table that has the fdate (i.e., import2).
Thanks again, I will give it a try on the actual dataset.
@Ksharp would it also be possible to skip the following sorting part of the code:
proc sort data=import2;
by mgrno descending fdate;
run;
The import2 dataset is the enormously large one and I am retrieving it from another server (so it's not on my SAS Studio or local disk). In fact, when I try to do so, I get the following error message:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
61
62 proc sort data=mbp.t21 force;
63 by mgrno descending fdate;
64 run;
ERROR: User does not have appropriate authorization level for library MBP.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
65
66
67 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
NO. You can't . But if your data has been sorted by these two variables. You could try to skip these proc sort and see what happened.
Alternative way is making a copy and proc sort .
data copy ;
set mbp.t21 ;
run;
proc sort data=copy............
1) SET + BY
is crossing insert obs according BY varibale.
if you delete "if inb;" you will see what happen.
E.X.
TableA
date
2010-01-02
2012-02-02
TableB
date
2010-01-04
2012-02-01
After SET+BY
data want;
set A B;
by date;
run;
You will get:
2010-01-02 <--From A
2010-01-04
2012-02-01
2012-02-02 <--From B
2)fdate = announcement ,and you don't want this fdate as the max(most recent) fdate,
you could reverse the position of these two table.
E.X.
After SET+BY , you get this
2010-01-04
2012-02-02
2012-02-02 <--From Portfolio
if you don't want 2012-02-02 as the most recent fdate, change position of these two table ,You will got:
2010-01-04
2012-02-02 <--From Portfolio
2012-02-02
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.