I am looking at a large dataset of payment values and want to keep the last 20 records per account - How would i do this? Probably very simple but cannot figure this out. Example code for keeping last 1 record per account is as follows.
Data TransRef;
Set Transactions;
by Accno TranDate;
If Last.Accno;
run;
Thanks
I have tried OBS and retain but no luck
I assume your data are already sorted by ACCNO/TRANDATE. True, sorting by descending date (within each ACCNO) would allow you to take the first 20 of each ACCNO, a relatively straightforward process. But sorting is expensive, and you might need to re-sort the results by ascending date.
EDIT: Please see revised code at the end of this note.
So here is a two pass process (passing through each ACCNO twice, but in the original order:
data transfer (drop=_:);
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n2>= _n1-20;
run;
Even though you pass through the entire dataset twice, it won't require twice the data-reading resources, because the SET statement interleaves the firstpass for a given ACCNO with the secondpass for the same ACCNO. Since the most recently read observations (from firstpass) will be cached in memory, they will therefore be available to the secondpass without rereading data from the storage device.
EDIT:
I realized the subsetting IF was providing 21 (not 20) observations for each ACCNOW. The code should have used
_n2 > _n1-20
instead of
_n2 >= _n1-20
But there is also a second problem. The code above works fine for each ACCNOW that has 20 or more observations. But it also outputs all observations for any ACCNOW group with fewer than 20 obs. If you want to exclude such small groups, then:
data transfer;
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
if first.accnow then call missing(_n1,_n2);
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n1>=20 and _n2 > _n1-20 ;
run;
To keep the LAST 20 observations when ordered in that way you would need to know how many observations there were for each BY value. One way would be to make two passes thru the data for each BY group. One to count. One to actually output.
data TransRef;
do nrows=1 by 1 until(last.accno);
set Transactions;
by Accno TranDate;
end;
do row=1 to nrows;
set Transactions;
if row >= (nrows-20) then output;
end;
run;
But if you read the data by DESCENDING date then your request converts to finding the FIRST 20 observations. Which is much easier.
data TransRef;
set Transactions;
by Accno descending TranDate;
if first.Accno then row=1;
else row+1;
if row <= 20 ;
run;
I need to keep the latest dated transactions, so if sorting Ascending, then it would be the first 20, if descending it would be the last 20
Does that change anything within the code as i cant seem to get the results i want.
Thanks,
@KMWWN156 wrote:
I need to keep the latest dated transactions, so if sorting Ascending, then it would be the
firstlast 20, if descending it would be the last 20
Does that change anything within the code as i cant seem to get the results i want.
Thanks,
Sort DESCENDING to use FIRST 20.
Proc sort data=Transactions; by Accno descending TranDate; run;
@KMWWN156 wrote:
I need to keep the latest dated transactions, so if sorting Ascending, then it would be the first 20, if descending it would be the last 20
Does that change anything within the code as i cant seem to get the results i want.
Thanks,
Seems like you are saying contradictory things.
In English "latest date" means the most recent dates. So the largest dates. Which would the last few observations when sorted by ascending.
If you mean the oldest dates (or the earliest dates) then that would be the first few observations when sorted by ascending dates.
I assume your data are already sorted by ACCNO/TRANDATE. True, sorting by descending date (within each ACCNO) would allow you to take the first 20 of each ACCNO, a relatively straightforward process. But sorting is expensive, and you might need to re-sort the results by ascending date.
EDIT: Please see revised code at the end of this note.
So here is a two pass process (passing through each ACCNO twice, but in the original order:
data transfer (drop=_:);
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n2>= _n1-20;
run;
Even though you pass through the entire dataset twice, it won't require twice the data-reading resources, because the SET statement interleaves the firstpass for a given ACCNO with the secondpass for the same ACCNO. Since the most recently read observations (from firstpass) will be cached in memory, they will therefore be available to the secondpass without rereading data from the storage device.
EDIT:
I realized the subsetting IF was providing 21 (not 20) observations for each ACCNOW. The code should have used
_n2 > _n1-20
instead of
_n2 >= _n1-20
But there is also a second problem. The code above works fine for each ACCNOW that has 20 or more observations. But it also outputs all observations for any ACCNOW group with fewer than 20 obs. If you want to exclude such small groups, then:
data transfer;
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
if first.accnow then call missing(_n1,_n2);
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n1>=20 and _n2 > _n1-20 ;
run;
proc sort data=sashelp.stocks out=have;
by stock date;
run;
data want;
merge have have(keep=stock rename=(stock=_stock) firstobs=21);
if stock ne _stock;
run;
Or Check @Rick_SAS blog:
https://blogs.sas.com/content/iml/2024/07/10/largest-values-for-group.html
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.