- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am looking at a large dataset of payment values and want to keep the last 20 records per account - How would i do this? Probably very simple but cannot figure this out. Example code for keeping last 1 record per account is as follows.
Data TransRef;
Set Transactions;
by Accno TranDate;
If Last.Accno;
run;
Thanks
I have tried OBS and retain but no luck
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I assume your data are already sorted by ACCNO/TRANDATE. True, sorting by descending date (within each ACCNO) would allow you to take the first 20 of each ACCNO, a relatively straightforward process. But sorting is expensive, and you might need to re-sort the results by ascending date.
EDIT: Please see revised code at the end of this note.
So here is a two pass process (passing through each ACCNO twice, but in the original order:
data transfer (drop=_:);
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n2>= _n1-20;
run;
Even though you pass through the entire dataset twice, it won't require twice the data-reading resources, because the SET statement interleaves the firstpass for a given ACCNO with the secondpass for the same ACCNO. Since the most recently read observations (from firstpass) will be cached in memory, they will therefore be available to the secondpass without rereading data from the storage device.
EDIT:
I realized the subsetting IF was providing 21 (not 20) observations for each ACCNOW. The code should have used
_n2 > _n1-20
instead of
_n2 >= _n1-20
But there is also a second problem. The code above works fine for each ACCNOW that has 20 or more observations. But it also outputs all observations for any ACCNOW group with fewer than 20 obs. If you want to exclude such small groups, then:
data transfer;
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
if first.accnow then call missing(_n1,_n2);
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n1>=20 and _n2 > _n1-20 ;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To keep the LAST 20 observations when ordered in that way you would need to know how many observations there were for each BY value. One way would be to make two passes thru the data for each BY group. One to count. One to actually output.
data TransRef;
do nrows=1 by 1 until(last.accno);
set Transactions;
by Accno TranDate;
end;
do row=1 to nrows;
set Transactions;
if row >= (nrows-20) then output;
end;
run;
But if you read the data by DESCENDING date then your request converts to finding the FIRST 20 observations. Which is much easier.
data TransRef;
set Transactions;
by Accno descending TranDate;
if first.Accno then row=1;
else row+1;
if row <= 20 ;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I need to keep the latest dated transactions, so if sorting Ascending, then it would be the first 20, if descending it would be the last 20
Does that change anything within the code as i cant seem to get the results i want.
Thanks,
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@KMWWN156 wrote:
I need to keep the latest dated transactions, so if sorting Ascending, then it would be the
firstlast 20, if descending it would be the last 20
Does that change anything within the code as i cant seem to get the results i want.
Thanks,
Sort DESCENDING to use FIRST 20.
Proc sort data=Transactions; by Accno descending TranDate; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@KMWWN156 wrote:
I need to keep the latest dated transactions, so if sorting Ascending, then it would be the first 20, if descending it would be the last 20
Does that change anything within the code as i cant seem to get the results i want.
Thanks,
Seems like you are saying contradictory things.
In English "latest date" means the most recent dates. So the largest dates. Which would the last few observations when sorted by ascending.
If you mean the oldest dates (or the earliest dates) then that would be the first few observations when sorted by ascending dates.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I assume your data are already sorted by ACCNO/TRANDATE. True, sorting by descending date (within each ACCNO) would allow you to take the first 20 of each ACCNO, a relatively straightforward process. But sorting is expensive, and you might need to re-sort the results by ascending date.
EDIT: Please see revised code at the end of this note.
So here is a two pass process (passing through each ACCNO twice, but in the original order:
data transfer (drop=_:);
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n2>= _n1-20;
run;
Even though you pass through the entire dataset twice, it won't require twice the data-reading resources, because the SET statement interleaves the firstpass for a given ACCNO with the secondpass for the same ACCNO. Since the most recently read observations (from firstpass) will be cached in memory, they will therefore be available to the secondpass without rereading data from the storage device.
EDIT:
I realized the subsetting IF was providing 21 (not 20) observations for each ACCNOW. The code should have used
_n2 > _n1-20
instead of
_n2 >= _n1-20
But there is also a second problem. The code above works fine for each ACCNOW that has 20 or more observations. But it also outputs all observations for any ACCNOW group with fewer than 20 obs. If you want to exclude such small groups, then:
data transfer;
set transactions (in=firstpass)
transactions (in=secondpass) ;
by accnow;
if first.accnow then call missing(_n1,_n2);
_n1 + firstpass;
_n2 + secondpass;
if secondpass=1 and _n1>=20 and _n2 > _n1-20 ;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc sort data=sashelp.stocks out=have;
by stock date;
run;
data want;
merge have have(keep=stock rename=(stock=_stock) firstobs=21);
if stock ne _stock;
run;
Or Check @Rick_SAS blog:
https://blogs.sas.com/content/iml/2024/07/10/largest-values-for-group.html