BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
KMWWN156
Calcite | Level 5

I am looking at a large dataset of payment values and want to keep the last 20 records per account - How would i do this? Probably very simple but cannot figure this out. Example code for keeping last 1 record per account is as follows.

Data TransRef;
Set Transactions;
by Accno TranDate;
If Last.Accno;
run;

Thanks

I have tried OBS and retain but no luck

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

I assume your data are already sorted by ACCNO/TRANDATE.   True, sorting by descending date (within each ACCNO) would allow you to take the first 20 of each ACCNO, a relatively straightforward process.  But sorting is expensive, and you might need to re-sort the results by ascending date.

 

EDIT:  Please see revised code at the end of this note.

 

So here is a two pass process (passing through each ACCNO twice, but in the original order:

 

data transfer (drop=_:);
  set transactions (in=firstpass)
      transactions (in=secondpass) ;
  by accnow;
  _n1 + firstpass;
  _n2 + secondpass;
  if secondpass=1 and _n2>= _n1-20;
run;

Even though you pass through the entire dataset twice, it won't require twice the data-reading resources, because the SET statement interleaves the firstpass for a given ACCNO with the secondpass for the same ACCNO.  Since the most recently read observations (from firstpass) will be cached in memory, they will therefore be available to the secondpass without rereading data from the storage device.

 

EDIT:

 

I realized the subsetting IF was providing 21 (not 20) observations for each ACCNOW.  The code should have used

_n2 > _n1-20 

instead of 

_n2 >= _n1-20 

But there is also a second problem.  The code above works fine for each ACCNOW that has 20 or more observations.  But it also outputs all observations for any ACCNOW group with fewer than 20 obs.  If you want to exclude such small groups, then:

 

data transfer;
  set transactions (in=firstpass)
      transactions (in=secondpass) ;
  by accnow;
  if first.accnow then call missing(_n1,_n2);
  _n1 + firstpass;
  _n2 + secondpass;
  if secondpass=1 and _n1>=20 and _n2 > _n1-20 ;
run;

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

To keep the LAST 20 observations when ordered in that way you would need to know how many observations there were for each BY value.  One way would be to make two passes thru the data for each BY group.  One to count. One to actually output.

data TransRef;
  do nrows=1 by 1 until(last.accno);
    set Transactions;
    by Accno TranDate;
  end;
  do row=1 to nrows;
    set Transactions;
    if row >= (nrows-20) then output;
  end;
run;

But if you read the data by DESCENDING date then your request converts to finding the FIRST 20 observations.  Which is much easier.

data TransRef;
  set Transactions;
  by Accno descending TranDate;
  if first.Accno then row=1;
  else row+1;
  if row <= 20 ;
run;
KMWWN156
Calcite | Level 5

I need to keep the latest dated transactions, so if sorting Ascending, then it would be the first 20, if descending it would be the last 20

 

Does that change anything within the code as i cant seem to get the results i want.

 

Thanks,

ballardw
Super User

@KMWWN156 wrote:

I need to keep the latest dated transactions, so if sorting Ascending, then it would be the first last 20, if descending it would be the last 20

 

Does that change anything within the code as i cant seem to get the results i want.

 

Thanks,


Sort DESCENDING to use FIRST 20.

Proc sort data=Transactions;
  by Accno descending TranDate;
run;

 

Tom
Super User Tom
Super User

@KMWWN156 wrote:

I need to keep the latest dated transactions, so if sorting Ascending, then it would be the first 20, if descending it would be the last 20

 

Does that change anything within the code as i cant seem to get the results i want.

 

Thanks,


Seems like you are saying contradictory things. 

 

In English "latest date" means the most recent dates. So the largest dates.  Which would the last few observations when sorted by ascending.

 

If you mean the oldest dates (or the earliest dates) then that would be the first few observations when sorted by  ascending dates. 

mkeintz
PROC Star

I assume your data are already sorted by ACCNO/TRANDATE.   True, sorting by descending date (within each ACCNO) would allow you to take the first 20 of each ACCNO, a relatively straightforward process.  But sorting is expensive, and you might need to re-sort the results by ascending date.

 

EDIT:  Please see revised code at the end of this note.

 

So here is a two pass process (passing through each ACCNO twice, but in the original order:

 

data transfer (drop=_:);
  set transactions (in=firstpass)
      transactions (in=secondpass) ;
  by accnow;
  _n1 + firstpass;
  _n2 + secondpass;
  if secondpass=1 and _n2>= _n1-20;
run;

Even though you pass through the entire dataset twice, it won't require twice the data-reading resources, because the SET statement interleaves the firstpass for a given ACCNO with the secondpass for the same ACCNO.  Since the most recently read observations (from firstpass) will be cached in memory, they will therefore be available to the secondpass without rereading data from the storage device.

 

EDIT:

 

I realized the subsetting IF was providing 21 (not 20) observations for each ACCNOW.  The code should have used

_n2 > _n1-20 

instead of 

_n2 >= _n1-20 

But there is also a second problem.  The code above works fine for each ACCNOW that has 20 or more observations.  But it also outputs all observations for any ACCNOW group with fewer than 20 obs.  If you want to exclude such small groups, then:

 

data transfer;
  set transactions (in=firstpass)
      transactions (in=secondpass) ;
  by accnow;
  if first.accnow then call missing(_n1,_n2);
  _n1 + firstpass;
  _n2 + secondpass;
  if secondpass=1 and _n1>=20 and _n2 > _n1-20 ;
run;

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
proc sort data=sashelp.stocks out=have;
by stock date;
run;
data want;
 merge have have(keep=stock rename=(stock=_stock) firstobs=21);
 if stock ne _stock;
run;

Or Check @Rick_SAS blog:

https://blogs.sas.com/content/iml/2024/07/10/largest-values-for-group.html

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 355 views
  • 4 likes
  • 5 in conversation