BookmarkSubscribeRSS Feed
Ronein
Onyx | Level 15

Hello

I want to ask a question about sort a very big data set and take first raw for each ID.

Let's say that the raw data (Called: ttt) has 3 columns: ID, bank, amount.

I want to sort by ID,amount and take for each ID the raw with highest amount.

 The problem that this query takes very long time.

Is there a better way to do it?

 

 

proc sort data=ttt;
by ID descending amount;
Run;
Data wanted;
set ttt;
by ID ;
if first.ID then output;
run;

 

 

 

5 REPLIES 5
PeterClemmensen
Tourmaline | Level 20

I assume that it is the Sort process that takes time?

 

Is it a requirement to sort the data?

Ronein
Onyx | Level 15

Yes, sort process last forever and never finish.

The data set includes 25 million rows.

Please note that there is no key to the data set so cannot define index.

 

ballardw
Super User

@Ronein wrote:

Yes, sort process last forever and never finish.

The data set includes 25 million rows.

Please note that there is no key to the data set so cannot define index.

 


Does your data set contain 100's or 1000's of other variables? If so try the TAGSORT option. That can reduce the size of temporary data sets used in sorting and reduce disk IO common with the sort process.

 

25 million is not really that large of data set and if it taking a long time and "never" finishes - just how long did you wait? then you may be sorting in an external database? (Use that DB sort tools or pass through code) . I just created a data set with 25 million records, random values which may take longer to sort than grouping variables like IDs, and that took about 9.7 minutes.

or you are working on a really crappy network connection or similar.

 

s_lassen
Meteorite | Level 14

If the data is already sorted by ID (but not descending amount), you can skip the sort like this:

data want;
  set have;
  by id;
  if first.id then _max=amount;
  else _max= max(_max,amount);
  if amount=_max then _bank=bank;
  if last.id;
  bank=_bank;
  amount=_max;
  retain _:;
  drop _:;
run;

Else you can use a hash table in order to create the table you want without sorting first:

data _null_;
  set have end=done;
  _Bank=Bank; 
  _amount=amount;
  if _N_=1 then do;
    declare hash h(ordered: 'Y');
    h.definekey('ID');
    h.definedata('ID','Bank','Amount');
    h.definedone();
    end;
  if h.find() then
    h.add();
  else if _amount>amount then do;
    Bank=_Bank;
    Amount=_Amount;
    h.replace();
    end;
  if done;
  h.output(dataset: 'Want');
run;

 

Ksharp
Super User

This code could replace PROC SORT to save time.

Make a macro and call execute() to go through all the id levels to get then final result by combining these sub-datasets .

 

data F M;
 set sashelp.class;
 select(sex);
 when ('F')  output F;
 when ('M')  output M;
 else;
 end;
run;

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1590 views
  • 3 likes
  • 5 in conversation