BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello

I want to ask a question about sort a very big data set and take first raw for each ID.

Let's say that the raw data (Called: ttt) has 3 columns: ID, bank, amount.

I want to sort by ID,amount and take for each ID the raw with highest amount.

 The problem that this query takes very long time.

Is there a better way to do it?

 

 

proc sort data=ttt;
by ID descending amount;
Run;
Data wanted;
set ttt;
by ID ;
if first.ID then output;
run;

 

 

 

5 REPLIES 5
PeterClemmensen
Tourmaline | Level 20

I assume that it is the Sort process that takes time?

 

Is it a requirement to sort the data?

Ronein
Meteorite | Level 14

Yes, sort process last forever and never finish.

The data set includes 25 million rows.

Please note that there is no key to the data set so cannot define index.

 

ballardw
Super User

@Ronein wrote:

Yes, sort process last forever and never finish.

The data set includes 25 million rows.

Please note that there is no key to the data set so cannot define index.

 


Does your data set contain 100's or 1000's of other variables? If so try the TAGSORT option. That can reduce the size of temporary data sets used in sorting and reduce disk IO common with the sort process.

 

25 million is not really that large of data set and if it taking a long time and "never" finishes - just how long did you wait? then you may be sorting in an external database? (Use that DB sort tools or pass through code) . I just created a data set with 25 million records, random values which may take longer to sort than grouping variables like IDs, and that took about 9.7 minutes.

or you are working on a really crappy network connection or similar.

 

s_lassen
Meteorite | Level 14

If the data is already sorted by ID (but not descending amount), you can skip the sort like this:

data want;
  set have;
  by id;
  if first.id then _max=amount;
  else _max= max(_max,amount);
  if amount=_max then _bank=bank;
  if last.id;
  bank=_bank;
  amount=_max;
  retain _:;
  drop _:;
run;

Else you can use a hash table in order to create the table you want without sorting first:

data _null_;
  set have end=done;
  _Bank=Bank; 
  _amount=amount;
  if _N_=1 then do;
    declare hash h(ordered: 'Y');
    h.definekey('ID');
    h.definedata('ID','Bank','Amount');
    h.definedone();
    end;
  if h.find() then
    h.add();
  else if _amount>amount then do;
    Bank=_Bank;
    Amount=_Amount;
    h.replace();
    end;
  if done;
  h.output(dataset: 'Want');
run;

 

Ksharp
Super User

This code could replace PROC SORT to save time.

Make a macro and call execute() to go through all the id levels to get then final result by combining these sub-datasets .

 

data F M;
 set sashelp.class;
 select(sex);
 when ('F')  output F;
 when ('M')  output M;
 else;
 end;
run;

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 807 views
  • 3 likes
  • 5 in conversation