Hello
I want to ask a question about sort a very big data set and take first raw for each ID.
Let's say that the raw data (Called: ttt) has 3 columns: ID, bank, amount.
I want to sort by ID,amount and take for each ID the raw with highest amount.
The problem that this query takes very long time.
Is there a better way to do it?
proc sort data=ttt;
by ID descending amount;
Run;
Data wanted;
set ttt;
by ID ;
if first.ID then output;
run;
I assume that it is the Sort process that takes time?
Is it a requirement to sort the data?
Yes, sort process last forever and never finish.
The data set includes 25 million rows.
Please note that there is no key to the data set so cannot define index.
@Ronein wrote:
Yes, sort process last forever and never finish.
The data set includes 25 million rows.
Please note that there is no key to the data set so cannot define index.
Does your data set contain 100's or 1000's of other variables? If so try the TAGSORT option. That can reduce the size of temporary data sets used in sorting and reduce disk IO common with the sort process.
25 million is not really that large of data set and if it taking a long time and "never" finishes - just how long did you wait? then you may be sorting in an external database? (Use that DB sort tools or pass through code) . I just created a data set with 25 million records, random values which may take longer to sort than grouping variables like IDs, and that took about 9.7 minutes.
or you are working on a really crappy network connection or similar.
If the data is already sorted by ID (but not descending amount), you can skip the sort like this:
data want;
set have;
by id;
if first.id then _max=amount;
else _max= max(_max,amount);
if amount=_max then _bank=bank;
if last.id;
bank=_bank;
amount=_max;
retain _:;
drop _:;
run;
Else you can use a hash table in order to create the table you want without sorting first:
data _null_;
set have end=done;
_Bank=Bank;
_amount=amount;
if _N_=1 then do;
declare hash h(ordered: 'Y');
h.definekey('ID');
h.definedata('ID','Bank','Amount');
h.definedone();
end;
if h.find() then
h.add();
else if _amount>amount then do;
Bank=_Bank;
Amount=_Amount;
h.replace();
end;
if done;
h.output(dataset: 'Want');
run;
This code could replace PROC SORT to save time.
Make a macro and call execute() to go through all the id levels to get then final result by combining these sub-datasets .
data F M; set sashelp.class; select(sex); when ('F') output F; when ('M') output M; else; end; run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.