02-01-2018 08:33 PM
I have the below code which will sort by "Ref" and in the following datastep I use if first.Ref to return a Unique count.
My question is. Is there a better more efficient way to do the proc sort or another way altogether to achieve what I doing.
The Proc Sort takes a while to run as I have over 17 million rows in the data I'm using.
proc sort data=TestI; by Ref; run; data TestII; set TestI; by Ref; if first. Ref then Unique_Count=1; if Balance >0 then IDB=Ref; if Balance >0 then Balance_Count =1; run;;
02-02-2018 01:57 AM
Bottom line: proc sort is the quickest way to get a dataset sorted in SAS.
But there are some tunable things.
Run the sort with
so you get a clear picture if you're running out of CPU power or suffer form I/O waits.
Separate your WORK and utility location (use the utilloc= option in your configuration file), so that they sit on physically separate volumes, if you use disks internal to the server.
If your disks sit on a SAN, check with the SAN admin for possible hostspots, or other ways to improve storage performance.
If you work with a compressed dataset (and a considerable compression rate), try the tagsort option in the proc sort statement. This prevents writing of an uncompressed utility file.
Some summary-related things can be done easier with proc summary/means. Depending on the cardinality of Ref, you might be better off with using proc summary with
avoiding the sort altogether. It's just that the summary structures for all unique values of Ref need to fit into available memory.
02-02-2018 03:31 AM
A couple of things to consider ...
First, where are you headed with the results? Depending on the planned analysis steps, there may be another way to get there. It's easy enough to add to a subsequent analysis step: where balance > 0;
Second, does the subsequent analysis require sorted data, or does it only require that the three new variables exist? SQL (or even PROC FREQ) can get a count of the number of distinct values of REF.
A good hash programmer (not me) should be able to create the three variables without sorting. Well, any SAS programmer could create two out of the three. But Unique_Count could be created by:
Create a HASH table based on REF. Along the way, for each observation check whether that REF value already exists in the hash table. If not, give Unique_Count a value of 1, and add that REF value to the hash table.
02-02-2018 08:08 AM - edited 02-03-2018 04:15 AM
Here are some option to fast PROC SORT.
options bufno=100 bufsize=128k; proc sort data=sashelp.class out=class sortsize=max noequals;
by sex; run;
Another fast way is spliting the big table into smalle table ,and combine them together.
You can write a macro to automatic it.
data F M; set sashelp.class; if sex='F' then output F; else if sex='M' then output M; run; data want; set F M; run;
02-02-2018 10:42 AM
Depending on your platform, there are also host specific options that may help. For example, on Windows Server I've used SGIO (for mulit-gigabyte sized SAS data sets) and MEMLIB (to process the data in memory).