The data we have on hand often contains the information we need, but not sequenced in the order required for processing. Reporting, comparing the data in one table to the data in another, and conducting merges or joins all require properly sequenced data. The technique that produces the desired result most quickly varies depending on factors such as how and where the data is stored, the size of the data, and the number of times ordered data will be read.
SAS data sets store data in the order it is written to the disk file. Many SAS processes, like DATA step and PROC PRINT, read and process the data in the order in which it was stored. Consider this data set and the desired report:
|
|
Note that the data is stored chronologically, but the report must be written in AREA order. No matter what technique we choose, sorting of the data will be required. So, what difference does it make if we choose one technique over the other? Today, we will look at several techniques we can use to produce this report, and their impact on resources and elapsed time. Some of these techniques hold the data in memory, and will be more useful when the data small enough to fit into memory all at once. Other techniques use significant amounts of disk space and disk I/O. These will be more useful when the data is too large to fit into available memory.
If you need to read the sorted version of the data multiple times, you might use PROC SORT to create sorted a copy of the data stored in the specified order. For example, this code sorts the data “in place”, that is, it reproduces the original dataset stored in the new order:
Code:
proc sort data=la.crime_sample;
by area;
run;
proc print data=la.crime_sample noobs;
var area date_occ time_occ;
run;
Log:
NOTE: PROCEDURE SORT used (Total process time): real time 0.10 seconds cpu time 0.01 seconds … NOTE: PROCEDURE PRINT used (Total process time): real time 0.07 seconds cpu time 0.01 seconds
The process uses a total of 0.17 sec of elapsed time. Here are the simplified steps that happened during processing:
NOTE:
You could improve the process a little by using the OUT= option to write the results to a separate library instead of sorting in place. This eliminates the need to rename the output and delete the original. And if the other library is located on a separate disk, this can often improve I/O speed.
One drawback is the need to read the data from disk again each time the data is accessed. If you have enough memory, you can overcome this using the SASFILE statement. With the SASFILE statement, you can load the entire dataset into memory and then use the in-memory copy for subsequent processing without incurring additional disk I/O.
Code:
proc sort data=la.crime_sample
out=work.crime_sample;
by area;
run;
/* Use SASFILE to hold the data in memory */
sasfile work.crime_sample load;
proc print data=work.crime_sample noobs;
var area date_occ time_occ;
run;
Log:
NOTE: PROCEDURE SORT used (Total process time):
real time 0.09 seconds
cpu time 0.01 seconds
…
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.05 seconds
cpu time 0.03 seconds
The process uses a total of 0.14 sec of elapsed time. The time required to sort was about the same, but it took less time for PROC PRINT to read the data because you are using an in-memory copy. Any subsequent step that reads this data will also benefit, because you don’t have to get the data from disk each time you use it. But don’t forget to close the file to release the memory when you are finished processing!
sasfile work.crime_sample close;
If you only need the sorted values once, consider using SQL to avoid having to pre-sort the data. PROC SQL and PROC FedSQL will automatically sort the result set as specified by the ORDER BY clause, without the need to write an intermediate copy of the sorted data to disk.
Code:
proc sql;
select area
,date_occ
,time_occ
from la.crime_sample as cs
order by area
;
quit;
Log:
NOTE: PROCEDURE SQL used (Total process time):
real time 0.07 seconds
cpu time 0.00 seconds
The total elapsed time for this process was only 0.07 seconds – about half of that required for the PROC SORT / PROC PRINT combination.
Now, we don’t often think of using a DATA step to sort data, but it’s possible. Because we’re using a HASH object, we’ll once again need enough memory to hold the entire dataset.
Code:
data _null_;
if (_n_ = 1) then do;
declare hash myhash(dataset: "la.crime_sample" /*Input dataset*/
,multidata: "yes" /*Allow duplicate keys*/
,ordered:"ascending"); /*Sort the data by key*/
rc = myhash.definekey('area'); /*Define the key*/
/*List the variables you want in the output dataset*/
rc = myhash.definedata('area','date_occ','time_occ');
myhash.definedone();
call missing(area, date_occ, time_occ); /*precludes a warning*/
end;
/*Write out the output dataset*/
rc = myhash.output(dataset:"work.crime_sample");
run;
Log:
NOTE: DATA statement used (Total process time): real time 0.10 seconds cpu time 0.01 seconds
The time required to sort the data is about the same as when using PROC SORT, and PROC SORT has more options and flexibility, so while it can be done, I wouldn’t consider using this technique unless I needed the hash object for some other process in the DATA step.
Now, one of the primary reasons I've had to sort data in the past is to perform a DATA step merge. When a BY statement is present in SAS code (except for PROC SORT of course), SAS assumes you have pre-sorted the data, and an error is generated if an out-of-sequence value is encountered. Did you know that one of the joys of using SAS/ACCESS to work with database tables in SAS is that you never have to pre-sort database data? Data is retrieved from the database using generated SQL and, if a BY statement is found in the SAS code, an appropriate ORDER BY clause is automatically included. Pretty slick, no? Can you name other conditions where a BY statement doesn't require pre-sorted data? I can think of at least one 😁
Until next time, may the SAS be with you!
Mark
Grab the ZIP file containing the code and data for this blog series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip
Links to prior posts in this series:
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.