Manipulating Data in Base SAS Part 2 – Sort

3 Likes

The data we have on hand often contains the information we need, but not sequenced in the order required for processing. Reporting, comparing the data in one table to the data in another, and conducting merges or joins all require properly sequenced data. The technique that produces the desired result most quickly varies depending on factors such as how and where the data is stored, the size of the data, and the number of times ordered data will be read.

Data Sources

SAS data sets store data in the order it is written to the disk file. Many SAS processes, like DATA step and PROC PRINT, read and process the data in the order in which it was stored. Consider this data set and the desired report:

Crime data
Row	DATE_OCC	TIME_OCC	AREA
1	01-01-2024	0:00:05	3
2	01-01-2024	0:30:00	4
3	01-08-2024	0:00:16	5
4	01-16-2024	0:32:10	9
5	01-18-2024	0:33:35	7
6	01-26-2024	0:31:40	1
7	01-27-2024	0:13:20	6
8	01-27-2024	0:32:30	8
9	01-30-2024	0:26:40	2

Desired Report
	AREA	DATE_OCC	TIME_OCC
	1	01-26-2024	0:31:40
	2	01-30-2024	0:26:40
	3	01-01-2024	0:00:05
	4	01-01-2024	0:30:00
	5	01-08-2024	0:00:16
	6	01-27-2024	0:13:20
	7	01-18-2024	0:33:35
	8	01-27-2024	0:32:30
	9	01-16-2024	0:32:10

Note that the data is stored chronologically, but the report must be written in AREA order. No matter what technique we choose, sorting of the data will be required. So, what difference does it make if we choose one technique over the other? Today, we will look at several techniques we can use to produce this report, and their impact on resources and elapsed time. Some of these techniques hold the data in memory, and will be more useful when the data small enough to fit into memory all at once. Other techniques use significant amounts of disk space and disk I/O. These will be more useful when the data is too large to fit into available memory.

PROC SORT

If you need to read the sorted version of the data multiple times, you might use PROC SORT to create sorted a copy of the data stored in the specified order. For example, this code sorts the data “in place”, that is, it reproduces the original dataset stored in the new order:

Code:

proc sort data=la.crime_sample;
	by area;
run;
proc print data=la.crime_sample noobs;
	var area date_occ time_occ;
run;

Log:

NOTE: PROCEDURE SORT used (Total process time):
      real time           0.10 seconds
      cpu time            0.01 seconds
…
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.07 seconds
      cpu time            0.01 seconds

The process uses a total of 0.17 sec of elapsed time. Here are the simplified steps that happened during processing:

The dataset crime_sample was read into memory from disk.
The data was sorted.
The sorted data was written to disk using a temporary name, for example _temp1.
The original dataset, crime_sample, was then deleted, and
The temporary dataset was renamed crime_sample.

NOTE:

The entire data set is read into memory then written back out to disk. Disk I/O is usually the slowest part of the process, so we expect this to take longer than in-memory techniques.
Because the sorted data is stored on disk, we can subsequently access the data in this order many times without having to sort the data again.
At the end of step 3, we need enough disk space to hold 2 full copies of the data.

You could improve the process a little by using the OUT= option to write the results to a separate library instead of sorting in place. This eliminates the need to rename the output and delete the original. And if the other library is located on a separate disk, this can often improve I/O speed.

One drawback is the need to read the data from disk again each time the data is accessed. If you have enough memory, you can overcome this using the SASFILE statement. With the SASFILE statement, you can load the entire dataset into memory and then use the in-memory copy for subsequent processing without incurring additional disk I/O.

Code:

proc sort data=la.crime_sample 
			 out=work.crime_sample;
	by area;
run;

/* Use SASFILE to hold the data in memory */
sasfile work.crime_sample load;

proc print data=work.crime_sample noobs;
	var area date_occ time_occ;
run;

Log:

NOTE: PROCEDURE SORT used (Total process time):
      real time           0.09 seconds
      cpu time            0.01 seconds
…
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.05 seconds
      cpu time            0.03 seconds

The process uses a total of 0.14 sec of elapsed time. The time required to sort was about the same, but it took less time for PROC PRINT to read the data because you are using an in-memory copy. Any subsequent step that reads this data will also benefit, because you don’t have to get the data from disk each time you use it. But don’t forget to close the file to release the memory when you are finished processing!

sasfile work.crime_sample close;

PROC SQL/FedSQL ORDER BY

If you only need the sorted values once, consider using SQL to avoid having to pre-sort the data. PROC SQL and PROC FedSQL will automatically sort the result set as specified by the ORDER BY clause, without the need to write an intermediate copy of the sorted data to disk.

Code:

proc sql;
select area
      ,date_occ
      ,time_occ
   from la.crime_sample as cs
   order by area
;
quit;

Log:

NOTE: PROCEDURE SQL used (Total process time):
      real time           0.07 seconds
      cpu time            0.00 seconds

The total elapsed time for this process was only 0.07 seconds – about half of that required for the PROC SORT / PROC PRINT combination.

DATA Step Hash Object

Now, we don’t often think of using a DATA step to sort data, but it’s possible. Because we’re using a HASH object, we’ll once again need enough memory to hold the entire dataset.

Code:

data _null_;
   if (_n_ = 1) then do;
      declare hash myhash(dataset: "la.crime_sample" /*Input dataset*/
                         ,multidata: "yes" /*Allow duplicate keys*/
                         ,ordered:"ascending"); /*Sort the data by key*/
      rc = myhash.definekey('area'); /*Define the key*/
	/*List the variables you want in the output dataset*/
      rc = myhash.definedata('area','date_occ','time_occ'); 
      myhash.definedone();
      call missing(area, date_occ, time_occ); /*precludes a warning*/
   end; 
   /*Write out the output dataset*/
   rc = myhash.output(dataset:"work.crime_sample");
run;

Log:

NOTE: DATA statement used (Total process time):
      real time           0.10 seconds
      cpu time            0.01 seconds

The time required to sort the data is about the same as when using PROC SORT, and PROC SORT has more options and flexibility, so while it can be done, I wouldn’t consider using this technique unless I needed the hash object for some other process in the DATA step.

Now, one of the primary reasons I've had to sort data in the past is to perform a DATA step merge. When a BY statement is present in SAS code (except for PROC SORT of course), SAS assumes you have pre-sorted the data, and an error is generated if an out-of-sequence value is encountered. Did you know that one of the joys of using SAS/ACCESS to work with database tables in SAS is that you never have to pre-sort database data? Data is retrieved from the database using generated SQL and, if a BY statement is found in the SAS code, an appropriate ORDER BY clause is automatically included. Pretty slick, no? Can you name other conditions where a BY statement doesn't require pre-sorted data? I can think of at least one 😁

Until next time, may the SAS be with you!
Mark

Grab the ZIP file containing the code and data for this blog series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

Links to prior posts in this series:

Part 1 – Append