Solved: Re: PROC SORT efficiency consideration

aaronh · Posted 01-29-2020 02:26 PM

Suppose I have one data set of economic data by US States (named "master"), and I want to divide this data set by individual states.

I want to be able to perform DATA STEP merge later using the individual state data, so I need PROC SORT (or SQL order by) before I can do this.

Considering the size of my "master" data (50 GB), I am curious which of the following will be more efficient and why. For demonstration purposes, I am only pulling data for 4 US states: NC, SC, GA, and FL. The target file that I want is called sorted_NC, sorted_SC, sorted_GA, and sorted_FL, respectively.

OPTION 1: Passing Through "Master" Table Multiple Times

%macro sort_state(location);
proc sort data = master (where=(state="&location"))
out = sorted_&location;
by var1 var2;
run;
%mend sort_state;

%sort_state(NC)
%sort_state(SC)
%sort_state(GA)
%sort_state(FL)

OPTION 2: Passing Through "Master" Table One Time Only, Then Sort

data NC SC GA FL;
set master;
if state = "NC" then output NC;
if state = "SC" then output SC;
if state = "GA" then output GA;
if state = "FL" then output FL;
run;

%macro sort2(location);
proc sort data = &location.
out = sorted_&location.;
by var1 var2;
run;
%mend;

%sort2(NC)
%sort2(SC)
%sort2(GA)
%sort2(FL)

I really appreciate any input/comment you have.

-AH

ballardw · Posted 01-29-2020 02:54 PM

If Master doesn't change frequently they you a doing more work than needed. Sort the Master set by State and other variables used in a merge by and use a Where data set option to select the records to merge.

proc sort data=master;
   by state var1 var2;
run;

/* merge example
   assumes your HAVE set is 
   sorted by the same BY
   variables
*/

data want;
   merge have
         master (where=(state in ('AL' 'GA') ))
   ;
   by state var1 var2 ;
run;

The sort for Master should only need to be done once (I assume that you actually have it in a permanent library so the set is always available).

If you aren't merging on the STATE variable then sort by just the Var1 and Var2. The where option would still be used to subset the data.

Another option would be proc sql. Without a more concrete example I can't provide much of dummy code but here is a skeleton:

proc sql;
   create table want as
   select a.var1, a.var2, a.var3
          ,b.somevar, b.somevar2, b.somevar3
   from yourworkingset as a
        left join
        (select * from Master
         where state='AL') as b
      on a.var1=b.var1
      and a.var2=b.var2
   ;
quit;

There can be some performance issues if the master is big enough but you need not actually sort it as the WHERE selects the records you are interested on and the ON specifies how to match the records from the two sets. The A and B are aliases to indicate which data set the referenced variable comes from. If you use different "by" variables to combine the data often then this may be a more flexible approach than constantly creating multiple data sets.

View solution in original post

ballardw · Posted 01-29-2020 02:54 PM

If Master doesn't change frequently they you a doing more work than needed. Sort the Master set by State and other variables used in a merge by and use a Where data set option to select the records to merge.

proc sort data=master;
   by state var1 var2;
run;

/* merge example
   assumes your HAVE set is 
   sorted by the same BY
   variables
*/

data want;
   merge have
         master (where=(state in ('AL' 'GA') ))
   ;
   by state var1 var2 ;
run;

The sort for Master should only need to be done once (I assume that you actually have it in a permanent library so the set is always available).

If you aren't merging on the STATE variable then sort by just the Var1 and Var2. The where option would still be used to subset the data.

Another option would be proc sql. Without a more concrete example I can't provide much of dummy code but here is a skeleton:

proc sql;
   create table want as
   select a.var1, a.var2, a.var3
          ,b.somevar, b.somevar2, b.somevar3
   from yourworkingset as a
        left join
        (select * from Master
         where state='AL') as b
      on a.var1=b.var1
      and a.var2=b.var2
   ;
quit;

There can be some performance issues if the master is big enough but you need not actually sort it as the WHERE selects the records you are interested on and the ON specifies how to match the records from the two sets. The A and B are aliases to indicate which data set the referenced variable comes from. If you use different "by" variables to combine the data often then this may be a more flexible approach than constantly creating multiple data sets.

aaronh · Posted 01-30-2020 08:36 AM

Thank you for your detailed suggestions! One question I have is: if I use the data merge like you suggested, wouldn't SAS still need to go through the "master" table many times because I will have to run a DATA STEP merge for each individual state (each state has its own unique dataset to merge into)

aaronh · Posted 01-30-2020 10:22 AM

It turns out that running the PROC SORT first in this case saved me about 60% of time, as compared to the Option 1 I had. When I ran Option 1, it took a total of about 3 hours to complete, whereas the lines below finished within 1 hour. I haven't compared it with Option 2 yet, but I suspect it will be comparable.

Because the individual state data sets are all sorted by the same variables, I could just do:

proc sort data = master
out = master_sorted;
by var1 var2;
run;

data NC SC GA FL;
set master_sorted;
if state = "NC" then output NC;
if state = "SC" then output SC;
if state = "GA" then output GA;
if state = "FL" then output FL;
run;

Thanks again for your input!

-AH

Satish_Parida · Posted 01-30-2020 04:47 AM

What is the issue, the space taken by temp files during sort or the time ?

1. Provided Space is your constraint then you have to do it like you have done.

2. If Time is constraint then you can go for multiple sessions running at same time on different splits of the dataset. We can use Proc sort thread as well.

So Could you pin point your area of pain.

PROC SORT efficiency consideration

Re: PROC SORT efficiency consideration

Re: PROC SORT efficiency consideration

Re: PROC SORT efficiency consideration

Re: PROC SORT efficiency consideration

Re: PROC SORT efficiency consideration

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!