BookmarkSubscribeRSS Feed

Frequently a SAS programmer needs to both sort and divide a dataset. Typically these tasks are satisfied via a two stage program:   (1) divide the dataset, (2) sort the parts. Something like:

 

data  _asia (where=(origin='Asia'))  cars_europe (where=(origin='Europe'))
      cars_usa (where=(origin='USA')) /view=_asia;
  set sashelp.cars;
run;
proc sort data=_asia out=cars_asia;
  by msrp;
run;
proc sort data=cars_europe;
  by msrp;
run;
proc sort data=cars_usa;
  by msrp;
run;

Enabling proc sort to do both of these tasks (with syntax such as below) would often significantly reduce input/output activity, and offer a bit of code simplification: 

proc sort data=sashelp.cars
  out=cars_asia   (where=(origin='Asia'))
      cars_europe (where=(origin='Europe'))
      cars_usa    (where=(origin='USA'))    ;
  by msrp;
run;

 

Of course, the multiple output datasets need not be mutually exclusive like the example above.

18 Comments
mkeintz
PROC Star

I like the use of the output view, which I expected to replicate the proc sort modification I am requesting

 

But either my initial conjecture of saving disk i/o via multiple outputs from proc sort is wrong, or the use of the output view is not a good simulation of multiple proc sort outputs.

 

I tested using this data set:

data cars_times_10000;
  set sashelp.cars;
  length  blk $1000;  /*Fatten the record */
  retain blk '';
  output;
  do i=1 to 9999;
    msrp=msrp+1;
    output;
  end;
run;

The program below (using my initial "current best practice") took a total of

Real Time=14.2 seconds, User CPU time=5.9 seconds, System CPU time=13.5 seconds

data va (where=(origin='Asia'))
     e (where=(origin='Europe'))
	 u (where=(origin='USA'))
  /view=va;
  set cars_times_10000;
run;

proc sort data=va out=a; by msrp descending i;run;

proc sort data=e  out=e; by msrp descending i;run;

proc sort data=u  out=u; by msrp descending i;run;

And below is the code using an output view. It takes about five times as long, with:

Real Time=75.2 seconds, User CPU time=43.3 seconds, System CPU time=35.6 seconds

data a(where=(origin='Asia'))
     e(where=(origin='Europe'))
     u(where=(origin='USA')) 
     / 
     view=split_sort;

  if 0 then set cars_times_10000;
  set split_sort;
run;

proc sort data=cars_times_10000 out=split_sort; 
  by msrp descending i;
run;
jimbarbour
Meteorite | Level 14

@mkeintz,

 

I ran your code just to see what it would do on my SAS 9.4 M6 Windows 64 bit system. It was about 30 seconds total real time for the "three sort" method.  The "triple view, one sort" method is a little harder to interpret.  There are not one but two "real times" listed for the Sort step.  I think one is preliminary and the other final.  In other words, I think it was about 1 minute, 9 seconds total (not 2 minutes, 18 seconds).  I suppose the first "real time" is to read in the data (execute the view) and the second real time adds in the sort.  The Sort appears to be lightning fast, but the up front cost (executing the view) is more than double the time the "three sort" method took.

 

NOTE: There were 4280000 observations read from the data set WORK.CARS_TIMES_10000.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.SPLIT_SORT has 4280000 observations and 17 variables.
NOTE: View WORK.SPLIT_SORT.VIEW used (Total process time):
      real time           0:01:09.01
      cpu time            0:01:11.95
      Timestamp           June 18, 2021 17:11:50
      
NOTE: The data set WORK.A has 1580000 observations and 17 variables.
NOTE: Compressing data set WORK.A decreased size by 90.71 percent. 
      Compressed is 2622 pages; un-compressed would require 28215 pages.
NOTE: The data set WORK.E has 1230000 observations and 17 variables.
NOTE: Compressing data set WORK.E decreased size by 90.55 percent. 
      Compressed is 2076 pages; un-compressed would require 21965 pages.
NOTE: The data set WORK.U has 1470000 observations and 17 variables.
NOTE: Compressing data set WORK.U decreased size by 90.59 percent. 
      Compressed is 2470 pages; un-compressed would require 26251 pages.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0:01:09.12
      cpu time            0:01:12.04
      Timestamp           June 18, 2021 17:11:51

Maybe when it moves beyond experimental, the run times will be better than the "three sort" method.  @FriedEgg, any "ballpark" projection when this might go to full Production?  I assume that it would be for SAS 9.4 not Viya, perhaps that's not a valid assumption.  I'm not sure further releases of 9.4 are forthcoming given SAS's emphasis on Viya.

 

Jim

ChrisNZ
Tourmaline | Level 20

Views incur a high performance hit, and by reading this it looks like output views are even worse.

 

Also, note that since the views use a data step, they have the disadvantage of not setting the SORTEDBY flag, which has massive performance consequences as SAS does not know the table is sorted and sorts it again is needed. This is a high priority issue for me: a small change in the data step output logic yields huge benefits.

 

PhilC
Rhodochrosite | Level 12

2 1/2 months later... so interesting.  I considered this being like a SQL view, which sorts and is "piped" to a data step that splits.

proc sql;
  create view sorted_view as 
    select *
    from cars_times_10000
    order by msrp;

data a(where=(origin='Asia'))
     e(where=(origin='Europe'))
     u(where=(origin='USA')) 
    ;
 set sorted_view;
   by msrp;
run;

So I'm so embarrassed now that my system is SO slow.  I benched marked this against Keintz's and this performs 50% longer than his 'best practice" on my system.  

 

This performs equivalent to Keintz's best practice.  I have a new and increased faith in PROC SQL's optimizations.

proc sql;
  create view sorted_view as 
    select *
    from cars_times_10000
    order by msrp;
  create table a3 as 
    select *
      from sorted_view
      where origin='Asia';
  create table e3 as 
    select *
      from sorted_view
      where origin='Europe';
  create table u3 as 
    select *
      from sorted_view
      where origin='USA';
quit;

 

mkeintz
PROC Star

@PhilC    Your code has

 

proc sql;
  create view sorted_view as 
    select *
    from cars_times_10000
    order by msrp;

*** followed by other code ***;

But it sorts only by MSRP.  My sample sorted by MSRP descending i.   And since the data started out with ascending i=1 to 10000 for each record, there is definitely more sort work to be done than in your example.  I'd be interested to know whether the proc sql performance holds up if you use

    order by msrp, descending i;

 

BTW, it's Keintz, not Kintz

PhilC
Rhodochrosite | Level 12

it did!

I didn't think that the descending or ascending made a difference.  My first run of the descending case indicates that if it does make a difference, it's a small difference. 

 

Please excuse my mistake, it's corrected.

rmacarthur
Pyrite | Level 9

Agree this would be useful !

Patrick
Opal | Level 21

This "Output Data Step View" as shown by @FriedEgg looks interesting especially should it also work for Viya in-cas processing (=a view for loading data into CAS).

Reducing I/O is normally driven by performance requirements but it looks like the current implementation of these output views won't help with that (see log below).

As long as a table fits into memory I'd be using a hash object for such a requirement. 

 

Code:

Spoiler
options fullstimer;
data work.cars(drop=_:);
	do _i=1 to _nobs;
		do _j=1 to 10000;
			set sashelp.cars nobs=_nobs point=_i;
			output;
		end;
	end;
	stop;
run;

/* option 1: output view */
data cars_asia(where=(origin='Asia'))
     cars_europe(where=(origin='Europe'))
     cars_usa(where=(origin='USA')) 
     / 
     view=split_sort;

  if 0 then set work.cars;
  set split_sort;
run;

proc sort data=work.cars out=split_sort; 
  by msrp;
run;

/* option 2A: Hash table */
data _null_;
	if 0 then set work.cars;
	dcl hash h1(dataset:'work.cars', multidata:'y', ordered:'y');
	h1.defineKey('origin');
	h1.defineData(all:'y');
	h1.defineDone();
	h1.output(dataset:"cars_asia(where=(origin='Asia'))");
	h1.output(dataset:"cars_asia(where=(origin='Europe'))");
	h1.output(dataset:"cars_asia(where=(origin='USA'))");
run;

/* option 2B: Hash table with dynamic output */
data _null_;
	dcl hash h1(dataset:'work.cars(obs=0)', multidata:'y', ordered:'y');
	h1.defineKey('origin');
	h1.defineData(all:'y');
	h1.defineDone();
	dcl hash h2(ordered:'y', multidata:'n');
	h2.defineKey('origin');
	h2.defineDone();
	dcl hiter hh2('h2');
	do _i=1 to _nobs;
		set work.cars nobs=_nobs point=_i;
		_rc=h1.add();
		_rc=h2.ref();
	end;
	
	_rc=hh2.first();
	do while(_rc=0);
		_rc=h1.output(dataset: cats('cars_',origin,'(where=(origin="',origin,'"))') );
		_rc=hh2.next();
	end;
	stop;
run;

Log:

Spoiler
 
 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 68         
 69         options fullstimer;
 70         data work.cars(drop=_:);
 71         do _i=1 to _nobs;
 72         do _j=1 to 10000;
 73         set sashelp.cars nobs=_nobs point=_i;
 74         output;
 75         end;
 76         end;
 77         stop;
 78         run;
 
 NOTE: The data set WORK.CARS has 4280000 observations and 15 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.86 seconds
       user cpu time       0.55 seconds
       system cpu time     0.31 seconds
       memory              1962.25k
       OS Memory           22948.00k
       Timestamp           02/02/2025 01:41:13 AM
       Step Count                        158  Switch Count  5
       Page Faults                       0
       Page Reclaims                     234
       Page Swaps                        0
       Voluntary Context Switches        24
       Involuntary Context Switches      9
       Block Input Operations            0
       Block Output Operations           1272584
       
 
 79         
 80         /* option 1: output view */
 81         data cars_asia(where=(origin='Asia'))
 82              cars_europe(where=(origin='Europe'))
 83              cars_usa(where=(origin='USA'))
 84              /
 85              view=split_sort;
 86         
 87           if 0 then set work.cars;
 88           set split_sort;
 89         run;
 
 NOTE: DATA STEP view saved on file WORK.SPLIT_SORT.
 NOTE: A stored DATA STEP view cannot run under a different operating system.
 WARNING: The definition of an output DATA step view is an experimental feature in this release and is not intended for use in the 
          development of production applications.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.01 seconds
       system cpu time     0.00 seconds
       memory              2415.43k
       OS Memory           23208.00k
       Timestamp           02/02/2025 01:41:13 AM
       Step Count                        159  Switch Count  2
       Page Faults                       0
       Page Reclaims                     305
       Page Swaps                        0
       Voluntary Context Switches        11
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           264
       
 
 90         
 91         proc sort data=work.cars out=split_sort;
 92           by msrp;
 93         run;
 
 NOTE: There were 4280000 observations read from the data set WORK.CARS.
 NOTE: The data set WORK.SPLIT_SORT has 4280000 observations and 15 variables.
 NOTE: View WORK.SPLIT_SORT.VIEW used (Total process time):
       real time           1:11.07
       user cpu time       27.64 seconds
       system cpu time     36.84 seconds
       memory              821505.54k
       OS Memory           845336.00k
       Timestamp           02/02/2025 01:42:24 AM
       Step Count                        160  Switch Count  4280007
       Page Faults                       0
       Page Reclaims                     54606
       Page Swaps                        0
       Voluntary Context Switches        8560583
       Involuntary Context Switches      318
       Block Input Operations            0
       Block Output Operations           1267232
       
 NOTE: The data set WORK.CARS_ASIA has 1580000 observations and 15 variables.
 NOTE: The data set WORK.CARS_EUROPE has 1230000 observations and 15 variables.
 NOTE: The data set WORK.CARS_USA has 1470000 observations and 15 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           1:11.13
       user cpu time       27.64 seconds
       system cpu time     36.89 seconds
       memory              821505.54k
       OS Memory           845336.00k
       Timestamp           02/02/2025 01:42:24 AM
       Step Count                        160  Switch Count  4280010
       Page Faults                       0
       Page Reclaims                     54696
       Page Swaps                        0
       Voluntary Context Switches        8560635
       Involuntary Context Switches      328
       Block Input Operations            0
       Block Output Operations           1273920
       
 
 94         
 95         /* option 2A: Hash table */
 96         data _null_;
 97         if 0 then set work.cars;
 98         dcl hash h1(dataset:'work.cars', multidata:'y', ordered:'y');
 99         h1.defineKey('origin');
 100        h1.defineData(all:'y');
 101        h1.defineDone();
 102        h1.output(dataset:"cars_asia(where=(origin='Asia'))");
 103        h1.output(dataset:"cars_asia(where=(origin='Europe'))");
 104        h1.output(dataset:"cars_asia(where=(origin='USA'))");
 105        run;
 
 NOTE: There were 4280000 observations read from the data set WORK.CARS.
 NOTE: The data set WORK.CARS_ASIA has 1580000 observations and 15 variables.
 NOTE: The data set WORK.CARS_ASIA has 1230000 observations and 15 variables.
 NOTE: The data set WORK.CARS_ASIA has 1470000 observations and 15 variables.
 NOTE: DATA STEP stopped due to looping.
 NOTE: DATA statement used (Total process time):
       real time           1.77 seconds
       user cpu time       1.18 seconds
       system cpu time     0.58 seconds
       memory              1000799.62k
       OS Memory           1023376.00k
       Timestamp           02/02/2025 01:42:26 AM
       Step Count                        161  Switch Count  9
       Page Faults                       0
       Page Reclaims                     7220
       Page Swaps                        0
       Voluntary Context Switches        46
       Involuntary Context Switches      21
       Block Input Operations            0
       Block Output Operations           1273120
       
 
 106        
 107        /* option 2B: Hash table with dynamic output */
 108        data _null_;
 109        dcl hash h1(dataset:'work.cars(obs=0)', multidata:'y', ordered:'y');
 110        h1.defineKey('origin');
 111        h1.defineData(all:'y');
 112        h1.defineDone();
 113        dcl hash h2(ordered:'y', multidata:'n');
 114        h2.defineKey('origin');
 115        h2.defineDone();
 116        dcl hiter hh2('h2');
 117        do _i=1 to _nobs;
 118        set work.cars nobs=_nobs point=_i;
 119        _rc=h1.add();
 120        _rc=h2.ref();
 121        end;
 122        
 123        _rc=hh2.first();
 124        do while(_rc=0);
 125        _rc=h1.output(dataset: cats('cars_',origin,'(where=(origin="',origin,'"))') );
 126        _rc=hh2.next();
 127        end;
 128        stop;
 129        run;
 
 NOTE: There were 0 observations read from the data set WORK.CARS.
 NOTE: The data set WORK.CARS_ASIA has 1580000 observations and 15 variables.
 NOTE: The data set WORK.CARS_EUROPE has 1230000 observations and 15 variables.
 NOTE: The data set WORK.CARS_USA has 1470000 observations and 15 variables.
 NOTE: DATA statement used (Total process time):
       real time           2.89 seconds
       user cpu time       2.26 seconds
       system cpu time     0.62 seconds
       memory              769862.90k
       OS Memory           792204.00k
       Timestamp           02/02/2025 01:42:29 AM
       Step Count                        162  Switch Count  17
       Page Faults                       0
       Page Reclaims                     7212
       Page Swaps                        0
       Voluntary Context Switches        71
       Involuntary Context Switches      16
       Block Input Operations            0
       Block Output Operations           1273112
       
 
 130        
 131        
 132        OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 142        

There aren't many use cases where splitting a table into multiple tables is the optimal design and for that reason I won't upvote the idea.