Allow PROC SORT to output multiple datasets - Page 2

mkeintz · ‎06-01-2021

Frequently a SAS programmer needs to both sort and divide a dataset. Typically these tasks are satisfied via a two stage program: (1) divide the dataset, (2) sort the parts. Something like:

data  _asia (where=(origin='Asia'))  cars_europe (where=(origin='Europe'))
      cars_usa (where=(origin='USA')) /view=_asia;
  set sashelp.cars;
run;
proc sort data=_asia out=cars_asia;
  by msrp;
run;
proc sort data=cars_europe;
  by msrp;
run;
proc sort data=cars_usa;
  by msrp;
run;

Enabling proc sort to do both of these tasks (with syntax such as below) would often significantly reduce input/output activity, and offer a bit of code simplification:

proc sort data=sashelp.cars
  out=cars_asia   (where=(origin='Asia'))
      cars_europe (where=(origin='Europe'))
      cars_usa    (where=(origin='USA'))    ;
  by msrp;
run;

Of course, the multiple output datasets need not be mutually exclusive like the example above.

mkeintz · ‎06-06-2021

I like the use of the output view, which I expected to replicate the proc sort modification I am requesting

But either my initial conjecture of saving disk i/o via multiple outputs from proc sort is wrong, or the use of the output view is not a good simulation of multiple proc sort outputs.

I tested using this data set:

data cars_times_10000;
  set sashelp.cars;
  length  blk $1000;  /*Fatten the record */
  retain blk '';
  output;
  do i=1 to 9999;
    msrp=msrp+1;
    output;
  end;
run;

The program below (using my initial "current best practice") took a total of

Real Time=14.2 seconds, User CPU time=5.9 seconds, System CPU time=13.5 seconds

data va (where=(origin='Asia'))
     e (where=(origin='Europe'))
	 u (where=(origin='USA'))
  /view=va;
  set cars_times_10000;
run;

proc sort data=va out=a; by msrp descending i;run;

proc sort data=e  out=e; by msrp descending i;run;

proc sort data=u  out=u; by msrp descending i;run;

And below is the code using an output view. It takes about five times as long, with:

Real Time=75.2 seconds, User CPU time=43.3 seconds, System CPU time=35.6 seconds

data a(where=(origin='Asia'))
     e(where=(origin='Europe'))
     u(where=(origin='USA')) 
     / 
     view=split_sort;

  if 0 then set cars_times_10000;
  set split_sort;
run;

proc sort data=cars_times_10000 out=split_sort; 
  by msrp descending i;
run;

jimbarbour · ‎06-18-2021

@mkeintz,

I ran your code just to see what it would do on my SAS 9.4 M6 Windows 64 bit system. It was about 30 seconds total real time for the "three sort" method. The "triple view, one sort" method is a little harder to interpret. There are not one but two "real times" listed for the Sort step. I think one is preliminary and the other final. In other words, I think it was about 1 minute, 9 seconds total (not 2 minutes, 18 seconds). I suppose the first "real time" is to read in the data (execute the view) and the second real time adds in the sort. The Sort appears to be lightning fast, but the up front cost (executing the view) is more than double the time the "three sort" method took.

NOTE: There were 4280000 observations read from the data set WORK.CARS_TIMES_10000.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.SPLIT_SORT has 4280000 observations and 17 variables.
NOTE: View WORK.SPLIT_SORT.VIEW used (Total process time):
      real time           0:01:09.01
      cpu time            0:01:11.95
      Timestamp           June 18, 2021 17:11:50
      
NOTE: The data set WORK.A has 1580000 observations and 17 variables.
NOTE: Compressing data set WORK.A decreased size by 90.71 percent. 
      Compressed is 2622 pages; un-compressed would require 28215 pages.
NOTE: The data set WORK.E has 1230000 observations and 17 variables.
NOTE: Compressing data set WORK.E decreased size by 90.55 percent. 
      Compressed is 2076 pages; un-compressed would require 21965 pages.
NOTE: The data set WORK.U has 1470000 observations and 17 variables.
NOTE: Compressing data set WORK.U decreased size by 90.59 percent. 
      Compressed is 2470 pages; un-compressed would require 26251 pages.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0:01:09.12
      cpu time            0:01:12.04
      Timestamp           June 18, 2021 17:11:51

Maybe when it moves beyond experimental, the run times will be better than the "three sort" method. @FriedEgg, any "ballpark" projection when this might go to full Production? I assume that it would be for SAS 9.4 not Viya, perhaps that's not a valid assumption. I'm not sure further releases of 9.4 are forthcoming given SAS's emphasis on Viya.

Jim

ChrisNZ · ‎07-25-2021

Views incur a high performance hit, and by reading this it looks like output views are even worse.

Also, note that since the views use a data step, they have the disadvantage of not setting the SORTEDBY flag, which has massive performance consequences as SAS does not know the table is sorted and sorts it again is needed. This is a high priority issue for me: a small change in the data step output logic yields huge benefits.

PhilC · ‎08-26-2021

2 1/2 months later... so interesting. I considered this being like a SQL view, which sorts and is "piped" to a data step that splits.

proc sql;
  create view sorted_view as 
    select *
    from cars_times_10000
    order by msrp;

data a(where=(origin='Asia'))
     e(where=(origin='Europe'))
     u(where=(origin='USA')) 
    ;
 set sorted_view;
   by msrp;
run;

So I'm so embarrassed now that my system is SO slow. I benched marked this against Keintz's and this performs 50% longer than his 'best practice" on my system.

This performs equivalent to Keintz's best practice. I have a new and increased faith in PROC SQL's optimizations.

proc sql;
  create view sorted_view as 
    select *
    from cars_times_10000
    order by msrp;
  create table a3 as 
    select *
      from sorted_view
      where origin='Asia';
  create table e3 as 
    select *
      from sorted_view
      where origin='Europe';
  create table u3 as 
    select *
      from sorted_view
      where origin='USA';
quit;

mkeintz · ‎08-26-2021

@PhilC Your code has

proc sql;
  create view sorted_view as 
    select *
    from cars_times_10000
    order by msrp;

*** followed by other code ***;

But it sorts only by MSRP. My sample sorted by MSRP descending i. And since the data started out with ascending i=1 to 10000 for each record, there is definitely more sort work to be done than in your example. I'd be interested to know whether the proc sql performance holds up if you use

order by msrp, descending i;

BTW, it's Keintz, not Kintz

PhilC · ‎08-26-2021

it did!

I didn't think that the descending or ascending made a difference. My first run of the descending case indicates that if it does make a difference, it's a small difference.

Please excuse my mistake, it's corrected.