Hi SAS experts,
I use the SAS Studio background submit feature to run my code, and here is the log:
402 proc sql;
403 create table xing.outp_serv_sort as
404 select *
405 from xing.outp_serv
406 order by enrolid, svcdate;
ERROR: Sort execution failure.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
407
408 /*remove duplicates w/ the same enrolid and svcdate*/
409 create table xing.outp_serv_noduprecs1 as
410 select distinct enrolid, svcdate
411 from xing.outp_serv_sort
412 order by enrolid, svcdate;
NOTE: Statement not executed due to NOEXEC option.
413
414 /*remove duplicates w/ the same enrolid, svcdate, and procgrp*/
415 create table xing.outp_serv_noduprecs2 as
416 select distinct enrolid, svcdate, procgrp
417 from xing.outp_serv_sort
418 order by enrolid, svcdate, procgrp;
NOTE: Statement not executed due to NOEXEC option.
419 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 1:39.93
cpu time 38.82 seconds
It says the sort execution failed (lines 402 to 406). The table xing.outp_serv has more than 5 billion observations. Is it failing because the table is too big to be sorted? How can I fix it?
Thanks!
Please check this post and the reply (https://communities.sas.com/t5/SAS-Procedures/ERROR-Sort-execution-failure-in-PROC-SQL/m-p/262581#M5...). I am quoting from the link referred to in that post:
"
The SAS® SORT and SQL procedures (PROC SORT and PROC SQL) might fail with these errors:
Common reasons for these errors include the following:
"
@Xing :
Your system resources are obviously inadequate for sorting the entire huge file xing.outp_serv, so SQL fails on the first query.
However, for the apparent results you are seeking - to wit, getting the sets of the unique values of the composite keys (enrolid, svcdate) and (enrolid, svcdate, procgrp) - your program does a lot of unnecessary work, and this is exactly where it bombs.
You need only 2 and 3 variables in the end, yet you start by sorting the entire file with all of its (who knows how many) variables. Truth be told, you don't need query #1 at all: queries #2 and #3 can run directly against the original xing.outp_serv file just fine by themselves. Furthermore, you can drop the ORDER BY clauses in them, since the outputs will already be in the required order as a by-product of the internal algorithm that fulfills the DISTINCT request. Also, it's much more efficient to execute query #3 first and then run query #2 against its result set. In other words, this should work:
proc sql ;
  /* query #3: unique (enrolid, svcdate, procgrp) triples */
  create table xing.outp_serv_noduprecs2 as
    select distinct enrolid, svcdate, procgrp
    from xing.outp_serv
  ;
  /* query #2: unique (enrolid, svcdate) pairs, taken from the reduced set */
  create table xing.outp_serv_noduprecs1 as
    select distinct enrolid, svcdate
    from xing.outp_serv_noduprecs2
  ;
quit ;
Kind regards
Paul D.
Also, PROC SORT is more efficient at removing duplicates. A quick test:
data T;   /* 25M rows, 250,000 distinct (I, J) pairs */
  do I=1 to 5e2;
    do J=1 to 5e2;
      do K=1 to 1e2;
        output;
      end;
    end;
  end;
run;
proc sort data=T(keep=I J) out=T1 nodupkey noequals; by I J; run;
proc sql; create table T2 as select unique I, J from T order by I, J; quit;
NOTE: PROCEDURE SORT used (Total process time):
real time 7.64 seconds
user cpu time 11.01 seconds
system cpu time 1.82 seconds
NOTE: PROCEDURE SQL used (Total process time):
real time 10.50 seconds
user cpu time 12.26 seconds
system cpu time 4.04 seconds
@ChrisNZ :
NOEQUALS rules. But with keys like these, key-indexing will blow the doors off everything else:
data T1 ;
  array h [250500] _temporary_ ;   /* one slot per possible key: max I*5e2+J = 250500 */
  set T ;
  _n_ = I * 5e2 + J ;              /* map (I, J) to a unique array index */
  if not h[_n_] ;                  /* keep only the first occurrence of each key */
  h[_n_] = 1 ;
run ;
The much more versatile (i.e. not relying on key properties) hash object is no slouch, either:
data T1 ;
  if _n_ = 1 then do ;
    dcl hash h (hashexp:20) ;      /* 2**20 hash buckets */
    h.definekey ("I", "J") ;
    h.definedone () ;
  end ;
  set T ;
  if h.check() ne 0 ;              /* keep the row only if its key is new */
  h.add() ;
run ;
Also, in my tests both SORT and SQL grab over 1G of RAM to get the job done, while key-indexing and the hash object use 2M and 32M, respectively. I haven't tried SORT with SORTSIZE (or SQL with REALMEMSIZE) limited to 32M, but I'll bet that with that limit and the resulting swapping, both would drag their feet for a whole lot longer.
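For what it's worth, the SORT side of that experiment could be set up as below (a sketch: SORTSIZE= can be set right on the PROC SORT statement, whereas REALMEMSIZE can only be set at SAS invocation):
/* cap the sort's memory at 32M for this step only */
proc sort data=T(keep=I J) out=T1 nodupkey noequals sortsize=32M;
  by I J;
run;
/* REALMEMSIZE would have to go on the command line or into
   the config file at start-up, e.g. -realmemsize 32M       */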
Kind regards
Paul D.
I just ran a test of all four methods, and got this log:
36 proc sort data=T(keep=I J) out=T1 nodupkey noequals; by I J; run;
NOTE: 24750000 observations with duplicate key values were deleted.
NOTE: The data set WORK.T1 has 250000 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 5.21 seconds
user cpu time 2.93 seconds
system cpu time 0.08 seconds
memory 193709.28k
OS Memory 217440.00k
Timestamp 04.07.2019 09:10:15 a.m.
Step Count 12 Switch Count 11
Page Faults 0
Page Reclaims 50380
Page Swaps 0
Voluntary Context Switches 602
Involuntary Context Switches 918
Block Input Operations 0
Block Output Operations 0
37
38 proc sql;
38 ! create table T2 as select unique I, J from T order by I, J;
NOTE: Table WORK.T2 created, with 250000 rows and 2 columns.
38 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 15.53 seconds
user cpu time 5.09 seconds
system cpu time 0.67 seconds
memory 193332.84k
OS Memory 216380.00k
Timestamp 04.07.2019 09:10:31 a.m.
Step Count 13 Switch Count 11
Page Faults 0
Page Reclaims 50249
Page Swaps 0
Voluntary Context Switches 684
Involuntary Context Switches 1897
Block Input Operations 0
Block Output Operations 0
39
40 data T3 ;
41 array h [250500] _temporary_ ;
42 set T ;
43 _n_ = I * 5e2 + J ;
44 if not h[_n_] ;
45 h[_n_] = 1 ;
46 run ;
NOTE: There were 25000000 observations read from the data set WORK.T.
NOTE: The data set WORK.T3 has 250000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 4.97 seconds
user cpu time 1.37 seconds
system cpu time 0.07 seconds
memory 2602.43k
OS Memory 26704.00k
Timestamp 04.07.2019 09:10:36 a.m.
Step Count 14 Switch Count 2
Page Faults 0
Page Reclaims 495
Page Swaps 0
Voluntary Context Switches 11
Involuntary Context Switches 326
Block Input Operations 0
Block Output Operations 0
47
48 data T4 ;
49 if _n_ = 1 then do ;
50 dcl hash h (hashexp:20) ;
51 h.definekey ("I", "J") ;
52 h.definedone () ;
53 end ;
54 set T ;
55 if h.check() ne 0 ;
56 h.add() ;
57 run ;
NOTE: There were 25000000 observations read from the data set WORK.T.
NOTE: The data set WORK.T4 has 250000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 13.26 seconds
user cpu time 3.61 seconds
system cpu time 0.07 seconds
memory 33555.28k
OS Memory 57588.00k
Timestamp 04.07.2019 09:10:49 a.m.
Step Count 15 Switch Count 26
Page Faults 2
Page Reclaims 7709
Page Swaps 0
Voluntary Context Switches 81
Involuntary Context Switches 1316
Block Input Operations 0
Block Output Operations 0
(performed on a pSeries with two POWER7 (or -8) cores, and a solid SAN for storage; also there's a severe MEMSIZE limit to enable lots of parallel SAS processes without paging)
SORT and the array method are close, strongly beating both the hash and the SQL.
Since the SORT has to make the least assumptions (read: none) about expected size, and needs the simplest code, it's still the preferred tool.
Bottom line (for me): if you need a certain order, PROC SORT rules the roost.
Apparently, much depends on the system setup. I ran my tests on the very laptop I'm typing this on: a very old ThinkPad W520 (i7 at 2.4 GHz, 8G RAM, 500G SSD) running SAS 9.4 TS1M4 under X64_7PRO, with MEMSIZE=REALMEMSIZE=SORTSIZE=4G. My average figures over 10 runs (real time in seconds / memory in MB):
Kind regards
Paul D.
@hashman That is a nice, succinct but very useful table.
Hash is not faster than proc sort for me, but uses less CPU (and much less memory).
/* 1) filter rows as they stream past the hash */
data T1 ;
  if _n_ = 1 then do ;
    dcl hash h (hashexp:20) ;
    h.definekey ("I", "J") ;
    h.definedone () ;
  end ;
  set T ;
  if h.check() ne 0 ;
  h.add() ;
run ;
/* 2) load the hash straight from the dataset, then dump it */
data _null_ ;
  call missing(I, J) ;
  dcl hash h (dataset:'T', hashexp:20, ordered:'a') ;
  h.definekey ("I", "J") ;
  h.definedone () ;
  h.output(dataset:'T1') ;
run ;
NOTE: DATA statement used (Total process time):
real time 11.23 seconds
user cpu time 8.84 seconds
system cpu time 1.56 seconds
NOTE: DATA statement used (Total process time):
real time 9.50 seconds
user cpu time 7.75 seconds
system cpu time 1.25 seconds
Hi, I used the code you provided and the log says there is insufficient space in WORK. Any further suggestions?
73 proc sql;
74 /*remove duplicates w/ the same enrolid, svcdate, and procgrp*/
75 create table xing.outp_serv_noduprecs2 as
76 select distinct enrolid, svcdate, procgrp
77 from xing.outp_serv;
ERROR: Insufficient space in file WORK.'SASTMP-000000018'n.UTILITY.
ERROR: File WORK.'SASTMP-000000018'n.UTILITY is damaged. I/O processing did not complete.
NOTE: Error was encountered during utility-file processing. You may be able to execute the SQL statement successfully if you
allocate more space to the WORK library.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
78
79 /*remove duplicates w/ the same enrolid and svcdate*/
80 create table xing.outp_serv_noduprecs1 as
81 select distinct enrolid, svcdate
82 from xing.outp_serv;
NOTE: Statement not executed due to NOEXEC option.
83 quit;
You have run out of space in your SAS WORK library. Assuming you are using a remote SAS server, talk to your SAS administrator regarding an increase in SAS WORK space.
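You can check where those files currently live, and, if your site allows it and you can restart your session, redirect them to a bigger volume at invocation. A sketch, with hypothetical paths:
/* check the current WORK and utility-file locations */
proc options option=(work utilloc) ;
run ;
/* relocating them is an invocation-time change, e.g. (paths are examples):
   sas -work /bigdisk/saswork -utilloc /bigdisk/sasutil
   UTILLOC= controls where PROC SORT and PROC SQL utility files go.        */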
Use a hash table as shown, if the results suit your needs.
The memory requirements are much, much lower.
@Xing :
First, I'd heed the advice given by @SASKiwi, but I'd also urge you to clean up your WORK library before you resubmit the job, just in case you have too much leftover junk there, by running, for example:
proc delete data = work._all_ ;
run ;
If the job still fails and your admin isn't responsive enough to your needs, you'll have to get craftier, as many old-timers had to back when trying to get more utility or sort DASD space was akin to starting a small war.
Second, the execution algorithm behind SQL is obviously I/O bound; hence you can try to move the burden to memory, provided, of course, that you have enough of it. In other words, you can try using a hash instead, e.g.:
data xing.outp_serv_noduprecs2 ;
  if _n_ = 1 then do ;
    dcl hash h () ;
    h.definekey ("enrolid", "svcdate", "procgrp") ;
    h.definedata ("enrolid", "svcdate", "procgrp") ;
    h.definedone() ;
  end ;
  set xing.outp_serv (keep = enrolid svcdate procgrp) ;
  if h.check() ne 0 ;   /* write the row only if its key is new */
  h.add() ;
run ;
This code will preserve the original relative order of the input records. If you want the output ordered, run instead:
data _null_ ;
  if _n_ = 1 then do ;
    dcl hash h (ordered:"A") ;
    h.definekey ("enrolid", "svcdate", "procgrp") ;
    h.definedata ("enrolid", "svcdate", "procgrp") ;
    h.definedone() ;
  end ;
  set xing.outp_serv (keep = enrolid svcdate procgrp) end = z ;
  h.ref() ;   /* add the key only if it is not in the table yet */
  if z then h.output (dataset:"xing.outp_serv_noduprecs2") ;
run ;
By the nature of the algorithm, memory usage for both variants above will be the same since in both, at the end the hash table contains all unique tuples of (enrolid, svcdate, procgrp).
If you discover that it works (i.e. you have more than enough memory for the task), you can then use the output to create the second file, xing.outp_serv_noduprecs1, using either your second SQL query or, alternatively, the hash object again, for instance:
data _null_ ;
  if _n_ = 1 then do ;
    dcl hash h (ordered:"A") ;
    h.definekey ("enrolid", "svcdate") ;
    h.definedata ("enrolid", "svcdate") ;
    h.definedone() ;
  end ;
  set xing.outp_serv_noduprecs2 (keep = enrolid svcdate procgrp) end = z ;
  h.ref() ;
  if z then h.output (dataset:"xing.outp_serv_noduprecs1") ;
run ;
In fact, if your SAS session is awash with RAM, so that it can hold both hash tables at once, both files can be created in a single pass through the input data:
data _null_ ;
  dcl hash h1 (ordered:"A") ;
  h1.definekey ("enrolid", "svcdate") ;
  h1.definedata ("enrolid", "svcdate") ;
  h1.definedone() ;
  dcl hash h2 (ordered:"A") ;
  h2.definekey ("enrolid", "svcdate", "procgrp") ;
  h2.definedata ("enrolid", "svcdate", "procgrp") ;
  h2.definedone() ;
  do until (z) ;
    set xing.outp_serv (keep = enrolid svcdate procgrp) end = z ;
    h1.ref() ;
    h2.ref() ;
  end ;
  h1.output (dataset:"xing.outp_serv_noduprecs1") ;
  h2.output (dataset:"xing.outp_serv_noduprecs2") ;
run ;
Should all of the above still fail, you'd have to get still craftier and split your input into N key-independent subgroups, sacrificing some efficiency by making N passes through the input data. This virtually guarantees that, if N is high enough, your job will eventually succeed. Since the topic is a bit too voluminous for a reply here, I'd merely suggest that you read this paper (a bare-bones sketch of the splitting idea follows the link):
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1755-2018.pdf
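To make the idea concrete, here is a minimal sketch of one such scheme, not taken from the paper: the macro name and the pass count N=4 are illustrative, and it assumes enrolid is numeric and every remainder class is non-empty. Each pass keeps only the rows whose enrolid falls into one remainder class, so the hash table never holds more than roughly 1/N of the unique keys at a time.
%macro dedupe_in_passes (n=4) ;   /* hypothetical helper; pick N to fit memory */
  %local p ;
  %do p = 0 %to %eval(&n - 1) ;
    data _null_ ;
      if _n_ = 1 then do ;
        dcl hash h (ordered:"A") ;
        h.definekey ("enrolid", "svcdate", "procgrp") ;
        h.definedata ("enrolid", "svcdate", "procgrp") ;
        h.definedone () ;
      end ;
      set xing.outp_serv (keep = enrolid svcdate procgrp) end = z ;
      where mod (enrolid, &n) = &p ;   /* this pass's key-independent subgroup */
      h.ref() ;                        /* add the key only if it is new */
      if z then h.output (dataset:"work.pass&p") ;
    run ;
  %end ;
  /* the subgroups are key-disjoint, so concatenating them loses nothing */
  data xing.outp_serv_noduprecs2 ;
    set work.pass0 - work.pass%eval(&n - 1) ;
  run ;
%mend ;
%dedupe_in_passes (n=4)
Note that the concatenated result is ordered within each pass but not globally; if you need the final order, one last PROC SORT over the much smaller deduplicated file should be cheap.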
Kind regards
Paul D.
@Xing I have a suggestion along similar lines to the responses given by other members.
Your first SQL step is likely the most costly, and I see that you are feeding one SQL step into the other. If you are not using the first SQL block anywhere else, then I suggest you start with the second (or maybe even the third and last) SQL block and defer the ORDER BY to the final block. All you want in the end is the distinct values of enrolid, svcdate, and procgrp, sorted by the same variables. DISTINCT is applied before ORDER BY, so you will have far fewer rows to sort.
In summary, defer the ORDER BY as late as possible.
I might be wrong here, but it's worth a try; a sketch follows.
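For instance, a minimal sketch of that order of operations (essentially the same shape as hashman's SQL above, with the ORDER BY deferred to the small final table):
proc sql ;
  /* dedupe on the full key first; no ORDER BY on the big table */
  create table xing.outp_serv_noduprecs2 as
    select distinct enrolid, svcdate, procgrp
    from xing.outp_serv ;
  /* sort only the much smaller deduplicated set */
  create table xing.outp_serv_noduprecs1 as
    select distinct enrolid, svcdate
    from xing.outp_serv_noduprecs2
    order by enrolid, svcdate ;
quit ;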