Hi SAS experts,
I use the SAS Studio background submit feature to run my code, and here is the log:
402 proc sql;
403 create table xing.outp_serv_sort as
404 select *
405 from xing.outp_serv
406 order by enrolid, svcdate;
ERROR: Sort execution failure.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
407
408 /*remove duplicates w/ the same enrolid and svcdate*/
409 create table xing.outp_serv_noduprecs1 as
410 select distinct enrolid, svcdate
411 from xing.outp_serv_sort
412 order by enrolid, svcdate;
NOTE: Statement not executed due to NOEXEC option.
413
414 /*remove duplicates w/ the same enrolid, svcdate, and procgrp*/
415 create table xing.outp_serv_noduprecs2 as
416 select distinct enrolid, svcdate, procgrp
417 from xing.outp_serv_sort
418 order by enrolid, svcdate, procgrp;
NOTE: Statement not executed due to NOEXEC option.
419 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 1:39.93
cpu time 38.82 seconds
It says the sort execution failed (lines 402 to 406). The table xing.outp_serv has more than 5 billion observations. Is it failing because the table is too big to be sorted? How can I fix it?
Thanks!
Please check this post and the reply (https://communities.sas.com/t5/SAS-Procedures/ERROR-Sort-execution-failure-in-PROC-SQL/m-p/262581#M5...). I am quoting from the link referred to in that post:
"
The SAS® SORT and SQL procedures (PROC SORT and PROC SQL) might fail with these errors:
Common reasons for these errors include the following:
"
@Xing :
Your system resources are obviously inadequate for sorting the entire huge file xing.outp_serv, so SQL fails on the first query.
However, for the apparent results you are seeking - to wit, getting the sets of the unique values of the composite keys (enrolid, svcdate) and (enrolid, svcdate, procgrp) - your program does a lot of unnecessary work, and this is exactly where it bombs.
You need only 2 and 3 variables in the end, yet you start by sorting the entire file with all of its (who knows how many) variables. Truth be told, you don't need query #1 at all: queries #2 and #3 can run directly against the original xing.outp_serv file just fine by themselves. Furthermore, you can drop the ORDER BY clauses in them, since the outputs will already be in the required order as a by-product of the internal algorithm that fulfills the DISTINCT request. Also, it's much more efficient to execute query #3 first and then run query #2 against its result set. In other words, this should work:
proc sql ;
  /* query #3: unique (enrolid, svcdate, procgrp) triples */
  create table xing.outp_serv_noduprecs2 as
    select distinct enrolid, svcdate, procgrp
    from xing.outp_serv
  ;
  /* query #2: unique (enrolid, svcdate) pairs, taken from the reduced set */
  create table xing.outp_serv_noduprecs1 as
    select distinct enrolid, svcdate
    from xing.outp_serv_noduprecs2
  ;
quit ;
Kind regards
Paul D.
Also, PROC SORT is more efficient at removing duplicates. A quick test:
data T;   /* 25M rows, 250,000 distinct (I, J) pairs */
  do I=1 to 5e2;
    do J=1 to 5e2;
      do K=1 to 1e2;
        output;
      end;
    end;
  end;
run;
proc sort data=T(keep=I J) out=T1 nodupkey noequals; by I J; run;
proc sql; create table T2 as select unique I, J from T order by I, J; quit;
NOTE: PROCEDURE SORT used (Total process time):
real time 7.64 seconds
user cpu time 11.01 seconds
system cpu time 1.82 seconds
NOTE: PROCEDURE SQL used (Total process time):
real time 10.50 seconds
user cpu time 12.26 seconds
system cpu time 4.04 seconds
@ChrisNZ :
NOEQUALS rules. But with keys like these, key-indexing will blow the doors off everything else:
data T1 ;
  array h [250500] _temporary_ ;   /* one slot per possible key: max I*5e2+J = 250500 */
  set T ;
  _n_ = I * 5e2 + J ;              /* map (I, J) to a unique array index */
  if not h[_n_] ;                  /* keep only the first occurrence of each key */
  h[_n_] = 1 ;
run ;
The much more versatile (i.e. not relying on key properties) hash object is no slouch, either:
data T1 ;
  if _n_ = 1 then do ;
    dcl hash h (hashexp:20) ;      /* 2**20 hash buckets */
    h.definekey ("I", "J") ;
    h.definedone () ;
  end ;
  set T ;
  if h.check() ne 0 ;              /* keep the row only if its key is new */
  h.add() ;
run ;
Also, in my tests both SORT and SQL grab over 1G of RAM to get the job done, while key-indexing and the hash object use 2M and 32M, respectively. I haven't tried SORT with SORTSIZE (or SQL with REALMEMSIZE) limited to 32M, but I'll bet that with that limit and the resulting swapping, both would drag their feet for a whole lot longer.
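For what it's worth, the SORT side of that experiment could be set up as below (a sketch: SORTSIZE= can be set right on the PROC SORT statement, whereas REALMEMSIZE can only be set at SAS invocation):
/* cap the sort's memory at 32M for this step only */
proc sort data=T(keep=I J) out=T1 nodupkey noequals sortsize=32M;
  by I J;
run;
/* REALMEMSIZE would have to go on the command line or into
   the config file at start-up, e.g. -realmemsize 32M       */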
Kind regards
Paul D.
I just ran a test of all four methods, and got this log:
36 proc sort data=T(keep=I J) out=T1 nodupkey noequals; by I J; run;
NOTE: 24750000 observations with duplicate key values were deleted.
NOTE: The data set WORK.T1 has 250000 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 5.21 seconds
user cpu time 2.93 seconds
system cpu time 0.08 seconds
memory 193709.28k
OS Memory 217440.00k
Timestamp 04.07.2019 09:10:15 a.m.
Step Count 12 Switch Count 11
Page Faults 0
Page Reclaims 50380
Page Swaps 0
Voluntary Context Switches 602
Involuntary Context Switches 918
Block Input Operations 0
Block Output Operations 0
37
38 proc sql;
38 ! create table T2 as select unique I, J from T order by I, J;
NOTE: Table WORK.T2 created, with 250000 rows and 2 columns.
38 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 15.53 seconds
user cpu time 5.09 seconds
system cpu time 0.67 seconds
memory 193332.84k
OS Memory 216380.00k
Timestamp 04.07.2019 09:10:31 a.m.
Step Count 13 Switch Count 11
Page Faults 0
Page Reclaims 50249
Page Swaps 0
Voluntary Context Switches 684
Involuntary Context Switches 1897
Block Input Operations 0
Block Output Operations 0
39
40 data T3 ;
41 array h [250500] _temporary_ ;
42 set T ;
43 _n_ = I * 5e2 + J ;
44 if not h[_n_] ;
45 h[_n_] = 1 ;
46 run ;
NOTE: There were 25000000 observations read from the data set WORK.T.
NOTE: The data set WORK.T3 has 250000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 4.97 seconds
user cpu time 1.37 seconds
system cpu time 0.07 seconds
memory 2602.43k
OS Memory 26704.00k
Timestamp 04.07.2019 09:10:36 a.m.
Step Count 14 Switch Count 2
Page Faults 0
Page Reclaims 495
Page Swaps 0
Voluntary Context Switches 11
Involuntary Context Switches 326
Block Input Operations 0
Block Output Operations 0
47
48 data T4 ;
49 if _n_ = 1 then do ;
50 dcl hash h (hashexp:20) ;
51 h.definekey ("I", "J") ;
52 h.definedone () ;
53 end ;
54 set T ;
55 if h.check() ne 0 ;
56 h.add() ;
57 run ;
NOTE: There were 25000000 observations read from the data set WORK.T.
NOTE: The data set WORK.T4 has 250000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 13.26 seconds
user cpu time 3.61 seconds
system cpu time 0.07 seconds
memory 33555.28k
OS Memory 57588.00k
Timestamp 04.07.2019 09:10:49 a.m.
Step Count 15 Switch Count 26
Page Faults 2
Page Reclaims 7709
Page Swaps 0
Voluntary Context Switches 81
Involuntary Context Switches 1316
Block Input Operations 0
Block Output Operations 0
(performed on a pSeries with two POWER7 (or -8) cores, and a solid SAN for storage; also there's a severe MEMSIZE limit to enable lots of parallel SAS processes without paging)
SORT and the array method are close, strongly beating both the hash and the SQL.
Since the SORT has to make the least assumptions (read: none) about expected size, and needs the simplest code, it's still the preferred tool.
Bottom line (for me): if you need a certain order, PROC SORT rules the roost.
Apparently, much depends on the system setup. I ran my tests on the very laptop I'm typing this on: a very old ThinkPad W520 (i7 at 2.4 GHz, 8G RAM, 500G SSD) running SAS 9.4 TS1M4 under X64_7PRO, with MEMSIZE=REALMEMSIZE=SORTSIZE=4G. My average figures over 10 runs (real time in seconds / memory in MB):
Kind regards
Paul D.
@hashman That is a nice, succinct but very useful table.
Hash is not faster than proc sort for me, but uses less CPU (and much less memory).
/* 1) filter rows as they stream past the hash */
data T1 ;
  if _n_ = 1 then do ;
    dcl hash h (hashexp:20) ;
    h.definekey ("I", "J") ;
    h.definedone () ;
  end ;
  set T ;
  if h.check() ne 0 ;
  h.add() ;
run ;
/* 2) load the hash straight from the dataset, then dump it */
data _null_ ;
  call missing(I, J) ;
  dcl hash h (dataset:'T', hashexp:20, ordered:'a') ;
  h.definekey ("I", "J") ;
  h.definedone () ;
  h.output(dataset:'T1') ;
run ;
NOTE: DATA statement used (Total process time):
real time 11.23 seconds
user cpu time 8.84 seconds
system cpu time 1.56 seconds
NOTE: DATA statement used (Total process time):
real time 9.50 seconds
user cpu time 7.75 seconds
system cpu time 1.25 seconds
Hi, I used the code you provided and the log says there is insufficient space in WORK. Any further suggestions?
73 proc sql;
74 /*remove duplicates w/ the same enrolid, svcdate, and procgrp*/
75 create table xing.outp_serv_noduprecs2 as
76 select distinct enrolid, svcdate, procgrp
77 from xing.outp_serv;
ERROR: Insufficient space in file WORK.'SASTMP-000000018'n.UTILITY.
ERROR: File WORK.'SASTMP-000000018'n.UTILITY is damaged. I/O processing did not complete.
NOTE: Error was encountered during utility-file processing. You may be able to execute the SQL statement successfully if you
allocate more space to the WORK library.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
78
79 /*remove duplicates w/ the same enrolid and svcdate*/
80 create table xing.outp_serv_noduprecs1 as
81 select distinct enrolid, svcdate
82 from xing.outp_serv;
NOTE: Statement not executed due to NOEXEC option.
83 quit;
You have run out of space in your SAS WORK library. Assuming you are using a remote SAS server, talk to your SAS administrator regarding an increase in SAS WORK space.
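You can check where those files currently live, and, if your site allows it and you can restart your session, redirect them to a bigger volume at invocation. A sketch, with hypothetical paths:
/* check the current WORK and utility-file locations */
proc options option=(work utilloc) ;
run ;
/* relocating them is an invocation-time change, e.g. (paths are examples):
   sas -work /bigdisk/saswork -utilloc /bigdisk/sasutil
   UTILLOC= controls where PROC SORT and PROC SQL utility files go.        */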
Use a hash table as shown, if the results suit your needs.
The memory requirements are much, much lower.
@Xing :
First, I'd heed the advice given by @SASKiwi, but I'd also urge you to clean up your WORK library before you resubmit the job, just in case you have too much leftover junk there, by running, for example:
proc delete data = work._all_ ;
run ;
If the job still fails and your admin isn't responsive enough to your needs, you'll have to get craftier, as many old-timers had to back when trying to get more utility or sort DASD space was akin to starting a small war.
Second, the execution algorithm behind SQL is obviously I/O bound; hence you can try to move the burden to memory, provided, of course, that you have enough of it. In other words, you can try using a hash instead, e.g.:
data xing.outp_serv_noduprecs2 ;
  if _n_ = 1 then do ;
    dcl hash h () ;
    h.definekey ("enrolid", "svcdate", "procgrp") ;
    h.definedata ("enrolid", "svcdate", "procgrp") ;
    h.definedone() ;
  end ;
  set xing.outp_serv (keep = enrolid svcdate procgrp) ;
  if h.check() ne 0 ;   /* write the row only if its key is new */
  h.add() ;
run ;
This code will preserve the original relative order of the input records. If you want the output ordered, run instead:
data _null_ ;
  if _n_ = 1 then do ;
    dcl hash h (ordered:"A") ;
    h.definekey ("enrolid", "svcdate", "procgrp") ;
    h.definedata ("enrolid", "svcdate", "procgrp") ;
    h.definedone() ;
  end ;
  set xing.outp_serv (keep = enrolid svcdate procgrp) end = z ;
  h.ref() ;   /* add the key only if it is not in the table yet */
  if z then h.output (dataset:"xing.outp_serv_noduprecs2") ;
run ;
By the nature of the algorithm, memory usage for both variants above will be the same since in both, at the end the hash table contains all unique tuples of (enrolid, svcdate, procgrp).
If you discover that it works (i.e. you have more than enough memory for the task), you can then use the output to create the second file, xing.outp_serv_noduprecs1, using either your second SQL query or, alternatively, the hash object again, for instance:
data _null_ ;
  if _n_ = 1 then do ;
    dcl hash h (ordered:"A") ;
    h.definekey ("enrolid", "svcdate") ;
    h.definedata ("enrolid", "svcdate") ;
    h.definedone() ;
  end ;
  set xing.outp_serv_noduprecs2 (keep = enrolid svcdate procgrp) end = z ;
  h.ref() ;
  if z then h.output (dataset:"xing.outp_serv_noduprecs1") ;
run ;
In fact, if your SAS session is awash with RAM, so that it can hold both hash tables at once, both files can be created in a single pass through the input data:
data _null_ ;
  dcl hash h1 (ordered:"A") ;
  h1.definekey ("enrolid", "svcdate") ;
  h1.definedata ("enrolid", "svcdate") ;
  h1.definedone() ;
  dcl hash h2 (ordered:"A") ;
  h2.definekey ("enrolid", "svcdate", "procgrp") ;
  h2.definedata ("enrolid", "svcdate", "procgrp") ;
  h2.definedone() ;
  do until (z) ;
    set xing.outp_serv (keep = enrolid svcdate procgrp) end = z ;
    h1.ref() ;
    h2.ref() ;
  end ;
  h1.output (dataset:"xing.outp_serv_noduprecs1") ;
  h2.output (dataset:"xing.outp_serv_noduprecs2") ;
run ;
Should all of the above still fail, you'd have to get still craftier and split your input into N key-independent subgroups, sacrificing some efficiency by making N passes through the input data. This virtually guarantees that, if N is high enough, your job will eventually succeed. Since the topic is a bit too voluminous for a reply here, I'd merely suggest that you read this paper (a bare-bones sketch of the splitting idea follows the link):
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1755-2018.pdf
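To make the idea concrete, here is a minimal sketch of one such scheme, not taken from the paper: the macro name and the pass count N=4 are illustrative, and it assumes enrolid is numeric and every remainder class is non-empty. Each pass keeps only the rows whose enrolid falls into one remainder class, so the hash table never holds more than roughly 1/N of the unique keys at a time.
%macro dedupe_in_passes (n=4) ;   /* hypothetical helper; pick N to fit memory */
  %local p ;
  %do p = 0 %to %eval(&n - 1) ;
    data _null_ ;
      if _n_ = 1 then do ;
        dcl hash h (ordered:"A") ;
        h.definekey ("enrolid", "svcdate", "procgrp") ;
        h.definedata ("enrolid", "svcdate", "procgrp") ;
        h.definedone () ;
      end ;
      set xing.outp_serv (keep = enrolid svcdate procgrp) end = z ;
      where mod (enrolid, &n) = &p ;   /* this pass's key-independent subgroup */
      h.ref() ;                        /* add the key only if it is new */
      if z then h.output (dataset:"work.pass&p") ;
    run ;
  %end ;
  /* the subgroups are key-disjoint, so concatenating them loses nothing */
  data xing.outp_serv_noduprecs2 ;
    set work.pass0 - work.pass%eval(&n - 1) ;
  run ;
%mend ;
%dedupe_in_passes (n=4)
Note that the concatenated result is ordered within each pass but not globally; if you need the final order, one last PROC SORT over the much smaller deduplicated file should be cheap.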
Kind regards
Paul D.
@Xing I have a suggestion along similar lines to the responses given by other members.
Your first SQL step is likely the most costly, and I see that you are feeding one SQL step into the other. If you are not using the first SQL block anywhere else, then I suggest you start with the second (or maybe even the third and last) SQL block and defer the ORDER BY to the final block. All you want in the end is the distinct values of enrolid, svcdate, and procgrp, sorted by the same variables. DISTINCT is applied before ORDER BY, so you will have far fewer rows to sort.
In summary, defer the ORDER BY as late as possible.
I might be wrong here, but it's worth a try; a sketch follows.
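For instance, a minimal sketch of that order of operations (essentially the same shape as hashman's SQL above, with the ORDER BY deferred to the small final table):
proc sql ;
  /* dedupe on the full key first; no ORDER BY on the big table */
  create table xing.outp_serv_noduprecs2 as
    select distinct enrolid, svcdate, procgrp
    from xing.outp_serv ;
  /* sort only the much smaller deduplicated set */
  create table xing.outp_serv_noduprecs1 as
    select distinct enrolid, svcdate
    from xing.outp_serv_noduprecs2
    order by enrolid, svcdate ;
quit ;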