topic Re: SQL sort execution failure with big data in SAS Programming

SQL sort execution failure with big data

bebess — Wed, 19 Jan 2022 10:26:52 GMT

Hi all ,

i am having an issue when aggregating result from a table using an SQL procedure that perform basic sum calculations and a group by .

i am having the issue with a big dataset 177GB ( 251 951 350 observations and 138 variables), below the error message :

NOTE: At least one nonessential grouping column reference has been removed from a GROUP BY's reference list.
WARNING: This CREATE TABLE statement recursively references the target table. A consequence of this is a possible data integrity
problem.
ERROR: Sort execution failure.

NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
MPRINT(AGGREGATE): QUIT;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 1:08:47.47
user cpu time 11:37.53
system cpu time 2:51.71
memory 1056479.00k
OS Memory 1076112.00k
Timestamp 01/18/2022 08:53:46 PM
Step Count 172 Switch Count 1

i know te first note refers to 16074 - "WARNING: At least one nonessential grouping column reference has been removed from a GROUP BY's reference list" (sas.com)

The other point is that i'm overwritting the table by using the same name.

i'm sure the SQL procedure is using more space than the total space of the table because i'm having 2TB of free space in my Work session and i've noticed a huge consumption of space,

i am using compress=Yes option and the UTILOC refers to my work drive

My questions are :

1. How much space the SQL procedure needs ( 4 times the space of the data ? )

2. is there an alternative to my query ( that will not over consume memory or disk space ) ?

proc sql;

create table xxx as

select "result" as id,

list_of_var,

sum(var_n) as var_new

from xxx

group by id, list_of_var

;

quit;

Re: SQL sort execution failure with big data

acordes — Wed, 19 Jan 2022 10:34:44 GMT

Do you have cas language avaible, do you have sas viya?

Re: SQL sort execution failure with big data

yabwon — Wed, 19 Jan 2022 11:16:21 GMT

Hi,

1) How many different "by groups" you have in that dataset? are they milions or maybe tousands?

2) A "quick try" could be to use Proc Sort on the data set with TAGSORT option and then use data step by-group processing to sum up your data.

Bart

Re: SQL sort execution failure with big data

bebess — Wed, 19 Jan 2022 11:19:48 GMT

No it's SAS Enterprise Guide 7.15 , servers hosted in Azure

Re: SQL sort execution failure with big data

bebess — Wed, 19 Jan 2022 11:26:13 GMT

1) i expect to have around 190 348 410 obs for the aggregated table.
2 ) will have a look on that 🙂

Re: SQL sort execution failure with big data

yabwon — Wed, 19 Jan 2022 11:48:25 GMT

Look at this video and article by @hashman and @DonH , it may give you some hints how to process your data set.

https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Uniform-Hashing-of-Arbitrary-Input-Into-Key-Exclusive-Segments/ta-p/735358

Bart

Re: SQL sort execution failure with big data

LinusH — Wed, 19 Jan 2022 11:56:22 GMT

Sorting - usually y9u need around 2-3 times the size of the table in your UTILLOC. Assuming it has the same location as your saswork?

2TB should be enough, have monitored during execution?

I can see that SQL "only" uses 1GB om RAM. If you have more available to you, maximize MEMSIZE and SORTSIZE options.

TAGSORT is resource effective, but it will take substantiable longer time than the default sort.

And I guess that you really need all input columns as output?

And the data is in a Base SAS library?

Consider using SPDE. It might not solve your current issue, but for large data sets it faster for many use cases. Then you should take a look at the SPDESORTSIZE option.

Re: SQL sort execution failure with big data

Kurt_Bremser — Wed, 19 Jan 2022 12:36:37 GMT

In an "ideal sort", you need three times the size of the table

original table
utility file
resulting table

If the original table is compressed (dataset or system option COMPRESS=yes), the utility file will be larger, sometimes MUCH larger (think of a compression ratio of 95%, the utility file will be 20 times as large). In this case, use the TAGSORT option of PROC SORT. Since this option is not available in SQL, you are better off running your summation in two steps:

Use PROC SORT with BY id list_of_var
Use PROC SUMMARY with the same BY

If there are more variables in your source dataset than those used in BY and for the summation, drop all other variables when sorting, and create an intermediary table:

proc sort
  data=xxx (keep=list_of_var var_n)
  out=yyy
  /* (compress=yes)  tagsort if long character variables exist */
;
by list_of_var;
run;

proc summary data=yyy;
by list_of_var;
var var_n;
output out=zzz sum()=;
run;

You also seem to have non-adequate storage or other CPU-consuming processes, recognizable here:

real time 1:08:47.47
user cpu time 11:37.53
system cpu time 2:51.71

Your real time is more than 4 times the CPU time, which points either to wait states caused by the storage, or contention for CPU resources.

Re: SQL sort execution failure with big data

bebess — Wed, 19 Jan 2022 13:02:42 GMT

only this job was running on the server and i've seen the error in this time .

i need 90% of columns from the table but as i 'm also created new calculated variables so at the end it's like i am having same number of columns but with less observations .

Yes it's a BASE SAS library , a basic SAS table

Re: SQL sort execution failure with big data

bebess — Wed, 19 Jan 2022 15:10:33 GMT

i am trying to use this technique in a subset of the original data , however i've the following issue

WARNING: In a call to the CATX function, the buffer allocated for the result was not long enough to contain the concatenation of
all the arguments. The correct result would contain 336 characters, but the actual result might either be truncated to 200
character(s) or be completely blank, depending on the calling environment. The following note indicates the left-most
argument that caused truncation.

i am using the macro below and having a lot of key variables instead of ID and Key.

Any idea how to support length ( greater than 200 ) ?

%macro agg (in=, out=, N_segments=) ;
%local X seg ;
%let X=1+mod(input(char(MD5(catx(":",ID,Key)),10),pib1.),&N_segments) ;
%do seg = 1 %to &N_segments ;
proc sql;
create table seg as
select ID, Key, sum(Var) as SUM, count (distinct Var) as UCOUNT
from &in where &X = &seg group ID, Key;
quit;
proc append base=&out data=seg; run;
%end ;
%mend ;
%agg (in=Trans, out=Agg, N_segments=3)

Re: SQL sort execution failure with big data

Kurt_Bremser — Wed, 19 Jan 2022 15:30:50 GMT

Maxim 1.

From the documentation of the CATX Function:

Length of Returned Variable: Special Cases

The CATX function returns a value to a variable or returns a value in a temporary buffer. The value that is returned from the CATX function has one of these lengths:

up to 200 characters in WHERE clauses and in PROC SQL

Try to use PROC SORT and PROC SUMMARY/MEANS (as I already suggested) first on the whole dataset instead of resorting to splitting it up.

Re: SQL sort execution failure with big data

bebess — Wed, 19 Jan 2022 15:36:31 GMT

ok will try 😉

Re: SQL sort execution failure with big data

bebess — Thu, 20 Jan 2022 07:55:50 GMT

The sort takes too long time, any idea how to solve the issue of having real time very higher to the CPU + system times ?

The summary time is more acceptable .

NOTE: There were 251951350 observations read from the data set TEMP.EA_RESULT_4491.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.YYY has 251951350 observations and 82 variables.
NOTE: Compressing data set WORK.YYY decreased size by 85.55 percent.
Compressed is 910365 pages; un-compressed would require 6298784 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 9:14:07.85
user cpu time 50:21.95
system cpu time 14:06.46
memory 1059797.87k
OS Memory 1081344.00k
Timestamp 01/20/2022 02:07:44 AM
Step Count 71 Switch Count 50140

NOTE: There were 251951350 observations read from the data set WORK.YYY.
NOTE: The data set TEMP.ZZZ has 243403020 observations and 82 variables.
NOTE: Compressing data set TEMP.ZZZ decreased size by 85.48 percent.
Compressed is 883779 pages; un-compressed would require 6085076 pages.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 1:10:18.26
user cpu time 39:05.53
system cpu time 4:26.67
memory 3041.71k
OS Memory 24680.00k
Timestamp 01/20/2022 03:18:02 AM
Step Count 72 Switch Count 6666

Re: SQL sort execution failure with big data

Kurt_Bremser — Thu, 20 Jan 2022 09:04:24 GMT

In both cases you are clearly I/O bound, especially during the sort.

Where is your TEMP library located?

Try to use the TAGSORT option in PROC SORT, this will at least reduce the size of the utility file, but if your real bottleneck is TEMP, then it will cause even longer time (as the source file needs to be read twice, and random at that).

All your working datasets (not something you keep as an "archive" and access very rarely) should be kept on local disks or disks accessed through a high-speed SAN. SAS is I/O intensive and needs top-level storage with large amounts of data.

Bottom-line: for more in-depth help, we need to know the details of your SAS setup (operating system - virtual machine?, type and number of processors available, layout of disk storage and location of libraries in there).

Re: SQL sort execution failure with big data

bebess — Thu, 20 Jan 2022 13:36:51 GMT

TEMP refers is under Azure location like SAS servers but probably not in the same place exactly ( will check this point )

sas guide version : 7.15 HF3 (7.100.5.6132) (64-bit)

Virtual machine : Windows (Windows Server 2016 Datacenter)

Standard D16as v4 (16 virtual processors, 64 Gio of memory)

Group=PERFORMANCE
ARMAGENT= Specifies an ARM agent (which is an executable module or keyword, such as LOG4SAS) that contains a specific
implementation of the ARM API.
ARMLOC=ARMLOG.LOG Specifies the location of the ARM log.
ARMSUBSYS=(ARM_NONE)
Specifies the SAS ARM subsystems to enable or disable.
BUFNO=1 Specifies the number of buffers for processing SAS data sets.
BUFSIZE=0 Specifies the size of a buffer page for output SAS data sets.
CGOPTIMIZE=3 Specifies the level of optimization to perform during code compilation.
CMPMODEL=BOTH Specifies the output model type for the MODEL procedure.
CMPOPT=(NOEXTRAMATH NOMISSCHECK NOPRECISE NOGUARDCHECK NOGENSYMNAMES NOFUNCDIFFERENCING SHORTCIRCUIT NOPROFILE NODEBUGHOST
NODEBUGPORT)
Specifies the type of code-generation optimizations to use in the SAS language compiler.
COMPRESS=YES Specifies the type of compression to use for observations in output SAS data sets.
CPUCOUNT=4 Specifies the number of processors that thread-enabled applications should assume are available for concurrent
processing.
NODBIDIRECTEXEC The SQL pass-through facility does not optimize the handling of SQL statements.
DBSLICEPARM=(THREADED_APPS, 2)
Specifies whether SAS procedures, applications, and the DATA step can read DBMS tables in parallel, and the
number of threads to use to read the DBMS tables.
DS2ACCEL=NONE Provides support for DS2 code pass-through acceleration.
DSACCEL=NONE Provides support for code pass-through acceleration.
HADOOPPLATFORM=MAPRED
Specifies the execution platform for the SAS In-Database Code Accelerator for Hadoop.
2 The SAS System 09:27 Wednesday, January 19, 2022

MAXSEGRATIO=75 Specifies the upper limit for the percentage of index segments that the SPD Engine identifies as containing the
value referenced in the WHERE expression.
MEXECSIZE=65536 Specifies the maximum macro size that can be executed in memory.
MINPARTSIZE=16777216
Specifies the minimum size of the data component partitions for SPD Engine data sets.
SORTSIZE=1073741824
Specifies the amount of memory that is available to the SORT procedure.
NOSPDEFILECACHE Disables caching of opened SPD Engine files.
SPDEINDEXSORTSIZE=33554432
Specifies the memory size for sorting index values.
SPDEMAXTHREADS=0 Specifies the maximum number of threads that the SPD Engine can spawn for I/O processing.
SPDEPARALLELREAD=NO
Enables or disables SPD Engine parallel reads when no WHERE clause is in effect.
SPDESORTSIZE=33554432
Specifies the memory size that is used for sorting by the SPD Engine.
SPDEUTILLOC= Specifies one or more locations where the SPD Engine can temporarily store utility files.
SPDEWHEVAL=COST Specifies the WHERE statement evaluation process for the SPD Engine.
SQLGENERATION=(NONE DBMS='TERADATA DB2 ORACLE NETEZZA ASTER GREENPLM HADOOP SAPHANA IMPALA HAWQ POSTGRES REDSHIFT SQLSVR VERTICA')
Specifies whether and when SAS procedures generate SQL for in-database processing of source data.
SQLREDUCEPUT=DBMS For PROC SQL, specifies the engine type to use to optimize a PUT function in a query.
SQLREDUCEPUTOBS=0 For PROC SQL, specifies the minimum number of observations that must be in a table for PROC SQL to optimize the
PUT function in a query.
SQLREDUCEPUTVALUES=0
For PROC SQL, specifies the maximum number of SAS format values that can exist in a PUT function expression to
optimize the PUT function in a query.
STRIPESIZE= Specifies path and size pairs to identify I/O device stripe size. Stripe size indicates page size when creating
a data set or utility file.
THREADS Uses threaded processing for SAS applications that support it.
UBUFNO=0 Specifies the number of utility file buffers.
UBUFSIZE=0 Specifies the size of utility file buffers.
UTILLOC=WORK Specifies one or more file system locations in which threaded applications can store utility files.
VBUFSIZE=65536 Specifies the buffer size for a view.

SLEEPCNTL=NO Specifies whether SAS prevents Windows from going into sleep mode.
ALIGNSASIOFILES Aligns SAS files on a page boundary for improved performance.
MEMSIZE=68947703040
Specifies the limit on the amount of virtual memory that can be used during a SAS session.

Re: SQL sort execution failure with big data

Kurt_Bremser — Thu, 20 Jan 2022 13:49:42 GMT

If your SAS processing happens in the same Azure instance, that would be fine, but if you have in fact an external network connection from the SAS server to it (read: over the internet or similar), then that will slow it down considerably.

Download the whole dataset once to your local environment, and proceed from there.

Is your Windows server that runs SAS located in the Azure cloud?

Re: SQL sort execution failure with big data

bebess — Thu, 20 Jan 2022 14:40:47 GMT

Yes Windows server that runs SAS is in Azure cloud
It looks like the threaded procedures ( SQL , SORT ...etc) takes much more time , because another data step in my process shows more acceptable time without a big difference between real time and CPU time
NOTE: There were 140027790 observations read from the data set WORK. MYSASTABLE.
NOTE: The data set WORK.MYSASTABLE has 251951350 observations and 131 variables.
NOTE: Compressing data set WORK. MYSASTABLE decreased size by 81.14 percent.
Compressed is 1187708 pages; un-compressed would require 6298784 pages.
NOTE: DATA statement used (Total process time):
real time 1:00:03.45
user cpu time 53:12.29
system cpu time 1:59.54
memory 2564.81k
OS Memory 22720.00k
Timestamp 01/18/2022 03:27:44 PM
Step Count 77 Switch Count 0

Re: SQL sort execution failure with big data

Kurt_Bremser — Thu, 20 Jan 2022 15:58:32 GMT

A data step that reads and writes from/to a compressed dataset will produce much less I/O than a sort and any operation in SQL that requires a utility file, as the utility files will be uncompressed. Show us the log of this:

data work.test1(compress=yes);
set temp.ea_result_4491;
run;

data work.test2 (compress=yes);
set work.test1;
run;

You will then see the relative performance between the two libraries, and if (as I suspect) reading from TEMP is significantly slower, then you need to contemplate moving as much data as possible to the SAS server (and keeping it there), or get in touch with the people responsible for setting up that library storage-wise.

Re: SQL sort execution failure with big data

bebess — Fri, 21 Jan 2022 08:10:48 GMT

i've tested the sort procedure having the table in my Work sesson ( local ) and i haven't notice an improvment of performance, i've still a big difference cpu and real time , the sort takes around 8 hours !
do you think it can improve performance if i use option SPDESORTSIZE , actually i can see the value is very low , what do you think ?

SPDESORTSIZE=33554432
Specifies the memory size that is used for sorting by the SPD Engine.

Re: SQL sort execution failure with big data

Kurt_Bremser — Fri, 21 Jan 2022 12:26:15 GMT

@bebess wrote:
i've tested the sort procedure having the table in my Work sesson ( local ) and i haven't notice an improvment of performance, i've still a big difference cpu and real time , the sort takes around 8 hours !

Which means that your storage is not up to the task. Get in touch with your SAS administrators, they need to improve the performance if you need to work with such data sizes regularly.