Solved: Re: SQL sort execution failure with big data

bebess · Posted 01-19-2022 05:26 AM

Hi all ,

i am having an issue when aggregating result from a table using an SQL procedure that perform basic sum calculations and a group by .

i am having the issue with a big dataset 177GB ( 251 951 350 observations and 138 variables), below the error message :

NOTE: At least one nonessential grouping column reference has been removed from a GROUP BY's reference list.
WARNING: This CREATE TABLE statement recursively references the target table. A consequence of this is a possible data integrity
problem.
ERROR: Sort execution failure.

NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
MPRINT(AGGREGATE): QUIT;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 1:08:47.47
user cpu time 11:37.53
system cpu time 2:51.71
memory 1056479.00k
OS Memory 1076112.00k
Timestamp 01/18/2022 08:53:46 PM
Step Count 172 Switch Count 1

i know te first note refers to 16074 - "WARNING: At least one nonessential grouping column reference has been removed from a GROUP ...

The other point is that i'm overwritting the table by using the same name.

i'm sure the SQL procedure is using more space than the total space of the table because i'm having 2TB of free space in my Work session and i've noticed a huge consumption of space,

i am using compress=Yes option and the UTILOC refers to my work drive

My questions are :

1. How much space the SQL procedure needs ( 4 times the space of the data ? )

2. is there an alternative to my query ( that will not over consume memory or disk space ) ?

proc sql;

create table xxx as

select "result" as id,

list_of_var,

sum(var_n) as var_new

from xxx

group by id, list_of_var

;

quit;

Kurt_Bremser · Posted 01-21-2022 07:26 AM

@bebess wrote:
i've tested the sort procedure having the table in my Work sesson ( local ) and i haven't notice an improvment of performance, i've still a big difference cpu and real time , the sort takes around 8 hours !

Which means that your storage is not up to the task. Get in touch with your SAS administrators, they need to improve the performance if you need to work with such data sizes regularly.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

acordes · Posted 01-19-2022 05:34 AM

Do you have cas language avaible, do you have sas viya?

bebess · Posted 01-19-2022 06:19 AM

No it's SAS Enterprise Guide 7.15 , servers hosted in Azure

yabwon · Posted 01-19-2022 06:16 AM

Hi,

1) How many different "by groups" you have in that dataset? are they milions or maybe tousands?

2) A "quick try" could be to use Proc Sort on the data set with TAGSORT option and then use data step by-group processing to sum up your data.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

bebess · Posted 01-19-2022 06:26 AM

1) i expect to have around 190 348 410 obs for the aggregated table.
2 ) will have a look on that 🙂

yabwon · Posted 01-19-2022 06:48 AM

Look at this video and article by @hashman and @DonH , it may give you some hints how to process your data set.

https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Uniform-Hashing-of-Arbitrary-Input-Into-...

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

bebess · Posted 01-19-2022 10:10 AM

i am trying to use this technique in a subset of the original data , however i've the following issue

WARNING: In a call to the CATX function, the buffer allocated for the result was not long enough to contain the concatenation of
all the arguments. The correct result would contain 336 characters, but the actual result might either be truncated to 200
character(s) or be completely blank, depending on the calling environment. The following note indicates the left-most
argument that caused truncation.

i am using the macro below and having a lot of key variables instead of ID and Key.

Any idea how to support length ( greater than 200 ) ?

%macro agg (in=, out=, N_segments=) ;
%local X seg ;
%let X=1+mod(input(char(MD5(catx(":",ID,Key)),10),pib1.),&N_segments) ;
%do seg = 1 %to &N_segments ;
proc sql;
create table seg as
select ID, Key, sum(Var) as SUM, count (distinct Var) as UCOUNT
from &in where &X = &seg group ID, Key;
quit;
proc append base=&out data=seg; run;
%end ;
%mend ;
%agg (in=Trans, out=Agg, N_segments=3)

Kurt_Bremser · Posted 01-19-2022 10:30 AM

Maxim 1.

From the documentation of the CATX Function:

Length of Returned Variable: Special Cases

The CATX function returns a value to a variable or returns a value in a temporary buffer. The value that is returned from the CATX function has one of these lengths:

up to 200 characters in WHERE clauses and in PROC SQL

Try to use PROC SORT and PROC SUMMARY/MEANS (as I already suggested) first on the whole dataset instead of resorting to splitting it up.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

bebess · Posted 01-19-2022 10:36 AM

ok will try 😉

bebess · Posted 01-20-2022 02:55 AM

The sort takes too long time, any idea how to solve the issue of having real time very higher to the CPU + system times ?

The summary time is more acceptable .

NOTE: There were 251951350 observations read from the data set TEMP.EA_RESULT_4491.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.YYY has 251951350 observations and 82 variables.
NOTE: Compressing data set WORK.YYY decreased size by 85.55 percent.
Compressed is 910365 pages; un-compressed would require 6298784 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 9:14:07.85
user cpu time 50:21.95
system cpu time 14:06.46
memory 1059797.87k
OS Memory 1081344.00k
Timestamp 01/20/2022 02:07:44 AM
Step Count 71 Switch Count 50140

NOTE: There were 251951350 observations read from the data set WORK.YYY.
NOTE: The data set TEMP.ZZZ has 243403020 observations and 82 variables.
NOTE: Compressing data set TEMP.ZZZ decreased size by 85.48 percent.
Compressed is 883779 pages; un-compressed would require 6085076 pages.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 1:10:18.26
user cpu time 39:05.53
system cpu time 4:26.67
memory 3041.71k
OS Memory 24680.00k
Timestamp 01/20/2022 03:18:02 AM
Step Count 72 Switch Count 6666

Kurt_Bremser · Posted 01-20-2022 04:04 AM

In both cases you are clearly I/O bound, especially during the sort.

Where is your TEMP library located?

Try to use the TAGSORT option in PROC SORT, this will at least reduce the size of the utility file, but if your real bottleneck is TEMP, then it will cause even longer time (as the source file needs to be read twice, and random at that).

All your working datasets (not something you keep as an "archive" and access very rarely) should be kept on local disks or disks accessed through a high-speed SAN. SAS is I/O intensive and needs top-level storage with large amounts of data.

Bottom-line: for more in-depth help, we need to know the details of your SAS setup (operating system - virtual machine?, type and number of processors available, layout of disk storage and location of libraries in there).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

bebess · Posted 01-20-2022 08:36 AM

TEMP refers is under Azure location like SAS servers but probably not in the same place exactly ( will check this point )

sas guide version : 7.15 HF3 (7.100.5.6132) (64-bit)

Virtual machine : Windows (Windows Server 2016 Datacenter)

Standard D16as v4 (16 virtual processors, 64 Gio of memory)

Group=PERFORMANCE
ARMAGENT= Specifies an ARM agent (which is an executable module or keyword, such as LOG4SAS) that contains a specific
implementation of the ARM API.
ARMLOC=ARMLOG.LOG Specifies the location of the ARM log.
ARMSUBSYS=(ARM_NONE)
Specifies the SAS ARM subsystems to enable or disable.
BUFNO=1 Specifies the number of buffers for processing SAS data sets.
BUFSIZE=0 Specifies the size of a buffer page for output SAS data sets.
CGOPTIMIZE=3 Specifies the level of optimization to perform during code compilation.
CMPMODEL=BOTH Specifies the output model type for the MODEL procedure.
CMPOPT=(NOEXTRAMATH NOMISSCHECK NOPRECISE NOGUARDCHECK NOGENSYMNAMES NOFUNCDIFFERENCING SHORTCIRCUIT NOPROFILE NODEBUGHOST
NODEBUGPORT)
Specifies the type of code-generation optimizations to use in the SAS language compiler.
COMPRESS=YES Specifies the type of compression to use for observations in output SAS data sets.
CPUCOUNT=4 Specifies the number of processors that thread-enabled applications should assume are available for concurrent
processing.
NODBIDIRECTEXEC The SQL pass-through facility does not optimize the handling of SQL statements.
DBSLICEPARM=(THREADED_APPS, 2)
Specifies whether SAS procedures, applications, and the DATA step can read DBMS tables in parallel, and the
number of threads to use to read the DBMS tables.
DS2ACCEL=NONE Provides support for DS2 code pass-through acceleration.
DSACCEL=NONE Provides support for code pass-through acceleration.
HADOOPPLATFORM=MAPRED
Specifies the execution platform for the SAS In-Database Code Accelerator for Hadoop.
2 The SAS System 09:27 Wednesday, January 19, 2022

MAXSEGRATIO=75 Specifies the upper limit for the percentage of index segments that the SPD Engine identifies as containing the
value referenced in the WHERE expression.
MEXECSIZE=65536 Specifies the maximum macro size that can be executed in memory.
MINPARTSIZE=16777216
Specifies the minimum size of the data component partitions for SPD Engine data sets.
SORTSIZE=1073741824
Specifies the amount of memory that is available to the SORT procedure.
NOSPDEFILECACHE Disables caching of opened SPD Engine files.
SPDEINDEXSORTSIZE=33554432
Specifies the memory size for sorting index values.
SPDEMAXTHREADS=0 Specifies the maximum number of threads that the SPD Engine can spawn for I/O processing.
SPDEPARALLELREAD=NO
Enables or disables SPD Engine parallel reads when no WHERE clause is in effect.
SPDESORTSIZE=33554432
Specifies the memory size that is used for sorting by the SPD Engine.
SPDEUTILLOC= Specifies one or more locations where the SPD Engine can temporarily store utility files.
SPDEWHEVAL=COST Specifies the WHERE statement evaluation process for the SPD Engine.
SQLGENERATION=(NONE DBMS='TERADATA DB2 ORACLE NETEZZA ASTER GREENPLM HADOOP SAPHANA IMPALA HAWQ POSTGRES REDSHIFT SQLSVR VERTICA')
Specifies whether and when SAS procedures generate SQL for in-database processing of source data.
SQLREDUCEPUT=DBMS For PROC SQL, specifies the engine type to use to optimize a PUT function in a query.
SQLREDUCEPUTOBS=0 For PROC SQL, specifies the minimum number of observations that must be in a table for PROC SQL to optimize the
PUT function in a query.
SQLREDUCEPUTVALUES=0
For PROC SQL, specifies the maximum number of SAS format values that can exist in a PUT function expression to
optimize the PUT function in a query.
STRIPESIZE= Specifies path and size pairs to identify I/O device stripe size. Stripe size indicates page size when creating
a data set or utility file.
THREADS Uses threaded processing for SAS applications that support it.
UBUFNO=0 Specifies the number of utility file buffers.
UBUFSIZE=0 Specifies the size of utility file buffers.
UTILLOC=WORK Specifies one or more file system locations in which threaded applications can store utility files.
VBUFSIZE=65536 Specifies the buffer size for a view.

SLEEPCNTL=NO Specifies whether SAS prevents Windows from going into sleep mode.
ALIGNSASIOFILES Aligns SAS files on a page boundary for improved performance.
MEMSIZE=68947703040
Specifies the limit on the amount of virtual memory that can be used during a SAS session.

Kurt_Bremser · Posted 01-20-2022 08:49 AM

If your SAS processing happens in the same Azure instance, that would be fine, but if you have in fact an external network connection from the SAS server to it (read: over the internet or similar), then that will slow it down considerably.

Download the whole dataset once to your local environment, and proceed from there.

Is your Windows server that runs SAS located in the Azure cloud?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

bebess · Posted 01-20-2022 09:40 AM

Yes Windows server that runs SAS is in Azure cloud
It looks like the threaded procedures ( SQL , SORT ...etc) takes much more time , because another data step in my process shows more acceptable time without a big difference between real time and CPU time
NOTE: There were 140027790 observations read from the data set WORK. MYSASTABLE.
NOTE: The data set WORK.MYSASTABLE has 251951350 observations and 131 variables.
NOTE: Compressing data set WORK. MYSASTABLE decreased size by 81.14 percent.
Compressed is 1187708 pages; un-compressed would require 6298784 pages.
NOTE: DATA statement used (Total process time):
real time 1:00:03.45
user cpu time 53:12.29
system cpu time 1:59.54
memory 2564.81k
OS Memory 22720.00k
Timestamp 01/18/2022 03:27:44 PM
Step Count 77 Switch Count 0

Kurt_Bremser · Posted 01-20-2022 10:58 AM

A data step that reads and writes from/to a compressed dataset will produce much less I/O than a sort and any operation in SQL that requires a utility file, as the utility files will be uncompressed. Show us the log of this:

data work.test1(compress=yes);
set temp.ea_result_4491;
run;

data work.test2 (compress=yes);
set work.test1;
run;

You will then see the relative performance between the two libraries, and if (as I suspect) reading from TEMP is significantly slower, then you need to contemplate moving as much data as possible to the SAS server (and keeping it there), or get in touch with the people responsible for setting up that library storage-wise.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Length of Returned Variable: Special Cases

SAS Innovate 2025: Call for Content

Classroom Training Available!