Hi all ,
i am having an issue when aggregating result from a table using an SQL procedure that perform basic sum calculations and a group by .
i am having the issue with a big dataset 177GB ( 251 951 350 observations and 138 variables), below the error message :
NOTE: At least one nonessential grouping column reference has been removed from a GROUP BY's reference list.
WARNING: This CREATE TABLE statement recursively references the target table. A consequence of this is a possible data integrity
problem.
ERROR: Sort execution failure.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
MPRINT(AGGREGATE): QUIT;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 1:08:47.47
user cpu time 11:37.53
system cpu time 2:51.71
memory 1056479.00k
OS Memory 1076112.00k
Timestamp 01/18/2022 08:53:46 PM
Step Count 172 Switch Count 1
i know te first note refers to 16074 - "WARNING: At least one nonessential grouping column reference has been removed from a GROUP ...
The other point is that i'm overwritting the table by using the same name.
i'm sure the SQL procedure is using more space than the total space of the table because i'm having 2TB of free space in my Work session and i've noticed a huge consumption of space,
i am using compress=Yes option and the UTILOC refers to my work drive
My questions are :
1. How much space the SQL procedure needs ( 4 times the space of the data ? )
2. is there an alternative to my query ( that will not over consume memory or disk space ) ?
proc sql;
create table xxx as
select "result" as id,
list_of_var,
sum(var_n) as var_new
from xxx
group by id, list_of_var
;
quit;
@bebess wrote:
i've tested the sort procedure having the table in my Work sesson ( local ) and i haven't notice an improvment of performance, i've still a big difference cpu and real time , the sort takes around 8 hours !
Which means that your storage is not up to the task. Get in touch with your SAS administrators, they need to improve the performance if you need to work with such data sizes regularly.
Do you have cas language avaible, do you have sas viya?
Hi,
1) How many different "by groups" you have in that dataset? are they milions or maybe tousands?
2) A "quick try" could be to use Proc Sort on the data set with TAGSORT option and then use data step by-group processing to sum up your data.
Bart
Look at this video and article by @hashman and @DonH , it may give you some hints how to process your data set.
Bart
i am trying to use this technique in a subset of the original data , however i've the following issue
WARNING: In a call to the CATX function, the buffer allocated for the result was not long enough to contain the concatenation of
all the arguments. The correct result would contain 336 characters, but the actual result might either be truncated to 200
character(s) or be completely blank, depending on the calling environment. The following note indicates the left-most
argument that caused truncation.
i am using the macro below and having a lot of key variables instead of ID and Key.
Any idea how to support length ( greater than 200 ) ?
%macro agg (in=, out=, N_segments=) ;
%local X seg ;
%let X=1+mod(input(char(MD5(catx(":",ID,Key)),10),pib1.),&N_segments) ;
%do seg = 1 %to &N_segments ;
proc sql;
create table seg as
select ID, Key, sum(Var) as SUM, count (distinct Var) as UCOUNT
from &in where &X = &seg group ID, Key;
quit;
proc append base=&out data=seg; run;
%end ;
%mend ;
%agg (in=Trans, out=Agg, N_segments=3)
Maxim 1.
From the documentation of the CATX Function:
The CATX function returns a value to a variable or returns a value in a temporary buffer. The value that is returned from the CATX function has one of these lengths:
Try to use PROC SORT and PROC SUMMARY/MEANS (as I already suggested) first on the whole dataset instead of resorting to splitting it up.
In both cases you are clearly I/O bound, especially during the sort.
Where is your TEMP library located?
Try to use the TAGSORT option in PROC SORT, this will at least reduce the size of the utility file, but if your real bottleneck is TEMP, then it will cause even longer time (as the source file needs to be read twice, and random at that).
All your working datasets (not something you keep as an "archive" and access very rarely) should be kept on local disks or disks accessed through a high-speed SAN. SAS is I/O intensive and needs top-level storage with large amounts of data.
Bottom-line: for more in-depth help, we need to know the details of your SAS setup (operating system - virtual machine?, type and number of processors available, layout of disk storage and location of libraries in there).
TEMP refers is under Azure location like SAS servers but probably not in the same place exactly ( will check this point )
sas guide version : 7.15 HF3 (7.100.5.6132) (64-bit)
Virtual machine : Windows (Windows Server 2016 Datacenter)
Standard D16as v4 (16 virtual processors, 64 Gio of memory)
Group=PERFORMANCE
ARMAGENT= Specifies an ARM agent (which is an executable module or keyword, such as LOG4SAS) that contains a specific
implementation of the ARM API.
ARMLOC=ARMLOG.LOG Specifies the location of the ARM log.
ARMSUBSYS=(ARM_NONE)
Specifies the SAS ARM subsystems to enable or disable.
BUFNO=1 Specifies the number of buffers for processing SAS data sets.
BUFSIZE=0 Specifies the size of a buffer page for output SAS data sets.
CGOPTIMIZE=3 Specifies the level of optimization to perform during code compilation.
CMPMODEL=BOTH Specifies the output model type for the MODEL procedure.
CMPOPT=(NOEXTRAMATH NOMISSCHECK NOPRECISE NOGUARDCHECK NOGENSYMNAMES NOFUNCDIFFERENCING SHORTCIRCUIT NOPROFILE NODEBUGHOST
NODEBUGPORT)
Specifies the type of code-generation optimizations to use in the SAS language compiler.
COMPRESS=YES Specifies the type of compression to use for observations in output SAS data sets.
CPUCOUNT=4 Specifies the number of processors that thread-enabled applications should assume are available for concurrent
processing.
NODBIDIRECTEXEC The SQL pass-through facility does not optimize the handling of SQL statements.
DBSLICEPARM=(THREADED_APPS, 2)
Specifies whether SAS procedures, applications, and the DATA step can read DBMS tables in parallel, and the
number of threads to use to read the DBMS tables.
DS2ACCEL=NONE Provides support for DS2 code pass-through acceleration.
DSACCEL=NONE Provides support for code pass-through acceleration.
HADOOPPLATFORM=MAPRED
Specifies the execution platform for the SAS In-Database Code Accelerator for Hadoop.
2 The SAS System 09:27 Wednesday, January 19, 2022
MAXSEGRATIO=75 Specifies the upper limit for the percentage of index segments that the SPD Engine identifies as containing the
value referenced in the WHERE expression.
MEXECSIZE=65536 Specifies the maximum macro size that can be executed in memory.
MINPARTSIZE=16777216
Specifies the minimum size of the data component partitions for SPD Engine data sets.
SORTSIZE=1073741824
Specifies the amount of memory that is available to the SORT procedure.
NOSPDEFILECACHE Disables caching of opened SPD Engine files.
SPDEINDEXSORTSIZE=33554432
Specifies the memory size for sorting index values.
SPDEMAXTHREADS=0 Specifies the maximum number of threads that the SPD Engine can spawn for I/O processing.
SPDEPARALLELREAD=NO
Enables or disables SPD Engine parallel reads when no WHERE clause is in effect.
SPDESORTSIZE=33554432
Specifies the memory size that is used for sorting by the SPD Engine.
SPDEUTILLOC= Specifies one or more locations where the SPD Engine can temporarily store utility files.
SPDEWHEVAL=COST Specifies the WHERE statement evaluation process for the SPD Engine.
SQLGENERATION=(NONE DBMS='TERADATA DB2 ORACLE NETEZZA ASTER GREENPLM HADOOP SAPHANA IMPALA HAWQ POSTGRES REDSHIFT SQLSVR VERTICA')
Specifies whether and when SAS procedures generate SQL for in-database processing of source data.
SQLREDUCEPUT=DBMS For PROC SQL, specifies the engine type to use to optimize a PUT function in a query.
SQLREDUCEPUTOBS=0 For PROC SQL, specifies the minimum number of observations that must be in a table for PROC SQL to optimize the
PUT function in a query.
SQLREDUCEPUTVALUES=0
For PROC SQL, specifies the maximum number of SAS format values that can exist in a PUT function expression to
optimize the PUT function in a query.
STRIPESIZE= Specifies path and size pairs to identify I/O device stripe size. Stripe size indicates page size when creating
a data set or utility file.
THREADS Uses threaded processing for SAS applications that support it.
UBUFNO=0 Specifies the number of utility file buffers.
UBUFSIZE=0 Specifies the size of utility file buffers.
UTILLOC=WORK Specifies one or more file system locations in which threaded applications can store utility files.
VBUFSIZE=65536 Specifies the buffer size for a view.
SLEEPCNTL=NO Specifies whether SAS prevents Windows from going into sleep mode.
ALIGNSASIOFILES Aligns SAS files on a page boundary for improved performance.
MEMSIZE=68947703040
Specifies the limit on the amount of virtual memory that can be used during a SAS session.
If your SAS processing happens in the same Azure instance, that would be fine, but if you have in fact an external network connection from the SAS server to it (read: over the internet or similar), then that will slow it down considerably.
Download the whole dataset once to your local environment, and proceed from there.
Is your Windows server that runs SAS located in the Azure cloud?
A data step that reads and writes from/to a compressed dataset will produce much less I/O than a sort and any operation in SQL that requires a utility file, as the utility files will be uncompressed. Show us the log of this:
data work.test1(compress=yes);
set temp.ea_result_4491;
run;
data work.test2 (compress=yes);
set work.test1;
run;
You will then see the relative performance between the two libraries, and if (as I suspect) reading from TEMP is significantly slower, then you need to contemplate moving as much data as possible to the SAS server (and keeping it there), or get in touch with the people responsible for setting up that library storage-wise.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.