Hi experts.
I have a table of 24M records. Running the following step:
data gmod.&train_table;
set gmod.&train_table;
format split $10.;
if email_dt<="&cutoff"d then split="train";
else split="test";
run;
In SAS 9.4 it takes a few minutes to execute.
In Viya 4, in a CAS environment, it takes forever and has still not completed after almost one hour.
Any idea why the DATA step performs so badly in Viya?
Bests
Is your GMOD library a SAS V9 on-disk library or a CAS server in-memory library?
If GMOD is a V9 library, I wouldn't expect Viya to be faster, but it shouldn't be far slower either, so that would need more investigation. If GMOD is a CAS library, Viya may be spending a lot of time loading the data from disk into memory if it is not already loaded. Again, further investigation is required.
Also CAS functionality like PROC CASUTIL will likely improve performance.
GMOD is a CAS library. The table is already loaded in memory.
So many things can affect execution time.
First, a log would really help.
Second, a report that shows how your CAS table is distributed across your nodes (PROC CASUTIL, I believe).
Does it take one hour every time?
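A minimal sketch of how such a report could be produced, assuming an active CAS session and that the caslib behind the GMOD libref is named "Global Marketing Models" (the name that appears in the log posted further down; the macro variable gmod_caslib is only an illustrative name). PROC CASUTIL lists the in-memory tables, and the table.tableDetails action with level="node" reports rows and blocks per node:

```sas
/* Assumption: caslib name taken from the log in this thread; adjust as needed */
%let gmod_caslib=Global Marketing Models;

/* List the tables currently loaded in memory in that caslib */
proc casutil;
   list tables incaslib="&gmod_caslib";
quit;

/* Show how the rows of the table are spread across the CAS nodes */
proc cas;
   table.tableDetails /
      caslib="&gmod_caslib",
      name="&train_table",
      level="node";
run;
quit;
```

A heavily skewed distribution (most rows on one node) would be one possible explanation for a slow multithreaded step, though it is only one of several things worth checking.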
If you run below code...
data gmod.&train_table.TEST;
set gmod.&train_table;
if _n_=1;
format split $10.;
if email_dt<="&cutoff"d then split="train";
else split="test";
run;
...does the SAS log tell you that processing happens in CAS and multithreaded?
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
And just to make sure your session is not hanging somewhere: please reset it first in SAS Studio via Options/Reset SAS Session.
In the Viya environments I'm using, things are "flying", so I assume that in your case it's the connection to the source data or some other config "thingy" that needs to change.
Yes, both CAS and multithreaded:
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 24160332 observations read from the table SPNDAC_TRAIN_TOPIC_PROP_V1 in caslib Global Marketing Models.
NOTE: The table spndac_train_topic_prop_v1TEST in caslib Global Marketing Models has 128 observations and 539 variables.
NOTE: DATA statement used (Total process time):
real time 2.63 seconds
cpu time 0.04 seconds
The SAS log you posted takes less than three seconds to run. Can you post an example of the slow program?
I have no log to post, because when I run this piece of code:
80 data gmod.&train_table._2;
81 set gmod.&train_table;
82 format split $10.;
83 if email_dt<="&cutoff9"d then split="train";
84 else split="test";
85 run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
It gets stuck running and never completes.
I've done some testing in my environment (recent Viya 4 with 4 worker nodes and 192 threads) with a table that has the same number of rows and columns as yours. I've assumed that most of your columns are numeric. In my environment your DATA step code executes in around 40 seconds.
Here is the test code I used:
options msglevel=i;
cas mysess;
%let sessref=mysess;
%let mypath=&_userhome;
/* define CAS lib and linked SAS lib */
caslib gmod
local
sessref=mysess
path="&mypath"
datasource=(srctype="path")
libref=gmod
;
%let train_table=demo;
%let cutoff=01jan2023;
%let n_vars=536;
%let n_rows=24160332;
/* %let n_rows=1000; */
/* create source table in CAS */
data gmod.&train_table._1;
array vars_ {&n_vars} 8 (&n_vars*1234567890);
format email_dt date9.;
do i=1 to &n_rows;
email_dt=rand('integer',-5,5)+"&cutoff"d;
output;
end;
run;
/* run data step as shared creating a new table in CAS */
data gmod.&train_table._2(compress=yes)/sessref=&sessref;
set gmod.&train_table._1;
format split $10.;
if email_dt<="&cutoff"d then split="train";
else split="test";
run;
/* run data step as shared creating a new table in CAS */
data gmod.&train_table._2_uncompressed/sessref=&sessref;
set gmod.&train_table._1;
format split $10.;
if email_dt<="&cutoff"d then split="train";
else split="test";
run;
/* run data step as shared replacing the existing table in CAS */
data gmod.&train_table._1(compress=yes) /sessref=&sessref;
set gmod.&train_table._1;
format split $10.;
if email_dt<="&cutoff"d then split="train";
else split="test";
run;
/* free up memory: Delete no more required CAS tables */
proc cas;
session="mysess";
table.dropTable /
caslib="gmod",
name="&train_table._1",
quiet=TRUE
;
run;
quit;
cas mysess terminate;
And here is the SAS log:
1    ***SAS Studio header code***
79
80   options msglevel=i;
81
82   cas mysess;
NOTE: The session MYSESS connected successfully to Cloud Analytic Services sas-cas-server-default-client using port ****. The UUID is ****. The user is **user** and the active caslib is CASUSER(**user**).
NOTE: The SAS option SESSREF was updated with the value MYSESS.
NOTE: The SAS macro _SESSREF_ was updated with the value MYSESS.
NOTE: The session is using 4 workers.
83
84   %let sessref=mysess;
85   %let mypath=&_userhome;
86
87   /* define CAS lib and linked SAS lib */
88   caslib gmod
89   local
90   sessref=mysess
91   path="&mypath"
92   datasource=(srctype="path")
93   libref=gmod
94   ;
NOTE: 'GMOD' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'GMOD'.
NOTE: CASLIB GMOD for session MYSESS will be mapped to SAS Library GMOD.
NOTE: Action to ADD caslib GMOD completed for session MYSESS.
95
96   %let train_table=demo;
97   %let cutoff=01jan2023;
98   %let n_vars=536;
99   %let n_rows=24160332;
100  /* %let n_rows=1000; */
101
102  /* create source table in CAS */
103  data gmod.&train_table._1;
104  array vars_ {&n_vars} 8 (&n_vars*1234567890);
105  format email_dt date9.;
106  do i=1 to &n_rows;
107  email_dt=rand('integer',-5,5)+"&cutoff"d;
108  output;
109  end;
110  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step has no input data set and will run in a single thread.
NOTE: The table demo_1 in caslib GMOD has 24160332 observations and 538 variables.
NOTE: DATA statement used (Total process time):
real time 4:19.64
cpu time 0.37 seconds
111
112  /* run data step as shared creating a new table in CAS */
113  data gmod.&train_table._2(compress=yes)/sessref=&sessref;
114  set gmod.&train_table._1;
115  format split $10.;
116  if email_dt<="&cutoff"d then split="train";
117  else split="test";
118  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 24160332 observations read from the table DEMO_1 in caslib GMOD.
NOTE: The table demo_2 in caslib GMOD has 24160332 observations and 539 variables.
NOTE: DATA statement used (Total process time):
real time 39.41 seconds
cpu time 0.07 seconds
119
120  /* run data step as shared creating a new table in CAS */
121  data gmod.&train_table._2_uncompressed/sessref=&sessref;
122  set gmod.&train_table._1;
123  format split $10.;
124  if email_dt<="&cutoff"d then split="train";
125  else split="test";
126  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 24160332 observations read from the table DEMO_1 in caslib GMOD.
NOTE: The table demo_2_uncompressed in caslib GMOD has 24160332 observations and 539 variables.
NOTE: DATA statement used (Total process time):
real time 2:22.31
cpu time 0.18 seconds
127
128  /* run data step as shared replacing the existing table in CAS */
129  data gmod.&train_table._1(compress=yes) /sessref=&sessref;
130  set gmod.&train_table._1;
131  format split $10.;
132  if email_dt<="&cutoff"d then split="train";
133  else split="test";
134  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 24160332 observations read from the table DEMO_1 in caslib GMOD.
NOTE: The table demo_1 in caslib GMOD has 24160332 observations and 539 variables.
NOTE: DATA statement used (Total process time):
real time 44.76 seconds
cpu time 0.09 seconds
135
136  /* free up memory: Delete no more required CAS tables */
137  proc cas;
138  session="mysess";
139  table.dropTable /
140  caslib="gmod",
141  name="&train_table._1",
142  quiet=TRUE
143  ;
144  run;
NOTE: Active Session now MYSESS.
145  quit;
NOTE: PROCEDURE CAS used (Total process time):
real time 0.25 seconds
cpu time 0.00 seconds
146
147  cas mysess terminate;
NOTE: Libref GMOD has been deassigned.
NOTE: Deletion of the session MYSESS was successful.
NOTE: The default CAS session MYSESS identified by SAS option SESSREF= was terminated. Use the OPTIONS statement to set the SESSREF= option to an active session.
NOTE: Request to TERMINATE completed for session MYSESS.
148
149  *** SAS Studio trailer code ***
160
Findings in my environment:
Further considerations:
If you really can't get better performance, talk to your SAS admin so they can investigate how much CAS memory is available (it is shared, so other users may be consuming too much, or too many promoted tables may be loaded) and verify that CAS_DISK_CACHE is configured as it should be.
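As a rough starting point for that investigation, the sketch below shows one way an administrator might look at the CAS disk cache. It assumes an active CAS session and that the builtins.getCacheInfo action is available in your Viya release; the action typically requires elevated (administrator) privileges, so a regular user may get an authorization error rather than a report.

```sas
proc cas;
   /* Report CAS disk cache usage per node.                      */
   /* Assumption: the session user has the privileges needed to  */
   /* run this action; otherwise ask the SAS admin to run it.    */
   builtins.getCacheInfo;
run;
quit;
```

If the cache paths are close to full while the DATA step is running, that points to a capacity or configuration issue rather than a problem with the code itself.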
So I tried to generate a new table from the existing one using the DATA step, and in addition to the long run time, this time an error was also triggered:
80 data gmod.&train_table._2;
81 set gmod.&train_table;
82 format split $10.;
83 if email_dt<="&cutoff9"d then split="train";
84 else split="test";
85 run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
ERROR: Cloud Analytic Services failed writing to system disk space. Please contact your administrator.
ERROR: Cloud Analytic Services failed writing to system disk space. Please contact your administrator.
ERROR: Cloud Analytic Services failed writing to system disk space. Please
I will probably check with my admin what is going on here.
Yes, talk to your SAS Admin. This error strongly suggests a resource constraint that needs to be resolved.
https://sas.service-now.com/csm?id=kb_article_view&sysparm_article=KB0036522
Please share what the root cause of the issue was once your SAS Admin got to the bottom of it.