dcortell
Pyrite | Level 9

Hi experts.

 

I have a table of 24M records. Running the following step:

 

data gmod.&train_table;
set gmod.&train_table;
format split $10.;
if email_dt<="&cutoff"d then split="train";
else split="test";
run;

In SAS 9.4 it takes a few minutes to execute.

In Viya 4, in the CAS environment, it has been running for almost an hour without completing.

Any idea why the DATA step performs so badly in Viya?

 

Bests

10 REPLIES
SASKiwi
PROC Star

Is your GMOD library a SAS V9 on-disk library or a CAS server in-memory library?

 

If GMOD is a V9 one, I wouldn't expect faster performance, but not way slower. This needs more investigation. If GMOD is a CAS library then it's possible Viya is spending a lot of time loading the data off disk and into memory if it is not already loaded. Again further investigation is required.

 

Also CAS functionality like PROC CASUTIL will likely improve performance.
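
A minimal sketch of how to check what is currently loaded, assuming an active CAS session and that the caslib behind the GMOD libref is also named gmod (an assumption; substitute the real caslib name):

```sas
/* Assumption: the caslib is named "gmod"; replace with the actual caslib name. */
proc casutil incaslib="gmod";
  list tables;  /* in-memory tables currently loaded in the caslib */
  list files;   /* source files in the caslib's data source that could be loaded */
quit;
```

If the table shows up under LIST TABLES, it is already in memory and load time is unlikely to be the cause.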

dcortell
Pyrite | Level 9

GMOD is a caslibrary. The table is loaded in memory already.

LinusH
Tourmaline | Level 20

There can be so many things in play that affect execution time.

First, a log would really help.

Secondly, a report that shows how your CAS table is distributed across your nodes (PROC CASUTIL, I believe).

Does it take one hour every time?
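
One way to get such a distribution report is the table.tableDetails action. A minimal sketch, assuming an active CAS session; the caslib name "gmod" is an assumption, and &train_table is the macro variable from the original post:

```sas
proc cas;
  /* level="node" requests one row of details per CAS node instead of a single summary row */
  table.tableDetails /
    caslib="gmod",
    name="&train_table",
    level="node";
quit;
```

A heavily skewed row count across nodes would suggest that only a few threads are doing most of the work.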

Data never sleeps
Patrick
Opal | Level 21

If you run below code...

data gmod.&train_table.TEST;
  set gmod.&train_table;
  if _n_=1;
  format split $10.;
  if email_dt<="&cutoff"d then split="train";
  else split="test";
run;

...does the SAS log tell you that processing happens in CAS and multithreaded?

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

 

And just to make sure your session is not hanging somewhere: Please reset it first in SAS Studio via Options/Reset SAS Session

 

In the Viya environments I'm using, things are "flying", so I assume that in your case it's the connection to the source data or some other config "thingy" that needs to change.

dcortell
Pyrite | Level 9

Yes, both CAS and multithreaded:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 24160332 observations read from the table SPNDAC_TRAIN_TOPIC_PROP_V1 in caslib Global Marketing Models.
NOTE: The table spndac_train_topic_prop_v1TEST in caslib Global Marketing Models has 128 observations and 539 variables.
NOTE: DATA statement used (Total process time):
real time 2.63 seconds
cpu time 0.04 seconds
SASKiwi
PROC Star

The SAS log you posted takes less than three seconds to run. Can you post an example of the slow program?

dcortell
Pyrite | Level 9

I have no complete log, because when I run this piece of code:

 

80   data gmod.&train_table._2;
81 set gmod.&train_table;
82 format split $10.;
83 if email_dt<="&cutoff9"d then split="train";
84 else split="test";
85 run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads. 

It gets stuck running and never completes.

Patrick
Opal | Level 21

I've done some testing in my environment (recent Viya 4 with 4 worker nodes and 192 threads) with a table with the same number of rows and columns as yours. I've made the assumption that most of your columns are numeric. In my environment your data step code executes within around 40 seconds. 

 

Here is the test code I used:

options msglevel=i;

cas mysess;

%let sessref=mysess;
%let mypath=&_userhome;

/* define CAS lib and linked SAS lib */
caslib gmod 
  local
  sessref=mysess
  path="&mypath" 
  datasource=(srctype="path") 
  libref=gmod
  ;

%let train_table=demo;
%let cutoff=01jan2023;
%let n_vars=536;
%let n_rows=24160332;
/* %let n_rows=1000; */

/* create source table in CAS */
data gmod.&train_table._1;
  array vars_ {&n_vars} 8 (&n_vars*1234567890);
  format email_dt date9.;
  do i=1 to &n_rows;
    email_dt=rand('integer',-5,5)+"&cutoff"d;
    output;
  end;
run;

/* run data step as shared creating a new table in CAS */
data gmod.&train_table._2(compress=yes)/sessref=&sessref;
  set gmod.&train_table._1;
  format split $10.;
  if email_dt<="&cutoff"d then split="train";
  else split="test";
run;

/* run data step as shared creating a new table in CAS */
data gmod.&train_table._2_uncompressed/sessref=&sessref;
  set gmod.&train_table._1;
  format split $10.;
  if email_dt<="&cutoff"d then split="train";
  else split="test";
run;

/* run data step as shared replacing the existing table in CAS */
data gmod.&train_table._1(compress=yes) /sessref=&sessref;
  set gmod.&train_table._1;
  format split $10.;
  if email_dt<="&cutoff"d then split="train";
  else split="test";
run;

/* free up memory: Delete no more required CAS tables */
proc cas;
  session="mysess";
  table.dropTable /
  caslib="gmod",
  name="&train_table._1",
  quiet=TRUE
  ;
  run;
quit;

cas mysess terminate;

And here is the SAS log:

1    ***SAS Studio header code***
79   
80   options msglevel=i;
81   
82   cas mysess;
NOTE: The session MYSESS connected successfully to Cloud Analytic Services sas-cas-server-default-client using port ****. The UUID 
      is ****. The user is **user** and the active caslib is 
      CASUSER(**user**).
NOTE: The SAS option SESSREF was updated with the value MYSESS.
NOTE: The SAS macro _SESSREF_ was updated with the value MYSESS.
NOTE: The session is using 4 workers.
83   
84   %let sessref=mysess;
85   %let mypath=&_userhome;
86   
87   /* define CAS lib and linked SAS lib */
88   caslib gmod
89     local
90     sessref=mysess
91     path="&mypath"
92     datasource=(srctype="path")
93     libref=gmod
94     ;
NOTE: 'GMOD' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'GMOD'.
NOTE: CASLIB GMOD for session MYSESS will be mapped to SAS Library GMOD.
NOTE: Action to ADD caslib GMOD completed for session MYSESS.
95   
96   %let train_table=demo;
97   %let cutoff=01jan2023;
98   %let n_vars=536;
99   %let n_rows=24160332;
100  /* %let n_rows=1000; */
101  
102  /* create source table in CAS */
103  data gmod.&train_table._1;
104    array vars_ {&n_vars} 8 (&n_vars*1234567890);
105    format email_dt date9.;
106    do i=1 to &n_rows;
107      email_dt=rand('integer',-5,5)+"&cutoff"d;
108      output;
109    end;
110  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step has no input data set and will run in a single thread.
NOTE: The table demo_1 in caslib GMOD has 24160332 observations and 538 variables.
NOTE: DATA statement used (Total process time):
      real time           4:19.64
      cpu time            0.37 seconds
      
111  
112  /* run data step as shared creating a new table in CAS */
113  data gmod.&train_table._2(compress=yes)/sessref=&sessref;
114    set gmod.&train_table._1;
115    format split $10.;
116    if email_dt<="&cutoff"d then split="train";
117    else split="test";
118  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 24160332 observations read from the table DEMO_1 in caslib GMOD.
NOTE: The table demo_2 in caslib GMOD has 24160332 observations and 539 variables.
NOTE: DATA statement used (Total process time):
      real time           39.41 seconds
      cpu time            0.07 seconds
      
119  
120  /* run data step as shared creating a new table in CAS */
121  data gmod.&train_table._2_uncompressed/sessref=&sessref;
122    set gmod.&train_table._1;
123    format split $10.;
124    if email_dt<="&cutoff"d then split="train";
125    else split="test";
126  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 24160332 observations read from the table DEMO_1 in caslib GMOD.
NOTE: The table demo_2_uncompressed in caslib GMOD has 24160332 observations and 539 variables.
NOTE: DATA statement used (Total process time):
      real time           2:22.31
      cpu time            0.18 seconds
      
127  
128  /* run data step as shared replacing the existing table in CAS */
129  data gmod.&train_table._1(compress=yes) /sessref=&sessref;
130    set gmod.&train_table._1;
131    format split $10.;
132    if email_dt<="&cutoff"d then split="train";
133    else split="test";
134  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 24160332 observations read from the table DEMO_1 in caslib GMOD.
NOTE: The table demo_1 in caslib GMOD has 24160332 observations and 539 variables.
NOTE: DATA statement used (Total process time):
      real time           44.76 seconds
      cpu time            0.09 seconds
      
135  
136  /* free up memory: Delete no more required CAS tables */
137  proc cas;
138    session="mysess";
139    table.dropTable /
140    caslib="gmod",
141    name="&train_table._1",
142    quiet=TRUE
143    ;
144    run;
NOTE: Active Session now MYSESS.
145  quit;
NOTE: PROCEDURE CAS used (Total process time):
      real time           0.25 seconds
      cpu time            0.00 seconds
      
146  
147  cas mysess terminate;
NOTE: Libref GMOD has been deassigned.
NOTE: Deletion of the session MYSESS was successful.
NOTE: The default CAS session MYSESS identified by SAS option SESSREF= was terminated. Use the OPTIONS statement to set the 
      SESSREF= option to an active session.
NOTE: Request to TERMINATE completed for session MYSESS.
148  
149  *** SAS Studio trailer code ***160  

Findings in my environment:

  1. A data step that creates a new CAS table instead of overwriting an existing CAS table performed a bit better. 
  2. Compression of the CAS table increased performance quite a bit despite the overhead for compression/de-compression (40 sec as compared to 2min 20 sec).
    1. Whether compression is beneficial depends on the table, its columns, and how it is used.
      https://go.documentation.sas.com/doc/en/pgmsascdc/v_045/casref/p1mj007d8jq6swn1kwfysjl72fxh.htm 

Further considerations:

  1. CAS memory comes at a premium. If there is not enough available, paging happens, which leads to very significant performance degradation.
    https://communities.sas.com/t5/SAS-Communities-Library/4-Rules-to-Understand-CAS-Management-of-In-Me...
  2. With SAS 9.4, many users never clean out tables in WORK, but WORK only fills up disk space, which gets cleaned up once the SAS session terminates.
    With CAS tables it's memory, and it is paramount to preserve memory so you don't run out of it and end up in "paging mode".
  3. You've got the following options to remove session scope tables that aren't needed anymore:
    1. Terminate the CAS session
    2. Delete the table explicitly (one coding option in the sample code provided)
    3. Define a lifetime for the table  https://communities.sas.com/t5/SAS-Communities-Library/Tired-of-deleting-temporary-CAS-tables-Let-CA... 
  4. Only keep the variables in the CAS table that are required
  5. Depending on the use case, create only a single copy of the table, for example by using the CAS data set option copies=1 (installation default is 2 afaik)
    https://communities.sas.com/t5/SAS-Communities-Library/CAS-data-distribution-DUPLICATE-a-REPLICATION... 
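
Points 4 and 5 can be combined with compression in a single DATA step. This is only a sketch, assuming an active CAS session and the gmod libref from the thread; customer_id is a hypothetical column name used for illustration, so replace the list with the variables you actually need:

```sas
/* Hypothetical variable list: customer_id is an illustrative name, not from the thread */
%let keep_vars = customer_id email_dt;

/* copies=1 and compress=yes are CAS data set options on the output table */
data gmod.&train_table._2(copies=1 compress=yes);
  set gmod.&train_table(keep=&keep_vars);  /* read only the columns that are needed */
  length split $ 5;
  if email_dt <= "&cutoff"d then split = "train";
  else split = "test";
run;
```

Whether fewer copies are acceptable depends on how much fault tolerance the table needs, so treat copies=1 as a trade-off rather than a default.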

If you really can't get better performance then talk to your SAS admin so he/she can investigate how much CAS memory is available (it's shared so other people can consume too much or there are too many promoted tables loaded) and verify that CAS_DISK_CACHE is configured as it should be.

 

 

 

 

dcortell
Pyrite | Level 9

So I tried to generate a new table from the existing one using the DATA step, and in addition to the long run time, this time an error was also triggered:

 

80   data gmod.&train_table._2;
81   set gmod.&train_table;
82   format split $10.;
83   if email_dt<="&cutoff9"d then split="train";
84   else split="test";
85   run;

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
ERROR: Cloud Analytic Services failed writing to system disk space. Please contact your administrator.
ERROR: Cloud Analytic Services failed writing to system disk space. Please contact your administrator.
ERROR: Cloud Analytic Services failed writing to system disk space. Please

 

I will probably check with my admin to find out what is going on here.

Patrick
Opal | Level 21

Yes, talk to your SAS Admin. This error strongly suggests a resource constraint that needs to be resolved.
https://sas.service-now.com/csm?id=kb_article_view&sysparm_article=KB0036522 
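
To give a rough sense of scale for that conversation, here is a back-of-the-envelope estimate of the uncompressed size of the output table. It assumes every one of the 539 columns is an 8-byte numeric (a simplification; the real table may differ) and ignores redundant copies and any per-table overhead:

```sas
data _null_;
  n_rows = 24160332;    /* observations reported in the log */
  n_vars = 539;         /* variables in the output table, including split */
  bytes_per_value = 8;  /* assumption: all columns are 8-byte numerics */
  gib = n_rows * n_vars * bytes_per_value / 1024**3;
  put "Approximate uncompressed size in GiB: " gib 8.1;
run;
```

Under these assumptions the table comes to roughly 97 GiB per copy, so writing a second uncompressed table needs about that much again in memory or in the CAS disk cache.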

 

Please share what the root cause of the issue was once your SAS Admin got to the bottom of it.
