
Executing SAS Data Step in CAS

by SAS Employee UttamKumar on 05-25-2017 10:07 AM

The SAS DATA step can be used to manipulate and prepare data for analysis and predictive modeling. In SAS Viya, the DATA step runs both in a traditional SAS client session and in SAS Cloud Analytic Services (CAS). A DATA step runs in a CAS server session using the single-program, multiple-data (SPMD) paradigm: the same program executes in multiple threads, and each thread operates on only part of the data. The benefit of running a DATA step in CAS, compared to Base SAS®, is that multiple cores process the DATA step and more input/output bandwidth is available across multiple machines. The DATA step does not restrict the number of input or output tables, although memory and other operating system limits may impose practical ones.

 

The DSACCEL=ANY option enables the DATA step to run in CAS. This option is on by default, so a DATA step executes in CAS provided that every input and output table reference in the step uses a CAS engine libref. When any input or output table is a non-CAS table, the DATA step executes on the client side (SAS workspace server). During client-side execution, the DATA step pulls the data from CAS to the client, performs the calculations, and, if the output is a CAS table, sends the result back to CAS to be stored as a new table.
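As a minimal sketch of the setup this requires (the session, caslib, and libref names here are illustrative, not from the original post):

```sas
options dsaccel=any;                  /* allow the DATA step to run in CAS (the default)  */

cas mysess;                           /* start a CAS session                              */
libname casdata cas caslib=casuser;   /* CAS engine libref: tables referenced through it  */
                                      /* are candidates for CAS-side execution            */

/* Both tables use the CAS engine libref, so this step can run in CAS */
data casdata.out;
   set casdata.in;
run;
```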

 

The following example demonstrates the execution of a DATA step in the CAS server session. Both the input and the output tables are in a CAS engine libref.

 

/* Convert MPG to KPL into a new CAS table */
data casdata.cars_kpl;
   set casdata.cars;
   KPL_City    = 0.425 * MPG_City;
   KPL_Highway = 0.425 * MPG_Highway;
run;

 

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 428 observations read from the table CARS in caslib CASUSER(sasdemo).
NOTE: The table cars_kpl in caslib CASUSER(sasdemo) has 428 observations and 17 variables.
NOTE: DATA statement used (Total process time):
      real time           0.16 seconds
      cpu time            0.00 seconds

 

 

The following example demonstrates the execution of a DATA step in the SAS client (SAS workspace server). Here one of the input tables (sashelp.cars) is not in a CAS engine libref, so the step cannot run in CAS.

 

/* Create a new CAS table by concatenating two tables: one from CAS and one (non-CAS) from a client library */
data casdata.cars_new;
   set casdata.cars sashelp.cars;
   KPL_City    = 0.425 * MPG_City;
   KPL_Highway = 0.425 * MPG_Highway;
run;

 

NOTE: To run DATA step in Cloud Analytic Services a CAS engine libref must be used with all data sets and all librefs in the program must refer to the same session.
NOTE: Could not execute DATA step code in Cloud Analytic Services. Running DATA step in the SAS client.

 

 

Impact of the MaxTableMem and the “copies” parameters

The maxTableMem property has no noticeable effect on the execution of a DATA step in CAS. Whether you set maxTableMem as low as ~16 MB or as high as ~16 GB, it does not affect the execution time of a DATA step in a CAS server session.

 

However, the value of the "copies" parameter on the CAS output table significantly affects execution time. In a multi-node CAS environment, with the default copies=1 on the output table, the DATA step takes longer: most of the time goes to data movement between the nodes while the output table (and its redundant copy) is created. If your program contains many steps, it is recommended that you specify copies=0 in every step except the last one, assuming that you want to keep the final result as a CAS table.
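This recommendation can be sketched as follows for a two-step program (the table and variable names are illustrative):

```sas
/* Intermediate result: copies=0 avoids creating a redundant copy,  */
/* so no extra data movement between nodes                          */
data casuser.stage1(copies=0);
   set casuser.bigtable;
   ratio = amount / total;
run;

/* Final result: keep the default redundancy for fault tolerance */
data casuser.final(copies=1 promote=yes);
   set casuser.stage1;
   high_ratio = (ratio > 0.5);
run;
```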

 

The following example demonstrates the impact of the "copies" value when executing a DATA step in CAS. We loaded a table of 100 million rows into a multi-node CAS environment and ran the same DATA step with copies=0 and with copies=1 on the CAS output table. The logs show that with copies=0 there is no data movement between the nodes and execution takes less time, whereas with the default copies=1 data moves between the nodes and execution takes longer.

 

The following example demonstrates the DATA step processing in CAS with copies=0:

 

data casuser.junk2(copies=0 promote=yes);
   set casuser.CASJunkTable;
   l = k*2;
   m = l/j;
run;

 

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: Executing action 'dataStep.runBinary'.
NOTE: Executing action 'table.promote'.
NOTE: Action 'table.promote' used (Total process time):
NOTE:       real time               0.080657 seconds
NOTE:       cpu time                0.188588 seconds (233.81%)
NOTE:       total nodes             4 (8 cores)
NOTE:       total memory            62.06G
NOTE:       memory                  364.97K (0.00%)
NOTE: There were 100000000 observations read from the table CASJUNKTABLE in caslib CASUSER(sasdemo).
NOTE: The table junk2 in caslib CASUSER(sasdemo) has 100000000 observations and 5 variables.
NOTE: Action 'datastep.runBinary' used (Total process time):
NOTE:       real time               7.845770 seconds
NOTE:       cpu time                40.669176 seconds (518.36%)
NOTE:       total nodes             4 (8 cores)
NOTE:       total memory            62.06G
NOTE:       memory                  364.97K (0.00%)
NOTE: DATA statement used (Total process time):
      real time           8.00 seconds
      cpu time            0.02 seconds

 

 

 

The following example demonstrates the DATA step processing in CAS with copies=1:

 

data casuser.junk3(promote=yes);
   set casuser.CASJunkTable;
   l = k*2;
   m = l/j;
run;

 

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: Executing action 'dataStep.runBinary'.
NOTE: Executing action 'table.promote'.
NOTE: Action 'table.promote' used (Total process time):
NOTE:       real time               0.106011 seconds
NOTE:       cpu time                0.111325 seconds (105.01%)
NOTE:       total nodes             4 (8 cores)
NOTE:       total memory            62.06G
NOTE:       memory                  566.66K (0.00%)
NOTE: There were 100000000 observations read from the table CASJUNKTABLE in caslib CASUSER(sasdemo).
NOTE: The table junk3 in caslib CASUSER(sasdemo) has 100000000 observations and 5 variables.
NOTE: Action 'datastep.runBinary' used (Total process time):
NOTE:       real time               38.682911 seconds
NOTE:       cpu time                57.775873 seconds (149.36%)
NOTE:       data movement time      31.019510 seconds
NOTE:       total nodes             4 (8 cores)
NOTE:       total memory            62.06G
NOTE:       memory                  566.66K (0.00%)
NOTE:       bytes moved             3.73G
NOTE: DATA statement used (Total process time):
      real time           38.85 seconds
      cpu time            0.01 seconds

 

 

Other notable limitations

While running a DATA step in a CAS server session, only the double, character, and VARCHAR data types are supported. Other CAS data types in input tables are converted to one of these three supported types.  
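For example, a VARCHAR column can be declared directly in a CAS-side DATA step with a LENGTH statement (the table and variable names here are illustrative):

```sas
data casuser.notes;
   length comment varchar(200);   /* variable-length character type, supported in CAS */
   set casuser.input_notes;
   comment = strip(raw_text);     /* stored without blank padding                     */
run;
```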

 

When running a DATA step in multiple threads, input table rows are divided among threads. Each thread of a DATA step sees only a part of the data, not the entire table. When dividing data among several threads of a DATA step, the results may not be the same as when only one thread is used. For example, when a RETAIN statement is used, a value is retained from one row to the next. Often, this approach is used to create a sum from all rows. Because each thread operates on only part of the data, each thread holds and stores a partial sum.  
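A short sketch of this pitfall (the table and variable names are illustrative): with a RETAIN-style running total, each thread outputs its own partial sum rather than one grand total. The automatic variable _THREADID_ makes the per-thread behavior visible.

```sas
data casuser.totals;
   set casuser.sales end=done;   /* each thread reads only its slice of the rows  */
   retain total 0;
   total + amount;               /* running sum within this thread only           */
   if done then do;              /* "done" is true at the end of each thread's    */
      thread = _threadid_;       /* slice, so one partial-sum row per thread      */
      output;
   end;
run;
```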

 

The full functionality of the Base SAS DATA step is not yet supported in CAS. For more information about executing a DATA step in CAS, including a list of limitations and unexpected discoveries, see the Saspedia pages Data step in CAS and Run a DATA step in CAS.

Comments
by New Contributor tkscuba
on ‎06-01-2017 03:14 PM

Thanks Uttam for the post. We are testing with CAS and loading sas7bdat files into CAS with both PROC CAS and CASUTIL. The system we are using is a 5-node CAS system (1 controller, 4 workers). Our CAS_DISK_CACHE is very fast; we are using an EMC D5 and GPFS on the D5 for our shared file system (input). We too see the copies=0 performance boost. What actually happens is that copies=1 (the default) forces the file to be read into RAM (CAS in-memory), but then it's copied down to CAS_DISK_CACHE (only as fast as your storage can write), and then CAS replicates the rows across the network (another potentially slow part). Our system is really hurting with copies=1 as we only have 1 GbE between workers. 10 GbE would help us a bit.

 

We do see a 25% increase in runtime if we don't set maxTableMem LARGER than the file size. We can also easily watch iostat and see massive IO to the CAS_DISK_CACHE when maxTableMem is smaller than the file size, because CAS is making a copy down on disk. This is NOT optimal all the time, especially with VERY large files that you only want to keep in memory temporarily. Also, many customers don't have fast CAS_DISK_CACHE. If you need to write to CAS_DISK_CACHE, you'd better have fast disks (SSD) or striped RAID storage, or you might find that maxTableMem really does matter. You might get lucky and see decent performance with a single disk for the cache, but only with a single user; wait until multiple instances start thrashing the poor CAS_DISK_CACHE.

 

Just wanted to say that your mileage may vary and that setting maxTableMem may be really important for performance, especially if you are treating CAS like SASWORK and don't want to waste disk space and have lots of RAM for CAS.

by SAS Employee UttamKumar
on ‎06-01-2017 04:17 PM

Hi tkscuba,

 

Thanks for reading the post and sharing your experience on this topic. Your comments are certainly helpful for other community members.

 

The maxTableMem parameter is just a hint to keep the data in memory. If the environment does not have available memory, CAS writes to the CAS_DISK_CACHE location even if you set maxTableMem LARGER than the file size. The parameter basically determines the block/file size when data is loaded into CAS.

 

Regards,

-Uttam   

by New Contributor tkscuba
on ‎06-01-2017 04:43 PM
Totally agree with that. We have lots of memory and I think some folks will use it more like SASWORK. Hopefully there will be some good best practices on CAS_DISK_CACHE design from a systems standpoint.
Hope you are good!
