
Accessing SPDE files with CAS


The SAS Scalable Performance Data Engine (SPD Engine, or SPDE) was introduced long ago, in SAS 9.1, and was extended in SAS 9.4M1 to support HDFS as its storage layer. Some of you have been using it for a while and have probably invested time and effort in it. If you move to Viya and CAS, how are you going to leverage your SPDE data? This article provides an overview of accessing SPDE files from CAS.

 

When using the SPD Engine, you can store data in OS directories or in HDFS. Let's start with the OS directories case.

 

SPDE on OS directories

 

To review the main components of SPDE data, here is an example of a standard LIBNAME statement that assigns an SPD Engine library.

 

libname spdefs spde "/opt/sas/spde/meta"  
   datapath=("/opt/sas/spde/data/data1",
             "/opt/sas/spde/data/data2",
             "/opt/sas/spde/data/data3")  
   indexpath=("/opt/sas/spde/index/index1",
              "/opt/sas/spde/index/index2") ;

 

An SPDE library is composed of:

  • A physical location for the metadata files (file descriptors) of the SPDE tables, defined through the primary path (the METAPATH option can also be used to set additional metadata paths)
  • One or more physical locations for the data files (partitions), defined with the DATAPATH option
  • One or more physical locations for the index files, defined with the INDEXPATH option

If you want to access the same data from CAS, you have to define a CASLIB as follows:

 

caslib cas_spde_fs desc="SPDE Files on CAS Controller" 
   dataSource=(srctype="spde",
               dataTransferMode="serial",
               mdfpath="/opt/sas/spde/meta",
               datapath="/opt/sas/spde/data/data4") ;

 

Notice that:

  • The MDFPATH option points to the full primary path of the SPDE library.
  • The DATAPATH option is not required and is not used for reading files: CAS locates the data files using the metadata file information. In the CASLIB context, DATAPATH defines the target location when you save a CAS table to SPDE, and only a single path is supported.
  • There is no INDEXPATH option. CAS discovers the indexes from the metadata file information and uses them for serial loads when a WHERE clause is specified, but CAS cannot create SPDE indexes when saving a CAS table.

You might wonder how you can access SPDE files when they are located on the SAS 9 server, which is probably a separate machine.

 

That's a good question. Either you move/copy them from your SAS 9 environment to the CAS Controller, or you make them available to the CAS Controller through a shared file system.

 

Since CAS relies on the metadata component file (the .mdf file) to find the data files, you must make the SPDE data and index files available on the CAS Controller (copied or shared) in exactly the same directories as they were stored on the SAS 9 machine. Otherwise, you may get this kind of message:

 

ERROR: Unable to obtain file system status for path /tmp/spde/data/data1/prdsale.dpf.00002aac.0.2.spds9: Reason=File was not found.

 

Which CAS loading modes does SPDE on OS directories support?

 

Serial, for sure. SPDE files must be available from the CAS Controller (or from the single CAS server if CAS is SMP). SPDE indexes are used (serial load only) if a WHERE clause is specified.
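
As a quick illustration, here is a minimal sketch of a filtered serial load. The table name and WHERE expression are illustrative, and I am assuming the WHERE= option of PROC CASUTIL's LOAD statement (which passes the filter to the loadTable action):

proc casutil incaslib="cas_spde_fs" outcaslib="cas_spde_fs" ;
   /* load only the rows that match the WHERE expression;          */
   /* on a serial load, existing SPDE indexes can speed the filter */
   load casdata="prdsale" casout="prdsale_usa"
        where="country='U.S.A.'" replace ;
run ;
quit ;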

 

Multi-node as well, on MPP CAS. If the SPDE files are accessible from all the CAS nodes (Controller + Workers), then a multi-node load is possible with the use of the NUMREADNODES option.

 

I said all the CAS nodes: if the SPDE files are available on only a few nodes, then multi-node loading is not possible at all, even on those few nodes.

 

The NUMREADNODES default value is 0 (in contrast with other data connectors, where it is set to 1). That means that, by default, CAS tries to load the SPDE data in multi-node mode using all the available workers:

 

 73         caslib cas_spde_fs desc="SPDE Files on all CAS nodes"
 74            dataSource=(srctype="spde",
 75                        dataTransferMode="serial",
 76                        mdfpath="/opt/sas/spde/meta",
 77                        datapath="/opt/sas/spde/data/data4",
 78                        dfdebug=sqldetails) ;
 NOTE: Executing action 'table.addCaslib'.
 NOTE: 'CAS_SPDE_FS' is now the active caslib.
 NOTE: Cloud Analytic Services added the caslib 'CAS_SPDE_FS'.
 NOTE: Action 'table.addCaslib' used (Total process time):
 NOTE:       real time               0.057634 seconds
 NOTE:       cpu time                0.060187 seconds (104.43%)
 NOTE:       total nodes             8 (32 cores)
 NOTE:       total memory            250.11G
 NOTE:       memory                  5.90M (0.00%)
 NOTE: Action to ADD caslib CAS_SPDE_FS completed for session MYSESSION.

 79         proc casutil incaslib="cas_spde_fs" outcaslib="cas_spde_fs" ;
 NOTE: The UUID '6530bd17-5ddf-814a-b156-1e02f4b344ef' is connected using session MYSESSION.
 80            load casdata="prdsale" casout="prdsale" replace ;
 NOTE: Executing action 'table.loadTable'.
 NOTE: Performing serial LoadTable action using SAS Data Connector to SPDE.
 NOTE: Reading Rows from all worker nodes.
 NOTE: Up to 205716 rows will be read each by 7 workers for a total of 1440000 rows.
 NOTE: Worker #0 will read obs 1 to 205716.
 NOTE: Worker #1 will read obs 205717 to 411432.
 NOTE: Worker #2 will read obs 411433 to 617148.
 NOTE: Worker #3 will read obs 617149 to 822864.
 NOTE: Worker #4 will read obs 822865 to 1028580.
 NOTE: Worker #5 will read obs 1028581 to 1234296.
 NOTE: Worker #6 will read obs 1234297 to 1440000.
 NOTE: Cloud Analytic Services made the external data from prdsale available as table PRDSALE in caslib cas_spde_fs.
 NOTE: Action 'table.loadTable' used (Total process time):
 NOTE:       real time               0.368830 seconds
 NOTE:       cpu time                1.694141 seconds (459.33%)
 NOTE:       total nodes             8 (32 cores)
 NOTE:       total memory            250.11G
 NOTE:       memory                  149.84M (0.06%)
 NOTE: The Cloud Analytic Services server processed the request in 0.36883 seconds.
 81         run ;
 82         quit ;
 
 NOTE: PROCEDURE CASUTIL used (Total process time):
       real time           0.37 seconds
       cpu time            0.00 seconds
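
If you want to limit how many workers take part in the read, you can set NUMREADNODES explicitly. A sketch, assuming NUMREADNODES is accepted as a CASLIB data source option for the SPDE source type (paths are the same illustrative ones as above):

caslib cas_spde_fs desc="SPDE Files on all CAS nodes"
   dataSource=(srctype="spde",
               dataTransferMode="serial",
               mdfpath="/opt/sas/spde/meta",
               datapath="/opt/sas/spde/data/data4",
               numreadnodes=4) ;   /* read with at most 4 worker nodes */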

 

Saving CAS tables to SPDE on OS directories?

 

Yes, it's possible. The metadata file is created in the MDFPATH location and, as mentioned earlier, the single-path CASLIB DATAPATH option is used for writing the SPDE data files.

 

As a result, a saved CAS table only uses a single path for its data files, not several. If the same SPDE library contains tables created both natively and from CAS, you will eventually see unbalanced usage of your multiple paths, because CAS only writes to one.
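
For example, here is a minimal PROC CASUTIL sketch that saves an in-memory table back to SPDE (the caslib and table names are illustrative):

proc casutil ;
   /* write the in-memory table PRDSALE from the CASUSER caslib as an */
   /* SPDE table in cas_spde_fs: metadata goes to the mdfpath, data   */
   /* files go to the single datapath                                 */
   save casdata="prdsale" incaslib="casuser"
        casout="prdsale" outcaslib="cas_spde_fs" replace ;
run ;
quit ;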

 

SPDE tables created from CAS can be shared back with your SAS 9 environment.

 

But be careful: CAS uses UTF-8 as its default encoding for managing data. If your SAS 9 environment is on the latest maintenance (9.4M5), there shouldn't be any issue reading SPDE files created by CAS (CEDA was implemented for SPDE files in 9.4M5). If your SAS session is 9.4M4 or earlier, you might have problems reading these files, because only a SAS session started with the UTF-8 encoding (typically by using the UTF-8 configuration file shipped under the nls/u8 directory of your SAS installation) can read those UTF-8 files.

 

Data is saved serially. Multi-node save is not supported (there is no NUMWRITENODES option).

 

It’s not possible to create SPDE indexes directly from CAS.

 

SPDE on HDFS

 

What about SPDE on HDFS?

 

Here is a libname statement example that defines an SPDE library on HDFS:

 

option set=SAS_HADOOP_CONFIG_PATH="/opt/MyHadoop/CDH/Config" ; 
option set=SAS_HADOOP_JAR_PATH="/opt/MyHadoop/CDH/Jars" ;

libname spdehdfs spde "/casdm/spde/meta"  
   datapath=("/casdm/spde/data")  
   indexpath=("/casdm/spde/index") 
   hdfs=yes
   hdfsuser=sastest1 ;

 

Notice:

  • The two global options that define how to contact the Hadoop cluster (Hadoop configuration files and JARs)
  • A single-path DATAPATH option (multiple paths are not supported, which would make no sense in HDFS anyway)
  • A single-path INDEXPATH option (no multiple paths either)
  • The HDFS option, which specifies that HDFS is used for storage
  • Optionally, the HDFSUSER option to connect to Hadoop as a specific HDFS user

If you want to access the same data from CAS, you have to define a CASLIB as follows:

 

caslib cas_spde_hdfs desc="SPDE Files on Hadoop HDFS" 
   dataSource=(srctype="spde",
      username="sastest1",
      hdfs="true",
      dataTransferMode="serial",
      hadoopconfigdir="/opt/MyHadoop/CDH/Config",
      hadoopjarpath="/opt/MyHadoop/CDH/Jars",
      mdfpath="/casdm/spde/meta",
      datapath="/casdm/spde/data") ;

 

What is interesting about using HDFS for SPDE files is that you don't have to move or share the data between your SAS 9 environment and your CAS environment: by definition, SPDE files on HDFS are shared and accessible from any properly configured environment.

 

Which CAS loading modes does SPDE on HDFS support?

 

Serial, for sure. SPDE files on HDFS must be accessible from the CAS Controller (or from the single CAS server if CAS is SMP). SPDE indexes are used (serial load only) if a WHERE clause is specified.

 

Multi-node as well, on MPP CAS. If the HDFS cluster is accessible from all the CAS nodes (Controller + Workers), then a multi-node load is possible with the use of the NUMREADNODES option.

 

Again, I said all the CAS nodes: if the HDFS cluster is accessible from only a few nodes, then multi-node loading is not possible at all, even on those few nodes.

 

The NUMREADNODES default value is 0 (in contrast with other data connectors, where it is set to 1). That means that, by default, CAS tries to load the SPDE data in multi-node mode using all the available workers.

 

Parallel load is also possible on MPP CAS. If your license includes “SAS In-Database Technologies for Hadoop” (which unlocks the “SAS Data Connect Accelerator for Hadoop” component and the use of the SAS Embedded Process on Hadoop), then loading data into CAS in parallel from SPDE files on HDFS is possible:

 

 73         caslib cas_spde_hdfs desc="SPDE Files on Hadoop HDFS"
 74            dataSource=(srctype="spde",
 75               username="sastest1",
 76               hdfs="true",
 77               dataTransferMode="parallel",
 78               hadoopconfigdir="/opt/MyHadoop/CDH/Config",
 79               hadoopjarpath="/opt/MyHadoop/CDH/Jars",
 80               mdfpath="/casdm/spde/meta",
 81               datapath="/casdm/spde/data",
 82               dfdebug=epall) ;
 NOTE: Executing action 'table.addCaslib'.
 NOTE: 'CAS_SPDE_HDFS' is now the active caslib.
 NOTE: Cloud Analytic Services added the caslib 'CAS_SPDE_HDFS'.
 NOTE: Action 'table.addCaslib' used (Total process time):
 NOTE:       real time               0.060069 seconds
 NOTE:       cpu time                0.060475 seconds (100.68%)
 NOTE:       total nodes             8 (32 cores)
 NOTE:       total memory            250.11G
 NOTE:       memory                  5.91M (0.00%)
 NOTE: Action to ADD caslib CAS_SPDE_HDFS completed for session MYSESSION.

 83         proc casutil incaslib="cas_spde_hdfs" outcaslib="cas_spde_hdfs" ;
 NOTE: The UUID '0d54fe84-de72-0c4b-a86b-fd9b48d2e8c2' is connected using session MYSESSION.
 84            load casdata="prdsale" casout="prdsale" replace ;
 NOTE: Executing action 'table.loadTable'.
 NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for SPDE.
 NOTE: SAS Embedded Process tracking URL: http://myhadoop.example.com:8088/proxy/application_1526898318428_0005/
 NOTE: Job Status ......: SUCCEEDED
 NOTE: Job ID ..........: 1526898318428_5
 NOTE: Job Name ........: SASEP SuperReader /casdm/spde/data/prdsale_spde
 NOTE: File splits..... : 3
 NOTE: Input records ...: 1440000
 NOTE: Input bytes .....: 149760000
 NOTE: Output records ..: 0
 NOTE: Output bytes ....: 0
 NOTE: Transcode errors : 0
 NOTE: Truncations .....: 0
 NOTE: Map Progress ....: 100.00%
 NOTE: Cloud Analytic Services made the external data from prdsale available as table PRDSALE in caslib cas_spde_hdfs.
 NOTE: Action 'table.loadTable' used (Total process time):
 NOTE:       real time               28.483150 seconds
 NOTE:       cpu time                2.399137 seconds (8.42%)
 NOTE:       total nodes             8 (32 cores)
 NOTE:       total memory            250.11G
 NOTE:       memory                  114.89M (0.04%)
 NOTE: The Cloud Analytic Services server processed the request in 28.48315 seconds.
 85         run ;
 86         quit ;
 
 NOTE: PROCEDURE CASUTIL used (Total process time):
       real time           28.49 seconds
       cpu time            0.01 seconds

 

Notice that the “SAS In-Database Technologies for Hadoop” license requires the “SAS/ACCESS Interface to Hadoop” license.

 

Saving CAS tables to SPDE on HDFS?

 

Again, it's possible. CAS makes use of the HDFS MDFPATH and DATAPATH options to save the CAS table to HDFS in the SPDE format.

 

SPDE tables on HDFS created from CAS can be shared back with your SAS 9 environment, with the same encoding caveats as SPDE files on OS directories:

  • On SAS 9.4M5, everything should be OK (CEDA)
  • On SAS 9.4M4 or earlier, the SAS session needs to be started with the UTF-8 encoding

Data is saved serially or in parallel (if “SAS Data Connect Accelerator for Hadoop” is licensed and SAS Embedded Process is deployed). Multi-node save is not supported (there is no NUMWRITENODES option).
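
The same PROC CASUTIL SAVE syntax shown earlier applies here; whether the save runs serially or in parallel is driven by the caslib's dataTransferMode setting. A sketch, reusing the illustrative names from above:

proc casutil ;
   /* cas_spde_hdfs was defined with dataTransferMode="parallel",  */
   /* so this save goes through the SAS Embedded Process on Hadoop */
   save casdata="prdsale" incaslib="casuser"
        casout="prdsale" outcaslib="cas_spde_hdfs" replace ;
run ;
quit ;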

 

It’s not possible to create SPDE indexes directly from CAS, even on HDFS.

 

Additional considerations

 

Partition Size

The default partition size used by the SPD libname engine is 128 MB. This value can be changed using the PARTSIZE option in the libname statement.
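
For instance, here is a sketch (paths are the illustrative ones used earlier) that raises the partition size to 256 MB on the SAS 9 side:

libname spdefs spde "/opt/sas/spde/meta"
   datapath=("/opt/sas/spde/data/data1",
             "/opt/sas/spde/data/data2")
   partsize=256M ;   /* override the 128 MB default partition size */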

 

When CAS creates an SPDE table (that is, when saving a CAS table to an SPDE data source), the default partition size is 16 MB. To align with the default SPDE value, you may want to use the PARTSIZE option in the CASLIB statement:

 

caslib cas_spde_fs desc="SPDE Files on CAS Controller" 
   dataSource=(srctype="spde",
               dataTransferMode="serial",
               mdfpath="/gelcontent/demo/DM/data/SPDE2/meta",
               datapath="/gelcontent/demo/DM/data/SPDE2/data/data4",
               dfdebug=sqldetails,
               partsize=128M) ;

 

The CASLIB PARTSIZE option is ignored (not supported) with dataTransferMode="parallel" on HDFS (that is, when saving a CAS table in parallel using the SAS EP). In this case, partitions can end up with various sizes, depending on how the SAS Embedded Process receives the data.

 

Licensing

Access to SPDE data from CAS is enabled through technical components known as the “SAS Data Connector to SPD Engine Files” and the “SAS Data Connect Accelerator for SPD Engine Files”. Both are included in every order that includes CAS and don't require any specific license.

 

However, the second one (the “SAS Data Connect Accelerator for SPD Engine Files”), which allows loading SPDE data located in HDFS into CAS in parallel (or saving CAS data to SPDE files in HDFS in parallel) using the SAS EP deployed in Hadoop, relies on:

  • The “SAS Data Connect Accelerator for Hadoop” component, which requires the “SAS In-Database Technologies for Hadoop” license
  • The “SAS In-Database Technologies for Hadoop” license, which in turn requires the “SAS/ACCESS Interface to Hadoop” license

In other words, you need both “SAS/ACCESS Interface to Hadoop” and “SAS In-Database Technologies for Hadoop” to load SPDE data located on HDFS into CAS in parallel.

 

Encoding

As mentioned earlier, the encoding of SPDE files can be an issue when sharing files, especially when you want to move SPDE files created by CAS back to the SAS 9 world. SAS 9.4M5 gives more flexibility (but does not solve every case), as it now supports CEDA for SPDE files. See the documentation for more information.

 

Thanks for reading.

Comments

Nice write-up, especially if you already have legacy SPDE data.

But what if you start with a clean Viya environment, or have (or plan) to fully migrate from 9.x?

Is it beneficial in Viya compared to other storage engines?

Or is it just to cope with legacy data stores?

I get some hints of the limitations of the OS directory case...

Hello @LinusH,

 

In my case, I find it useful because a customer of mine uses GRID with a lot of historical data (terabytes) stored in SPDE, and pushes some data into SAS VA 7.x. Soon there will be a full change, and Viya will be introduced.

 

The migration of all the data might take even longer than the technical migration, hence we are figuring out the best way to make the change. And there is a requirement to do it in a relatively short time frame, given the size of this SAS platform and its data.

 

Hence, reading SPDE tables from Viya is something I will be looking at very closely.

 

 

Hello @LinusH and @JuanS_OCS

 

Thanks for your comments.

 

I would say that CAS support for SPDE data mainly targets facilitating data movement/migration between SAS 9.4 and Viya, especially for customers who have used SPDE heavily for storing SAS data. Migration of huge volumes of existing SPDE data is one scenario. Co-existence of both environments (exchanging data between them) is another.

From a pure Viya/CAS perspective, there are new storage techniques based on the new default SASHDAT file format that are faster (to access and load into CAS) than SPDE. So on a standalone Viya/CAS environment, there is no reason to persist/save CAS tables as SPDE files.

 

Regards.

