With SAS Viya 3.3, the SAS Data Connector for Hadoop and SAS Data Connect Accelerator for Hadoop can read data from SAS HDMD tables stored in HDFS, in addition to Hive tables. A SAS HDMD table is an XML-based metadata file, generated by PROC HDMD, that describes the contents of a file stored in HDFS.
This enables you to read data from delimited, binary, and XML-formatted files stored in HDFS and load it into CAS in serial or parallel mode. You can use the SAS Data Connect Accelerator and the SAS Embedded Process (EP) to load a CAS table from HDMD tables in parallel.
The SAS Data Connector and Data Connect Accelerator for Hadoop are part of the SAS Viya version of SAS/ACCESS Interface to Hadoop, which is used for serial and parallel data reads and writes between CAS and various data servers.
PROC HDMD generates XML-based metadata that describes the contents of a data file stored in HDFS. The file type for the XML-based metadata generated by PROC HDMD is SASHDMD (for example, cust_table.sashdmd). This metadata, also known as a SASHDMD descriptor, is stored in HDFS. It enables SAS/ACCESS Interface to Hadoop and the SAS high-performance procedures to read Hadoop data directly, without an intermediate metadata repository such as Hive.
When HDFS_METADIR= is specified as a LIBNAME connection option, SAS/ACCESS does not connect to Hive; instead, it looks for an XML-based HDMD metadata file. When PROC HDMD is used with a Hadoop LIBNAME, it creates a SASHDMD file containing the metadata.
PROC HDMD can describe the columns in a single HDFS data file or in an HDFS folder containing a list of data files with the same file format. When it describes one HDFS file, the SASHDMD descriptor contains the complete file path. When it describes an HDFS folder, the SASHDMD descriptor contains an HDFS directory path.
PROC HDMD can be used to describe tabular HDFS files in delimited, binary, and XML formats.
Examples:
The following example describes the creation of HDMD metadata for a "|" (pipe) delimited HDFS file using PROC HDMD.
File location on HDFS: /user/utkuma/hdmd/data/pipedata_dept.txt
Data file contents:
123|Accounts
124|HR
125|Sales
126|Payroll
127|Marketing
128|Purchase
PROC HDMD statement:
OPTIONS SET=SAS_HADOOP_JAR_PATH="/opt/MyHadoop/CDH/Jars";
OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/MyHadoop/CDH/Config";

libname HDP hadoop user=utkuma server='sascdh01.race.sas.com'
        hdfs_metadir='/user/utkuma/hdmd/meta'
        hdfs_datadir='/user/utkuma/hdmd/data';

proc hdmd name=hdp.pipedata_dept
     file_format=delimited sep='|'
     DATA_FILE='pipedata_dept.txt';
   COLUMN DeptId int;
   COLUMN DeptName char(15);
run;
The following example describes the creation of HDMD metadata using PROC HDMD for an HDFS folder containing a list of "|" delimited data files.
Folder and file location on HDFS:
[hadoop@sascdh01 ~]$ hadoop fs -ls /user/utkuma/hdmd/data/dept
Found 5 items
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept0.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept1.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept2.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept3.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept4.txt
PROC HDMD statement:
… ……
proc hdmd name=hdp.deptM
     file_format=delimited sep='|'
     DATA_DIR='dept';
   COLUMN DeptId int;
   COLUMN DeptName char(15);
run;
To view the metadata of an HDMD table, you can use the DESCRIBE option with PROC HDMD.
73 proc hdmd name=hdp.deptM describe;
74 run;
PROC HDMD NAME=HDP.DEPTM
FILE_FORMAT=DELIMITED ENCODING=UTF8 SEP='|' BYTE_ORDER=LITTLEENDIAN
FILE_TYPE=DELIMITED
DATA_DIR='/user/utkuma/hdmd/data/dept'
META_DIR='/user/utkuma/hdmd/meta';
COLUMN /* 1 */ DeptId INT OFFSET=0 BYTES=4 CTYPE=INT32;
COLUMN /* 2 */ DeptName CHAR(15) OFFSET=0 BYTES=15 CTYPE=CHAR;
NOTE: PROCEDURE HDMD used (Total process time):
For more information about PROC HDMD and the options available for each file format, see the SAS documentation.
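For illustration, here is a minimal sketch of an HDMD descriptor for a binary (fixed-record) file. The data file name and the column layout are hypothetical, not from this article, so check the PROC HDMD documentation for the exact options your file format requires.

/* Hypothetical sketch: describe a fixed-record binary HDFS file.   */
/* The file name and column layout are assumptions for illustration. */
proc hdmd name=hdp.dept_bin
     file_format=binary
     data_file='dept_bin.dat';   /* hypothetical data file */
   COLUMN DeptId int;            /* 4-byte binary integer   */
   COLUMN DeptName char(15);     /* 15-byte character field */
run;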
The data from a SAS HDMD table can be loaded into CAS using the serial or parallel method. For a parallel load, you must have the SAS Data Connect Accelerator for Hadoop installed on the CAS controller and on each CAS worker node, along with the SAS EP installed on the Hadoop cluster nodes. You also need the Hadoop client JARs and client configuration files on the CAS controller node. For a serial load, you must have the SAS Data Connector for Hadoop installed on the CAS controller node, along with the Hadoop client JARs and client configuration files on the CAS controller node. You can also use the multi-node data transfer facility by installing the Data Connector for Hadoop, the Hadoop client JARs, and the client configuration files on each CAS worker node.
When defining a CASLIB with the Hadoop source type, you can specify HDFSMETADIR= and HDFSDATADIR= in the connection options to enable access to SAS HDMD metadata. When you specify HDFSMETADIR= as a connection option, the Data Connector and Data Connect Accelerator do not connect to Hive; instead, they look for XML-based HDMD metadata to access the HDFS data file.
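For a serial load, the CASLIB differs only in the dataTransferMode= value. Here is a minimal sketch of a serial-mode CASLIB; the server and path values are reused from the parallel example that follows.

/* Minimal sketch of a serial-mode CASLIB (uses SAS Data Connector). */
/* Server and path values are taken from the parallel example below. */
caslib MyHdmdSer desc='Serial HDMD caslib'
   datasource=(srctype="hadoop",
               server="sascdh01.race.sas.com",
               username="utkuma",
               dataTransferMode="serial",
               hadoopconfigdir="/opt/MyHadoop/CDH/Config",
               hadoopjarpath="/opt/MyHadoop/CDH/Jars",
               hdfsDataDir="/user/utkuma/hdmd/data",
               hdfsMetaDir="/user/utkuma/hdmd/meta");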
The following code example describes the data load from a SAS HDMD table into CAS using "parallel" data transfer mode. It assigns a CASLIB in parallel mode, with the HDFSMETADIR= and HDFSDATADIR= options to enable access to the HDMD metadata. Using PROC CASUTIL and the loadTable CAS action, it loads the data from the HDMD table into CAS. Notice that the log says "Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop".
CAS mySession SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib MyHdmd desc='HdmdCaslib'
   datasource=(srctype="hadoop",
               server="sascdh01.race.sas.com",
               username="utkuma",
               dataTransferMode="parallel",
               hadoopconfigdir="/opt/MyHadoop/CDH/Config",
               hadoopjarpath="/opt/MyHadoop/CDH/Jars",
               hdfsDataDir="/user/utkuma/hdmd/data",
               hdfsMetaDir="/user/utkuma/hdmd/meta");

proc casutil;
   load casdata="deptm" casout="deptm"
        incaslib="MyHdmd" outcaslib="MyHdmd"
        replace;
run;

CAS mySession TERMINATE;
Log extract from SAS code execution:
….
84
85 proc casutil;
NOTE: The UUID '42936e10-c499-0e4a-a764-5e405a91b8e4' is connected using session MYSESSION.
86 load casdata="deptm" casout="deptm"
87 incaslib="MyHdmd" outcaslib="MyHdmd"
88 replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services made the external data from deptm available as table DEPTM in caslib MyHdmd.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: real time 29.974952 seconds
NOTE: cpu time 0.884905 seconds (2.95%)
NOTE: total nodes 8 (32 cores)
NOTE: total memory 250.11G
NOTE: memory 64.41M (0.03%)
NOTE: The Cloud Analytic Services server processed the request in 29.974952 seconds.
89 run;
…
..
If we look at the Hadoop cluster resource manager for the corresponding job execution, it shows that the application executed with the SAS EP.
If we look at the log from an executed mapper process, it also shows that the SAS EP was used while feeding data to CAS.
……..
log4j:WARN No appenders could be found for logger (com.sas.hadoop.ep.TaskAgent$ShutdownHook).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Log Type: stdout
Log Upload Time: Tue Mar 13 16:26:31 -0400 2018
Log Length: 195
20180313:16.26.22.67: 00000010:WARNING: [01S02]Current catalog set to SASEP (0x80fff8bd)
20180313:16.26.22.79: 00000017:NOTE: All Embedded Process DS2 execution instances completed with SUCCESS.
Log Type: syslog
Log Upload Time: Tue Mar 13 16:26:31 -0400 2018
Log Length: 9885
Showing 4096 bytes of 9885 total. Click here for the full log.
22,666 INFO [main] com.sas.hadoop.ep.EPNativeInterface: SAS Embedded Process task is now connected.
2018-03-13 16:26:22,699 INFO [main] com.sas.hadoop.ep.EPNativeInterface: DS2 program is now prepared and ready for execution.
2018-03-13 16:26:22,702 INFO [main] com.sas.hadoop.ep.EPNativeInterface: Task ID 1 does not write output metadata.
2018-03-13 16:26:22,712 INFO [DonAlvaro] com.sas.hadoop.ep.Alvaro: Alvaro has been invoked to guide current task execution.
…….
Data filtering is supported when loading a SAS HDMD table into CAS using the SAS Data Connector for Hadoop or the SAS Data Connect Accelerator for Hadoop. The following SAS log describes the data load from an HDMD table into CAS with a data filter condition.
73   proc casutil;
NOTE: The UUID '418734c4-1480-d540-9a20-a001a79fc762' is connected using session MYSESSION.
74      load casdata="deptm" casout="deptm"
75           incaslib="MyHdmd" outcaslib="MyHdmd"
76           options={where="DeptId > 200 "}
77           replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services made the external data from deptm available as table DEPTM in caslib MyHdmd.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE:       real time               29.167724 seconds
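In source form, the filter is passed through the options= parameter of the CASUTIL LOAD statement, as extracted from the log above:

proc casutil;
   /* WHERE filter is pushed into the load via the options= parameter */
   load casdata="deptm" casout="deptm"
        incaslib="MyHdmd" outcaslib="MyHdmd"
        options={where="DeptId > 200"}
        replace;
run;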
You can save a CAS table as a SAS HDMD table, written as a text file with the default "^A" column delimiter. Appending data from a CAS table to an existing HDMD table is not supported.
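In source form, the save uses the CASUTIL SAVE statement, as extracted from the log that follows:

proc casutil;
   /* Save the in-memory CAS table as a new HDMD table in HDFS */
   save casdata="deptm" casout="deptm_new"
        incaslib="MyHdmd" outcaslib="MyHdmd" replace;
run;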
The following SAS log describes saving data from a CAS table to a SAS HDMD table using the Data Connect Accelerator.
.....
73   proc casutil;
NOTE: The UUID 'bfdc4bbe-27ee-a947-864d-4dacf5ed31b2' is connected using session MYSESSION.
74      save casdata="deptm" casout="deptm_new"
75           incaslib="MyHdmd" outcaslib="MyHdmd" replace ;
NOTE: Executing action 'table.save'.
NOTE: Performing parallel SaveTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services saved the file deptm_new in caslib MYHDMD.
NOTE: Action 'table.save' used (Total process time):
NOTE:       real time               30.204765 seconds
.....
......
In Hadoop HDFS, notice a new folder containing a list of data files for the saved HDMD table.
[hadoop@sascdh01 ~]$ hadoop fs -ls /user/utkuma/hdmd/data/deptm_new
Found 3 items
-rw-r--r--   3 utkuma sasusers         69 2018-03-14 15:35 /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-09-m-00000-SWCH-M00000-T1.dlv
-rw-r--r--   3 utkuma sasusers        138 2018-03-14 15:35 /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-10-m-00002-SWCH-M00002-T0.dlv
-rw-r--r--   3 utkuma sasusers        138 2018-03-14 15:35 /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-10-m-00005-SWCH-M00005-T1.dlv
The content of a data file shows that it is a delimited text file with the default "^A" column delimiter (a non-printing character, so the columns appear to run together).
[hadoop@sascdh01 ~]$ hadoop fs -cat /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-09-m-00000-SWCH-M00000-T1.dlv
323Accounts
324HR
325Sales
326Payroll
327MArketing
328Purchase
Note: To create XML-based metadata for an HDFS data file, you must have SAS/ACCESS Interface to Hadoop on the SAS client (for example, SAS Studio) to access Hadoop using the Hadoop engine.