With SAS Viya 3.3, the SAS Data Connector for Hadoop and SAS Data Connect Accelerator for Hadoop can read data from SAS HDMD tables stored in HDFS, in addition to Hive tables. A SAS HDMD table is an XML-based metadata file, generated by PROC HDMD, that describes the contents of a file stored in HDFS.
This enables you to read data from delimited, binary, and XML-formatted files stored in HDFS and load it into CAS in serial or parallel mode. You can use the SAS Data Connect Accelerator and the SAS Embedded Process (EP) to load a CAS table from HDMD tables in parallel.
The SAS Data Connector and Data Connect Accelerator for Hadoop are part of the SAS Viya version of SAS/ACCESS Interface to Hadoop, which is used for serial and parallel data reads and writes between CAS and various data servers.
PROC HDMD generates XML-based metadata that describes the contents of a data file stored in HDFS. The file type for the XML-based metadata generated by PROC HDMD is SASHDMD (for example, cust_table.sashdmd). This metadata, also known as a SASHDMD descriptor, is stored in HDFS. It enables SAS/ACCESS Interface to Hadoop and the SAS high-performance procedures to read Hadoop data directly, without an intermediate metadata repository such as Hive.
When HDFS_METADIR= is specified as a LIBNAME connection option, SAS/ACCESS does not connect to Hive; instead, it looks for an XML-based HDMD metadata file. When PROC HDMD is used with a Hadoop LIBNAME, it creates a SASHDMD file containing the metadata.
PROC HDMD can describe the columns in a single HDFS data file or in an HDFS folder containing a list of data files with the same file format. When it describes one HDFS file, the SASHDMD descriptor contains the complete file path. When it describes an HDFS folder, the SASHDMD descriptor contains an HDFS directory path.
PROC HDMD can be used to describe tabular HDFS files in delimited, binary, and XML formats.
Examples:
The following example describes the creation of HDMD metadata for a "|" (pipe) delimited HDFS file using PROC HDMD.
File location on HDFS: /user/utkuma/hdmd/data/pipedata_dept.txt
Data file contents:
123|Accounts
124|HR
125|Sales
126|Payroll
127|Marketing
128|Purchase
PROC HDMD statement:
OPTIONS SET=SAS_HADOOP_JAR_PATH="/opt/MyHadoop/CDH/Jars";
OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/MyHadoop/CDH/Config";

libname HDP hadoop user=utkuma server='sascdh01.race.sas.com'
        hdfs_metadir='/user/utkuma/hdmd/meta'
        hdfs_datadir='/user/utkuma/hdmd/data';

proc hdmd name=hdp.pipedata_dept
     file_format=delimited sep='|'
     DATA_FILE='pipedata_dept.txt';
   COLUMN DeptId int;
   COLUMN DeptName char(15);
run;
The following example describes the creation of HDMD metadata using PROC HDMD for an HDFS folder containing a list of "|" delimited data files.
Folder and file location on HDFS:
[hadoop@sascdh01 ~]$ hadoop fs -ls /user/utkuma/hdmd/data/dept
Found 5 items
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept0.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept1.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept2.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept3.txt
-rw-r--r-- 3 utkuma sasusers 75 2018-02-08 09:44 /user/utkuma/hdmd/data/dept/dept4.txt
PROC HDMD statement:
… ……
proc hdmd name=hdp.deptM
     file_format=delimited sep='|'
     DATA_DIR='dept';
   COLUMN DeptId int;
   COLUMN DeptName char(15);
run;
To view the metadata of an HDMD table, you can use the DESCRIBE option with PROC HDMD.
73 proc hdmd name=hdp.deptM describe;
74 run;
PROC HDMD NAME=HDP.DEPTM
FILE_FORMAT=DELIMITED ENCODING=UTF8 SEP='|' BYTE_ORDER=LITTLEENDIAN
FILE_TYPE=DELIMITED
DATA_DIR='/user/utkuma/hdmd/data/dept'
META_DIR='/user/utkuma/hdmd/meta';
COLUMN /* 1 */ DeptId INT OFFSET=0 BYTES=4 CTYPE=INT32;
COLUMN /* 2 */ DeptName CHAR(15) OFFSET=0 BYTES=15 CTYPE=CHAR;
NOTE: PROCEDURE HDMD used (Total process time):
For more information about PROC HDMD and the options available for each file format, see the SAS documentation.
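For illustration, here is a minimal sketch of an HDMD descriptor for a binary (fixed-record) file. The data file name and the column layout are hypothetical, not from this article, so check the PROC HDMD documentation for the exact options your file format requires.

/* Hypothetical sketch: describe a fixed-record binary HDFS file.   */
/* The file name and column layout are assumptions for illustration. */
proc hdmd name=hdp.dept_bin
     file_format=binary
     data_file='dept_bin.dat';   /* hypothetical data file */
   COLUMN DeptId int;            /* 4-byte binary integer   */
   COLUMN DeptName char(15);     /* 15-byte character field */
run;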
The data from a SAS HDMD table can be loaded into CAS using the serial or parallel method. For a parallel load, you must have the SAS Data Connect Accelerator for Hadoop installed on the CAS controller and on each CAS worker node, along with the SAS EP installed on the Hadoop cluster nodes. You also need the Hadoop client JARs and client configuration files on the CAS controller node. For a serial load, you must have the SAS Data Connector for Hadoop installed on the CAS controller node, along with the Hadoop client JARs and client configuration files on the CAS controller node. You can also use the multi-node data transfer facility by installing the Data Connector for Hadoop, the Hadoop client JARs, and the client configuration files on each CAS worker node.
When defining a CASLIB with the Hadoop source type, you can specify HDFSMETADIR= and HDFSDATADIR= in the connection options to enable access to SAS HDMD metadata. When you specify HDFSMETADIR= as a connection option, the Data Connector and Data Connect Accelerator do not connect to Hive; instead, they look for XML-based HDMD metadata to access the HDFS data file.
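For a serial load, the CASLIB differs only in the dataTransferMode= value. Here is a minimal sketch of a serial-mode CASLIB; the server and path values are reused from the parallel example that follows.

/* Minimal sketch of a serial-mode CASLIB (uses SAS Data Connector). */
/* Server and path values are taken from the parallel example below. */
caslib MyHdmdSer desc='Serial HDMD caslib'
   datasource=(srctype="hadoop",
               server="sascdh01.race.sas.com",
               username="utkuma",
               dataTransferMode="serial",
               hadoopconfigdir="/opt/MyHadoop/CDH/Config",
               hadoopjarpath="/opt/MyHadoop/CDH/Jars",
               hdfsDataDir="/user/utkuma/hdmd/data",
               hdfsMetaDir="/user/utkuma/hdmd/meta");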
The following code example describes the data load from a SAS HDMD table into CAS using "parallel" data transfer mode. It assigns a CASLIB in parallel mode, with the HDFSMETADIR= and HDFSDATADIR= options to enable access to the HDMD metadata. Using PROC CASUTIL and the loadTable CAS action, it loads the data from the HDMD table into CAS. Notice that the log says "Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop".
CAS mySession SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib MyHdmd desc='HdmdCaslib'
   datasource=(srctype="hadoop",
               server="sascdh01.race.sas.com",
               username="utkuma",
               dataTransferMode="parallel",
               hadoopconfigdir="/opt/MyHadoop/CDH/Config",
               hadoopjarpath="/opt/MyHadoop/CDH/Jars",
               hdfsDataDir="/user/utkuma/hdmd/data",
               hdfsMetaDir="/user/utkuma/hdmd/meta");

proc casutil;
   load casdata="deptm" casout="deptm"
        incaslib="MyHdmd" outcaslib="MyHdmd"
        replace;
run;

CAS mySession TERMINATE;
Log extract from SAS code execution:
….
84
85 proc casutil;
NOTE: The UUID '42936e10-c499-0e4a-a764-5e405a91b8e4' is connected using session MYSESSION.
86 load casdata="deptm" casout="deptm"
87 incaslib="MyHdmd" outcaslib="MyHdmd"
88 replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services made the external data from deptm available as table DEPTM in caslib MyHdmd.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: real time 29.974952 seconds
NOTE: cpu time 0.884905 seconds (2.95%)
NOTE: total nodes 8 (32 cores)
NOTE: total memory 250.11G
NOTE: memory 64.41M (0.03%)
NOTE: The Cloud Analytic Services server processed the request in 29.974952 seconds.
89 run;
…
..
If we look at the Hadoop cluster resource manager for the corresponding job execution, it shows that the application executed with the SAS EP.
If we look at the log from an executed mapper process, it also shows that the SAS EP was used while feeding data to CAS.
……..
log4j:WARN No appenders could be found for logger (com.sas.hadoop.ep.TaskAgent$ShutdownHook).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Log Type: stdout
Log Upload Time: Tue Mar 13 16:26:31 -0400 2018
Log Length: 195
20180313:16.26.22.67: 00000010:WARNING: [01S02]Current catalog set to SASEP (0x80fff8bd)
20180313:16.26.22.79: 00000017:NOTE: All Embedded Process DS2 execution instances completed with SUCCESS.
Log Type: syslog
Log Upload Time: Tue Mar 13 16:26:31 -0400 2018
Log Length: 9885
Showing 4096 bytes of 9885 total. Click here for the full log.
22,666 INFO [main] com.sas.hadoop.ep.EPNativeInterface: SAS Embedded Process task is now connected.
2018-03-13 16:26:22,699 INFO [main] com.sas.hadoop.ep.EPNativeInterface: DS2 program is now prepared and ready for execution.
2018-03-13 16:26:22,702 INFO [main] com.sas.hadoop.ep.EPNativeInterface: Task ID 1 does not write output metadata.
2018-03-13 16:26:22,712 INFO [DonAlvaro] com.sas.hadoop.ep.Alvaro: Alvaro has been invoked to guide current task execution.
…….
Data filtering is supported when loading a SAS HDMD table into CAS using the SAS Data Connector for Hadoop or the SAS Data Connect Accelerator for Hadoop. The following SAS log describes the data load from an HDMD table into CAS with a data filter condition.
73   proc casutil;
NOTE: The UUID '418734c4-1480-d540-9a20-a001a79fc762' is connected using session MYSESSION.
74      load casdata="deptm" casout="deptm"
75           incaslib="MyHdmd" outcaslib="MyHdmd"
76           options={where="DeptId > 200 "}
77           replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services made the external data from deptm available as table DEPTM in caslib MyHdmd.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE:       real time               29.167724 seconds
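In source form, the filter is passed through the options= parameter of the CASUTIL LOAD statement, as extracted from the log above:

proc casutil;
   /* WHERE filter is pushed into the load via the options= parameter */
   load casdata="deptm" casout="deptm"
        incaslib="MyHdmd" outcaslib="MyHdmd"
        options={where="DeptId > 200"}
        replace;
run;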
You can save a CAS table as a SAS HDMD table, written as a text file with the default "^A" column delimiter. Appending data from a CAS table to an existing HDMD table is not supported.
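In source form, the save uses the CASUTIL SAVE statement, as extracted from the log that follows:

proc casutil;
   /* Save the in-memory CAS table as a new HDMD table in HDFS */
   save casdata="deptm" casout="deptm_new"
        incaslib="MyHdmd" outcaslib="MyHdmd" replace;
run;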
The following SAS log describes saving data from a CAS table to a SAS HDMD table using the Data Connect Accelerator.
.....
73   proc casutil;
NOTE: The UUID 'bfdc4bbe-27ee-a947-864d-4dacf5ed31b2' is connected using session MYSESSION.
74      save casdata="deptm" casout="deptm_new"
75           incaslib="MyHdmd" outcaslib="MyHdmd" replace ;
NOTE: Executing action 'table.save'.
NOTE: Performing parallel SaveTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services saved the file deptm_new in caslib MYHDMD.
NOTE: Action 'table.save' used (Total process time):
NOTE:       real time               30.204765 seconds
.....
......
In Hadoop HDFS, notice a new folder containing a list of data files for the saved HDMD table.
[hadoop@sascdh01 ~]$ hadoop fs -ls /user/utkuma/hdmd/data/deptm_new
Found 3 items
-rw-r--r--   3 utkuma sasusers         69 2018-03-14 15:35 /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-09-m-00000-SWCH-M00000-T1.dlv
-rw-r--r--   3 utkuma sasusers        138 2018-03-14 15:35 /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-10-m-00002-SWCH-M00002-T0.dlv
-rw-r--r--   3 utkuma sasusers        138 2018-03-14 15:35 /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-10-m-00005-SWCH-M00005-T1.dlv
The content of a data file shows that it is a delimited text file with the default "^A" column delimiter (a non-printing character, so the columns appear to run together).
[hadoop@sascdh01 ~]$ hadoop fs -cat /user/utkuma/hdmd/data/deptm_new/sasdata-2018-03-14-15-35-09-m-00000-SWCH-M00000-T1.dlv
323Accounts
324HR
325Sales
326Payroll
327MArketing
328Purchase
Note: To create XML-based metadata for an HDFS data file, you must have SAS/ACCESS Interface to Hadoop on the SAS client (for example, SAS Studio) to access Hadoop using the Hadoop engine.