
How to read SAS Scalable Performance Data Server tables on the Hadoop Distributed File System


In my last article we learned how to create a SAS Scalable Performance Data Server (SPD Server) table. In this post, we will explore how to read that SPD Server table stored on the Hadoop Distributed File System (HDFS).

 

The Hadoop environment I have access to is small and not a candidate for running official benchmarks. I am testing with Cloudera CDH 5.4.5 on 5 Linux servers (RHEL 6), with 4 of the servers acting as data nodes. Each server has 64 GB of RAM and an Intel(R) Xeon(R) CPU E7-4880 v2 @ 2.50GHz (4 cores), along with 251 gigabytes of local storage; for HDFS, we have 3.0 terabytes (an NFS mount).

 

Suffice it to say that real-world Hadoop clusters are much bigger. The customers I have worked with tend to have 100 to 1,000 data nodes, and in these environments source and target tables run to hundreds of gigabytes, if not terabytes.

 

Uncompressed data

When SPD Server data is not compressed, we can push our WHERE clauses down to MapReduce, provided the SAS administrator has added the HADOOPACCELWH parameter to the SPDSSERV.PARM configuration file. We can verify where a WHERE clause is processed by setting the SPD Server macro variable SPDSWDEB=YES.

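For reference, here is a minimal sketch of the library assignment and the debugging macro variable; the domain name, host, port, and user below are placeholders rather than the values used in these tests (the mapping of the domain to its HDFS path is configured by the SPD Server administrator):

/* Assign an SPD Server libref; all connection values are placeholders */
LIBNAME MYSPDS SASSPDS 'hdfsdomain'
        SERVER=spdshost.5400
        USER='myuserid'
        PROMPT=YES;

/* Ask SPD Server to report where each WHERE clause is evaluated */
%LET SPDSWDEB=YES;

With SPDSWDEB=YES in effect, the log shows whether the WHERE clause is a candidate for processing on the Hadoop cluster:
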
105       +DATA MYSPDS.&source.3;
106       +   SET MYSPDS.&source;
107       +   WHERE joinvalue < 3;
whinit: WHERE (joinvalue<3)
whinit: wh-tree presented
           /-NAME = [joinvalue]
 --CLT----|
           \-LITN = [3]
WHERE processing Hadoop cluster optimization candidate
whinit returns: ALL EVAL6
108       +RUN;
WHERE processing is optimized on the Hadoop cluster
NOTE: There were 9375485 observations read from the data set SOURCE.TESTDATA.
      WHERE joinvalue<3;
NOTE: The data set SOURCE.TESTDATA3 has 9375485 observations and 204 variables.
NOTE: DATA statement used (Total process time):
      real time           10:55.34
      cpu time            58.80 seconds

111       +PROC SQL;
112       +   CREATE TABLE MYSPDS.&source._sql AS
113       +   SELECT x1, x3, x5, joinvalue FROM source.&source
114       +   WHERE joinvalue < 3;
whinit: WHERE (joinvalue<3)
whinit: wh-tree presented
           /-NAME = [joinvalue]
 --CLT----|
           \-LITN = [3]
WHERE processing Hadoop cluster optimization candidate
whinit returns: ALL EVAL6
WHERE processing is optimized on the Hadoop cluster
NOTE: Table SOURCE.TESTDATA_SQL created, with 9375485 rows and 4 columns.
115       +QUIT;
NOTE: PROCEDURE SQL used (Total process time):
      real time           9:36.21
      cpu time            2.72 seconds

All SAS In-Memory technologies will “front load” the SPD Server data into the SAS High-Performance Analytics Server. By “front load” we mean that the data must leave the Hadoop cluster and flow through the SAS server into the memory of the SAS High-Performance Analytics Server.

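For context, the HP procedures locate the analytics cluster through grid options set in the SAS session; a minimal sketch, with the head-node name and TKGrid install path as placeholders:

/* Point the HP procedures at the analytics cluster; both values are placeholders */
OPTIONS SET=GRIDHOST="hpa-head-node.example.com";
OPTIONS SET=GRIDINSTALLLOC="/opt/TKGrid";

The HPLOGISTIC log below shows the front-loaded data being distributed to the 4 worker nodes:
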
153       +PROC HPLOGISTIC DATA=MYSPDS.&source;
154       +PERFORMANCE NODES=ALL DETAILS;
155       +CLASS j / PARAM= GLM;
156       +     WEIGHT c98 ;
157       +MODEL y (EVENT = "0" ) = /  LINK= LOGIT;
158       +RUN;

NOTE: No explanatory variables have been specified.
NOTE: The HPLOGISTIC procedure is executing in the distributed computing environment with 4 worker nodes.
NOTE: You are modeling the probability that y='0'.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 62500000 observations read from the data set MYSPDS.TESTDATA.
NOTE: The PROCEDURE HPLOGISTIC printed pages 3-4.
NOTE: PROCEDURE HPLOGISTIC used (Total process time):
      real time           12:05.23
      cpu time            10.08 seconds

 

Compressed data

When SPD Server data is compressed, we cannot push our WHERE clauses down to MapReduce. Even so, the compressed DATA step test ran 4:34.05 quicker in real time, and the compressed PROC SQL test ran 4:37.70 quicker. These quicker run times are a reflection of the small Hadoop cluster I have access to; with large Hadoop clusters this may not be the case.

 

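Compression is requested with the COMPRESS=BINARY data set option, as the log below shows. To check whether an existing SPD Server table is already compressed (and therefore whether WHERE pushdown will be skipped), PROC CONTENTS reports the table's compression attribute; a minimal sketch, assuming the table created in the test below:

/* The Compressed attribute in the output indicates whether pushdown is possible */
PROC CONTENTS DATA=MYSPDS.testdata3;
RUN;

The compressed-table runs follow:
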
108       +DATA MYSPDS.&source.3(COMPRESS=BINARY);
109       +   SET MYSPDS.&source;
110       +   WHERE joinvalue < 3;
whinit: WHERE (joinvalue<3)
whinit: wh-tree presented
           /-NAME = [joinvalue]
 --CLT----|
           \-LITN = [3]
whinitj: cannot do where pushdown; table compressed
whinit returns: ALL EVAL2
111       +RUN;
NOTE: There were 9375485 observations read from the data set MYSPDS.TESTDATA.
      WHERE joinvalue<3;
NOTE: The data set MYSPDS.TESTDATA3 has 9375485 observations and 204 variables.
NOTE: Compressing data set MYSPDS.TESTDATA3 decreased size by 33.37 percent.
NOTE: DATA statement used (Total process time):
      real time           6:21.29
      cpu time            1:00.19
114       +PROC SQL;
115       +   CREATE TABLE MYSPDS.&source._sql(COMPRESS=BINARY) AS
116       +   SELECT x1, x3, x5, joinvalue FROM MYSPDS.&source
117       +   WHERE joinvalue < 3;
whinit: WHERE (joinvalue<3)
whinit: wh-tree presented
           /-NAME = [joinvalue]
 --CLT----|
           \-LITN = [3]
whinitj: cannot do where pushdown; table compressed
whinit returns: ALL EVAL2
NOTE: Compressing data set MYSPDS.TESTDATA_SQL decreased size by 11.12 percent.
NOTE: Table MYSPDS.TESTDATA_SQL created, with 9375485 rows and 4 columns.
118       +QUIT;
NOTE: PROCEDURE SQL used (Total process time):
      real time           4:58.51
      cpu time            2.77 seconds

For our PROC HPLOGISTIC test, the compressed SPD Server table's real time of 10:03.81 was 2:01.42 quicker than the uncompressed SPD Server real time of 12:05.23:

156       +PROC HPLOGISTIC DATA=MYSPDS.&source;
157       +PERFORMANCE NODES=ALL DETAILS;
158       +CLASS j / PARAM= GLM;
159       +     WEIGHT c98 ;
160       +MODEL y (EVENT = "0" ) = /  LINK= LOGIT;
161       +RUN;
NOTE: No explanatory variables have been specified.
NOTE: The HPLOGISTIC procedure is executing in the distributed computing environment with 4 worker nodes.
NOTE: You are modeling the probability that y='0'.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 62500000 observations read from the data set MYSPDS.TESTDATA.
NOTE: The PROCEDURE HPLOGISTIC printed pages 3-4.
NOTE: PROCEDURE HPLOGISTIC used (Total process time):
      real time           10:03.81
      cpu time            9.76 seconds

 

 

Comments

Thanks a lot for sharing. This is very insightful and informative. Could you please elaborate a little further on the “front loading” processing required when SPD Server is feeding an HPA procedure on Hadoop (TKGrid)?

 

| By “front load” we mean the data must leave the Hadoop cluster and flow through the SAS server into the memory of the SAS High-Performance Analytics Server.

 

Is this the same as “through the client” access, namely that the input data must be processed by the calling SAS workspace (SPDS?) session, instead of an “along the cluster” process like Hadoop (SPD Server on HDFS) --> Hadoop (TKGrid)?

 

 

Good question @ronan. Yes, “front loading” means the SPD Server data flows through the SAS server (client) into the memory of the SAS High-Performance Analytics Server. Note that there is no processing of the data as it flows through the SAS server (client); it is simply the path the data has to take to end up in the memory of the SAS High-Performance Analytics Server.

That's clear, thanks.
