In my last article we learned which Hadoop distributions are supported as well as how the SAS Scalable Performance Data (SPD) Server stores data. In this post, we will explore how to create SPD Server tables on the Hadoop Distributed File System (HDFS).
Things to consider when creating SPD Server tables on Hadoop:
WHERE clause pushdown to MapReduce is supported when the data is neither compressed nor encrypted
SPD Server Access Control Lists (ACLs) for securing the data are supported
SPD Server Cluster Tables, a personal favorite format of mine, are supported
WHERE clause pushdown for SPD Server Cluster Tables is on the roadmap
SAS administrator
In order for us to store SPD Server data on HDFS, the SAS administrator needs to add a domain to our LIBNAMES.PARM configuration file as well as a few parameters to the SPDSSERV.PARM configuration file.
Let’s review Figure 1. The first three domains point to a clustered file system in a SAS Grid Manager environment. The fourth domain, LIBNAME=spdshdfs, points to the HDFS path we use to store our data. As with a clustered file system or SAN storage, the only Hadoop user ID that needs READ, WRITE, ALTER, and CONTROL access to the HDFS path is the user ID that started SPD Server.
Figure 1. LIBNAMES.PARM
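Because the actual file appears only as an image, here is a hedged sketch of what an HDFS-backed domain entry in LIBNAMES.PARM can look like. The domain name matches the post, but the PATHNAME value is an illustrative assumption, not the path from the figure:

```sas
/* Illustrative LIBNAMES.PARM entry; PATHNAME is an assumed HDFS path */
LIBNAME=spdshdfs
   PATHNAME=/user/sas/spds/speedy
   HADOOP=YES;
```

The HADOOP=YES option is what tells SPD Server that this domain's path lives on HDFS rather than on a local or clustered file system.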
Let’s review the parameters of the SPDSSERV.PARM file in Figure 2.
HADOOPXML= and HADOOPJAR= point to the XML and JAR files that were created when we ran the SAS Deployment Wizard for SAS/ACCESS Interface to Hadoop.
HADOOPWORKPATH= specifies the path to the directory in HDFS that stores the temporary results of the MapReduce output.
HADOOPACCELJVER= specifies the version of Java on the Hadoop cluster. Valid values are 1.6, 1.7, and 1.8.
HADOOPACCELWH instructs SPD Server to attempt to push all WHERE clauses down to MapReduce.
Figure 2. SPDSSERV.PARM
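Since Figure 2 is also an image, the Hadoop-related parameters in SPDSSERV.PARM might look like the following hedged sketch. Every path and the Java version are illustrative assumptions for your own environment, not values taken from the figure:

```sas
/* Illustrative SPDSSERV.PARM entries; all paths are assumed */
HADOOPXML=/opt/sas/hadoopcfg/hadoop_config.xml;  /* XML config built by the deployment */
HADOOPJAR=/opt/sas/hadoopjars;                   /* directory of Hadoop client JARs */
HADOOPWORKPATH=/tmp/spdswork;                    /* HDFS path for temporary MapReduce output */
HADOOPACCELJVER=1.7;                             /* Java version on the Hadoop cluster */
HADOOPACCELWH;                                   /* attempt WHERE clause pushdown to MapReduce */
```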
SAS programmers
We will start by reviewing our SPD Server LIBNAME statement:
LIBNAME MYSPDS SASSPDS IP=YES DISCONNECT=YES LIBGEN=YES
HOST='localhost'
SERV='5400'
SCHEMA='spdshdfs'
USER='sasss1'
PASSWORD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
NOTE: User sasss1(ACL Group ADMGRP) connected to SPD(LAX) 5.2 HF(1) server at 10.36.150.120.
NOTE: Libref MYSPDS was successfully assigned as follows:
Engine: SASSPDS
Physical Name: :29121/user/sas/spds/speedy/
MYSPDS is the libref we will reference in our SAS code to process the SPD Server data stored on HDFS
SASSPDS is the engine SPD Server uses to process data stored on HDFS
HOST='localhost' is the name of the computer where SPD Server is running
SCHEMA='spdshdfs' is referencing the SPD Server domain defined to LIBNAMES.PARM, see figure 1. The path defined to this domain is the location on HDFS that we will store our SPD Server data
USER= is our SPD Server user id
PASSWORD= is our SPD Server password
IP=YES enables implicit SQL pass-through, which attempts to push PROC SQL processing down to SPD Server
DISCONNECT=YES specifies when to close network connections between the SAS client and the SPD Server. Closure can occur after all librefs are cleared or at the end of a SAS session
The LIBGEN=YES option is used with explicit SQL connection statements. When you set LIBGEN=YES, SPD Server generates additional domain connections that enable you to perform SQL joins across different SPD Server domains.
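To show where LIBGEN=YES matters, here is a hedged sketch of an explicit SQL pass-through connection to SPD Server. The host, port, user, and table names are the same placeholders used above, and the exact connection option set may vary by release:

```sas
/* Illustrative explicit pass-through; connection values are placeholders */
PROC SQL;
   CONNECT TO SASSPDS (DBQ='spdshdfs'
                       HOST='localhost' SERV='5400'
                       USER='sasss1' PASSWD='XXXXXXXX');
   /* this statement runs entirely inside SPD Server */
   EXECUTE (CREATE TABLE testdata2 AS
            SELECT * FROM testdata WHERE y=1) BY SASSPDS;
   DISCONNECT FROM SASSPDS;
QUIT;
```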
Creating a compressed SPD Server Table:
DATA MYSPDS.&source (COMPRESS=BINARY);
   ARRAY x{100};
   ARRAY c{100};
   DO i=1 TO 625000 * &Num_GB;
      DO j=1 TO 100;
         x{j}=RANUNI(2);
         c{j}=INT(RANUNI(1)*4);
      END;
      y=INT(RANUNI(1)*2);
      joinvalue=INT(RANUNI(1)*20);
      OUTPUT;
   END;
RUN;
NOTE: The data set MYSPDS.TESTDATA has 62500000 observations and 204 variables.
NOTE: Compressing data set MYSPDS.TESTDATA decreased size by 33.35 percent.
NOTE: DATA statement used (Total process time):
real time 24:20.03
cpu time 10:26.28
Creating an uncompressed SPD Server table using PROC APPEND:
PROC APPEND BASE=SPDS.TESTDATA data=BDAT.TESTDATA;
RUN;
NOTE: Appending BDAT.TESTDATA to SPDS.TESTDATA.
NOTE: BASE data set does not exist. DATA file is being copied to BASE file.
INFO: Engine's block-read method cannot be used because the file is compressed
NOTE: There were 62500000 observations read from the data set BDAT.TESTDATA.
NOTE: The data set SPDS.TESTDATA has 62500000 observations and 204 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 11:34.68
cpu time 9:29.99
To improve I/O performance, consider setting a different SPD Server I/O block size. Keep in mind that the larger the block size (for example, 64K), the less I/O. When reading a data set, the block size can significantly affect performance: when retrieving a large percentage of the data, a larger block size improves performance, but when retrieving a subset of the data, such as with WHERE processing, a smaller block size (for example, 8K) performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option, or with the IOBLOCKSIZE= data set option when you create the SPD Server table.
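As a hedged sketch of the data set option form (the 64K value and table names are illustrative, not a recommendation for your workload):

```sas
/* Illustrative: 65536-byte I/O blocks favor large sequential reads */
DATA MYSPDS.TESTDATA_BIGBLK (IOBLOCKSIZE=65536);
   SET MYSPDS.TESTDATA;
RUN;
```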
SPD Server uses access control lists (ACLs) and SPD Server user IDs to secure domain resources. You obtain your user ID and password from your SPD Server administrator.
SPD Server supports ACL groups, which are similar to UNIX groups. SPD Server administrators can associate an SPD Server user with as many as 32 different ACL groups.
ACL file security is turned on by default when an administrator starts an SPD Server. ACL permissions affect all SPD Server resources, including domains, tables, table columns, catalogs, catalog entries, and utility files. When ACL file security is enabled, SPD Server grants access rights only to the owner (creator) of an SPD Server resource. Resource owners can use PROC SPDO to grant ACL permissions to a specific group (ACL group) or to all SPD Server users.
The resource owner can use the following properties to grant ACL permissions to all SPD Server users:
READ
Universal Read access to the resource (read or query)
WRITE
Universal Write access to the resource (append to or update)
ALTER
Universal Alter access to the resource (rename, delete, or replace a resource, and add or delete indexes associated with a table)
The resource owner can use the following properties to grant ACL permissions to a named ACL group:
GROUPREAD
Group Read access to the resource (read or query)
GROUPWRITE
Group Write access to the resource (append to or update)
GROUPALTER
Group Alter access to the resource (rename, delete, or replace a resource, and add or delete indexes associated with a table)
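As a hedged sketch of granting these permissions with PROC SPDO (the table name is illustrative, and statement syntax may vary slightly by SPD Server release):

```sas
PROC SPDO LIBRARY=MYSPDS;
   SET ACLUSER sasss1;                     /* act as the resource owner */
   ADD ACL TESTDATA;                       /* create an ACL for the table */
   MODIFY ACL TESTDATA / READ GROUPREAD;   /* universal read plus group read */
QUIT;
```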
A note about compression and encryption: both are supported using data set options. Encryption protects data in transit as well as at rest. Note that you cannot both compress and encrypt a table; you can only compress or encrypt. And if you compress or encrypt a table, WHERE processing for that table cannot be pushed down to MapReduce.
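As a hedged sketch of the two mutually exclusive data set options (the table names are illustrative, and the exact encryption option values available depend on your SPD Server release):

```sas
/* Compressed table: WHERE clauses will not be pushed down to MapReduce */
DATA MYSPDS.SALES_C (COMPRESS=BINARY);
   SET WORK.SALES;
RUN;

/* Encrypted table: WHERE clauses will also not be pushed down */
DATA MYSPDS.SALES_E (ENCRYPT=YES);
   SET WORK.SALES;
RUN;
```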
Speaking of reading SPD Server data stored on HDFS: stay tuned. In my next post, we will explore how to read an SPD Server table on HDFS.