
How to create SAS Scalable Performance Data Server tables on the Hadoop Distributed File System

by SAS Employee SteveSober on 02-29-2016 11:11 AM

In my last article we learned which Hadoop distributions are supported as well as how the SAS Scalable Performance Data (SPD) Server stores data. In this post, we will explore how to create SPD Server tables on the Hadoop Distributed File System (HDFS).

 

Things to consider when creating SPD Server tables on Hadoop:

  • When the data is neither encrypted nor compressed, WHERE clause pushdown to MapReduce is supported
  • SPD Server Access Control Lists (ACLs) to secure the data are supported
  • SPD Server Cluster Tables, a personal favorite format of mine, are supported
  • WHERE clause pushdown for SPD Server Cluster Tables is on the roadmap

SAS administrator

To store SPD Server data on HDFS, the SAS administrator needs to add a domain to our LIBNAMES.PARM configuration file as well as a few parameters to the SPDSSERV.PARM configuration file.

Let’s review figure 1. The first three domains point to a clustered file system in a SAS Grid Manager environment. The fourth domain, LIBNAME=spdshdfs, points to the HDFS path we use to store our data. As with a clustered file system or SAN storage, the only Hadoop user ID that needs READ, WRITE, ALTER, and CONTROL access to the HDFS path is the user ID that started SPD Server.

 

Figure 1. LIBNAMES.PARM (image: SPDS.CreateTable.Figure1.png)
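
As a sketch of what such a domain definition might look like (the domain names, paths, and the HADOOP=YES flag below are illustrative assumptions, not taken from figure 1; the exact option names can vary by SPD Server release):

```sas
/* LIBNAMES.PARM -- illustrative sketch only.              */
/* The first domains point at a clustered file system;     */
/* the last one points at an HDFS path. All paths and the  */
/* HADOOP=YES flag are assumptions for illustration.       */
LIBNAME=spdsdata  PATHNAME=/sasgrid/spds/data;
LIBNAME=spdswork  PATHNAME=/sasgrid/spds/work;
LIBNAME=spdsuser  PATHNAME=/sasgrid/spds/user;
LIBNAME=spdshdfs  PATHNAME=/user/sas/spds/speedy HADOOP=YES;
```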

 

Let’s review the parameters of the SPDSSERV.PARM file in figure 2.

  1. HADOOPXML= and HADOOPJAR= point to the XML and JAR files that were created when we ran the SAS Deployment Wizard for SAS/ACCESS Interface to Hadoop.
  2. HADOOPWORKPATH= specifies the path to the directory in the Hadoop Distributed File System that stores the temporary results of the MapReduce output.
  3. HADOOPACCELJVER= is the version of Java on the Hadoop cluster. Valid values are 1.6, 1.7, and 1.8.
  4. HADOOPACCELWH specifies that SPD Server should attempt to push all WHERE clauses down to MapReduce.
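
Put together, the four parameters above might look like this in SPDSSERV.PARM (the paths and Java version here are illustrative assumptions, not values from figure 2):

```sas
/* SPDSSERV.PARM fragment -- illustrative sketch only.     */
/* All paths below are hypothetical placeholders.          */
HADOOPXML=/opt/sas/hadoop/conf/hadoop_config.xml;
HADOOPJAR=/opt/sas/hadoop/jars;
HADOOPWORKPATH=/tmp/spdswork;
HADOOPACCELJVER=1.7;
HADOOPACCELWH;
```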

Figure 2. SPDSSERV.PARM (image: SPDS.CreateTable.Figure2.png)

 

SAS programmers

We will start by reviewing our SPD Server LIBNAME statement:

 

LIBNAME MYSPDS SASSPDS IP=YES DISCONNECT=YES LIBGEN=YES
   HOST='localhost'
   SERV='5400'
   SCHEMA='spdshdfs'
   USER='sasss1'
   PASSWORD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;

NOTE: User sasss1(ACL Group ADMGRP) connected to SPD(LAX) 5.2 HF(1) server at 10.36.150.120.
NOTE: Libref MYSPDS was successfully assigned as follows:
      Engine:        SASSPDS
      Physical Name: :29121/user/sas/spds/speedy/
  1. MYSPDS is the libref we will reference in our SAS code to process the SPD Server data stored on HDFS
  2. SASSPDS is the engine SPD Server uses to process data stored on HDFS
  3. HOST='localhost' is the name of the computer where SPD Server is running
  4. SCHEMA='spdshdfs' references the SPD Server domain defined in LIBNAMES.PARM (see figure 1). The path defined for this domain is the HDFS location where we will store our SPD Server data
  5. USER= is our SPD Server user ID
  6. PASSWORD= is our SPD Server password
  7. IP=YES enables implicit SQL pass-through, which attempts to push PROC SQL processing down to SPD Server
  8. DISCONNECT=YES specifies when to close network connections between the SAS client and SPD Server. Closure can occur after all librefs are cleared or at the end of a SAS session
  9. LIBGEN=YES is used with explicit SQL connection statements. When you set LIBGEN=YES, SPD Server generates additional domain connections that enable you to perform SQL joins across different SPD Server domains
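
To illustrate item 7, with IP=YES a query like the following would be passed to SPD Server for processing rather than being run by the SAS client. This is a minimal sketch against the TESTDATA table created later in this post, not output from the original article:

```sas
/* With IP=YES on the LIBNAME statement, SPD Server     */
/* attempts to process this whole query server-side.    */
PROC SQL;
   SELECT y,
          COUNT(*) AS row_count
   FROM MYSPDS.TESTDATA
   GROUP BY y;
QUIT;
```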

Creating a compressed SPD Server table:

 

/* TESTDATA and 100 are the values implied by the log notes below */
%LET source=TESTDATA;
%LET Num_GB=100;

DATA MYSPDS.&source (COMPRESS=BINARY);
     ARRAY x{100};
     ARRAY c{100};
     DO i=1 TO 625000 * &Num_GB;
        DO j=1 TO 100;
           x{j}=RANUNI(2);
           c{j}=INT(RANUNI(1)*4);
        END;
        y=INT(RANUNI(1)*2);
        joinvalue=INT(RANUNI(1)*20);
        OUTPUT;
     END;
RUN;
 
NOTE: The data set MYSPDS.TESTDATA has 62500000 observations and 204 variables.
NOTE: Compressing data set MYSPDS.TESTDATA decreased size by 33.35 percent.
NOTE: DATA statement used (Total process time):
      real time           24:20.03    
      cpu time            10:26.28

 

Creating an uncompressed SPD Server table using PROC APPEND:

 

PROC APPEND BASE=SPDS.TESTDATA DATA=BDAT.TESTDATA;
RUN;
 
NOTE: Appending BDAT.TESTDATA to SPDS.TESTDATA.
NOTE: BASE data set does not exist. DATA file is being copied to BASE file.
INFO: Engine's block-read method cannot be used because the file is compressed
 
NOTE: There were 62500000 observations read from the data set BDAT.TESTDATA.
NOTE: The data set SPDS.TESTDATA has 62500000 observations and 204 variables.
NOTE: PROCEDURE APPEND used (Total process time):
 
      real time           11:34.68
      cpu time            9:29.99

 

To improve I/O performance, consider setting a different SPD Server I/O block size. Keep in mind that the larger the block size (for example, 64 KB), the less I/O. When reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance. However, when retrieving a subset of the data, such as with WHERE processing, a smaller block size (for example, 8 KB) performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option, or with the IOBLOCKSIZE= data set option when you create the SPD Server table.
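
As a sketch, either of the following would request a 64 KB block size; the table name is illustrative, and IOBLOCKSIZE= is specified in bytes:

```sas
/* LIBNAME level: applies to tables created through this libref. */
/* Connection options are abbreviated; see the LIBNAME statement */
/* earlier in this post.                                         */
LIBNAME MYSPDS SASSPDS IOBLOCKSIZE=65536
   HOST='localhost' SERV='5400' SCHEMA='spdshdfs'
   USER='sasss1' PASSWORD=XXXXXXXX;

/* Data set level: applies only to the table being created. */
DATA MYSPDS.BIGSCAN (IOBLOCKSIZE=65536);
   SET MYSPDS.TESTDATA;
RUN;
```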

 

SPD Server uses access control lists (ACLs) and SPD Server user IDs to secure domain resources. You obtain your user ID and password from your SPD Server administrator.

 

SPD Server supports ACL groups, which are similar to UNIX groups. SPD Server administrators can associate an SPD Server user with as many as 32 different ACL groups.

 

ACL file security is turned on by default when an administrator starts an SPD Server. ACL permissions affect all SPD Server resources, including domains, tables, table columns, catalogs, catalog entries, and utility files. When ACL file security is enabled, SPD Server grants access rights only to the owner (creator) of an SPD Server resource. Resource owners can use PROC SPDO to grant ACL permissions to a specific group (ACL group) or to all SPD Server users.

 

The resource owner can use the following properties to grant ACL permissions to all SPD Server users:

  • READ - Universal Read access to the resource (read or query)
  • WRITE - Universal Write access to the resource (append to or update)
  • ALTER - Universal Alter access to the resource (rename, delete, or replace a resource, and add or delete indexes associated with a table)

The resource owner can use the following properties to grant ACL permissions to a named ACL group:

  • GROUPREAD - Group Read access to the resource (read or query)
  • GROUPWRITE - Group Write access to the resource (append to or update)
  • GROUPALTER - Group Alter access to the resource (rename, delete, or replace a resource, and add or delete indexes associated with a table)
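
As a sketch of how a table owner might grant group permissions with PROC SPDO (the table name is illustrative, and the exact ACL statement names can vary by SPD Server release):

```sas
/* Grant an ACL group read and write access to TESTDATA.  */
/* MYSPDS is the SPD Server libref assigned earlier in    */
/* this post; the statements follow PROC SPDO's ACL       */
/* command set.                                           */
PROC SPDO LIBRARY=MYSPDS;
   SET ACLUSER;                  /* operate as the connected user */
   ADD ACL TESTDATA;             /* create an ACL for the table   */
   MODIFY ACL TESTDATA / GROUPREAD GROUPWRITE;
QUIT;
```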

 

A note about compression and encryption: both are supported by using a data set option. Encryption protects data in transit as well as at rest. Note that you cannot both compress and encrypt a table; you can only compress or encrypt. If you compress or encrypt, WHERE processing for that SPD Server table cannot be pushed down to MapReduce.
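
As a sketch (the ENCRYPT=YES data set option value shown here is my assumption; check your SPD Server release's documentation for the supported encryption options):

```sas
/* Create an encrypted copy of the table. Because the new  */
/* table is encrypted, WHERE pushdown to MapReduce is not  */
/* available for it.                                       */
DATA MYSPDS.SECURE (ENCRYPT=YES);
   SET MYSPDS.TESTDATA;
RUN;
```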

 

Speaking of reading SPD Server data stored on HDFS: stay tuned. In my next post, we will explore how to read an SPD Server table on HDFS.
