In my last article we learned which Hadoop distributions are supported as well as how the SAS Scalable Performance Data (SPD) Server stores data. In this post, we will explore how to create SPD Server tables on the Hadoop Distributed File System (HDFS).
Things to consider when creating SPD Server tables on Hadoop:
SAS administrator
In order for us to store SPD Server data on HDFS, the SAS administrator needs to add a domain to our LIBNAMES.PARM configuration file as well as a few parameters to the SPDSSERV.PARM configuration file.
Let’s review figure 1. The first three domains point to a cluster file system in a SAS Grid Manager environment. The fourth domain, LIBNAME=spdshdfs, is pointing to the HDFS path we use to store our data. Like a clustered file system or SAN storage, the only Hadoop user id needing READ, WRITE, ALTER, CONTROL access to the HDFS path is the user id that started SPD Server.
Figure 1. LIBNAMES.PARM
Let’s review the parameters of the SPDSERV.PARM file in figure 2.
Figure 2. SPDSERV.PARM
SAS programmers
We will start by reviewing our SPD Server LIBNAME statement:
LIBNAME source SASSPDS IP=YES DISCONNECT=YES LIBGEN=YES
HOST='localhost'
SERV='5400'
SCHEMA='spdshdfs'
USER='sasss1'
PASSWORD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
NOTE: User sasss1(ACL Group ADMGRP) connected to SPD(LAX) 5.2 HF(1) server at 10.36.150.120.
NOTE: Libref SOURCE was successfully assigned as follows:
Engine: SASSPDS
Physical Name: :29121/user/sas/spds/speedy/
Creating a compressed SPD Server Table:
DATA MYSPDS.&source (COMPRESS=BINARY);
ARRAY x{100};
ARRAY c{100};
DO i=1 TO 625000 * &Num_GB;
DO j=1 TO 100;
x{j}=RANUNI(2);
c{j}= INT(RANUNI (1)*4);
END;
y= INT(RANUNI (1)*2);
joinvalue=INT(RANUNI (1)*20);
OUTPUT;
END;
RUN;
NOTE: The data set MYSPDS.TESTDATA has 62500000 observations and 204 variables.
NOTE: Compressing data set MYSPDS.TESTDATA decreased size by 33.35 percent.
NOTE: DATA statement used (Total process time):
real time 24:20.03
cpu time 10:26.28
Creating an un-compressed SPD Server Table using PROC APPEND:
PROC APPEND BASE=SPDS.TESTDATA data=BDAT.TESTDATA; RUN; NOTE: Appending BDAT.TESTDATA to SPDS.TESTDATA. NOTE: BASE data set does not exist. DATA file is being copied to BASE file. INFO: Engine's block-read method cannot be used because the file is compressed
NOTE: There were 62500000 observations read from the data set BDAT.TESTDATA. NOTE: The data set SPDS.TESTDATA has 62500000 observations and 204 variables. NOTE: PROCEDURE APPEND used (Total process time): real time 11:34.68 cpu time 9:29.99
To improve I/O operation performance, consider setting a different SPD Server I/O block size. Keep in mind that the larger the block size (i.e. 64K) the less I/O. For example, when reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance. However, when retrieving a subset of the data such as with WHERE processing, a smaller block size (i,e. 8K) performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option, or the IOBLOCKSIZE= data set option when you create the SPD Server table.
SPD Server uses access control lists (ACLs) and SPD Server user IDs to secure domain resources. You obtain your user ID and password from your SPD Server administrator.
SPD Server supports ACL groups, which are similar to UNIX groups. SPD Server administrators can associate an SPD Server user with as many as 32 different ACL groups.
ACL file security is turned on by default when an administrator starts an SPD Server. ACL permissions affect all SPD Server resources, including domains, tables, table columns, catalogs, catalog entries, and utility files. When ACL file security is enabled, SPD Server grants access rights only to the owner (creator) of an SPD Server resource. Resource owners can use PROC SPDO to grant ACL permissions to a specific group (ACL group) or to all SPD Server users.
The resource owner can use the following properties to grant ACL permissions to all SPD Server users:
READ
Universal Read access to the resource (read or query)
WRITE
Universal Write access to the resource (append to or update)
ALTER
Universal Alter access to the resource (rename, delete, or replace a resource, and add or delete indexes associated with a table)
The resource owner can use the following properties to grant ACL permissions to a named ACL group:
GROUPREAD
Group Read access to the resource (read or query)
GROUPWRITE
Group Write access to the resource (append to or update)
GROUPALTER
Group Alter access to the resource (rename, delete, or replace a resource, and add or delete indexes associated with a table)
A note about compression and encryption: both are supported using a data set option. Encryption is for data in transit as well as at rest. Note, you cannot compress and encrypt a table; you can only compress or encrypt. If you compress or encrypt, the SPD Server table WHERE process cannot be pushed down to MapReduce.
Speaking of reading SPD Server data stored on HDFS, stay tuned, in my next post we will explore how to read a SPD Server table on HDFS.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.