In a recent article, we learned which Hadoop distributions are supported as well as how the SAS Scalable Performance Data (SPD) Engine stores data. In this post, part of the SAS Data Management for Hadoop article series, we will explore how to create SPD Engine tables on the Hadoop Distributed File System (HDFS).
Let’s start by reviewing the options in the following SPD Engine LIBNAME statement:
LIBNAME myspde SPDE '/user/sasss1' HDFSHOST=DEFAULT;

options set=SAS_HADOOP_CONFIG_PATH=
options set=SAS_HADOOP_JAR_PATH=
In the third maintenance release for SAS 9.4, you can now request parallel processing for all Write operations in HDFS. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data. To request parallel processing for Write operations, use the PARALLELWRITE= LIBNAME statement option or the PARALLELWRITE= data set option. Note that the data set option overrides the LIBNAME statement option. Also note that there are restrictions when writing data in parallel: we cannot use parallel processing for a Write operation and also request to create an index, nor can we combine it with BY-group processing or sorting.
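As a minimal sketch, the LIBNAME statement from above could request parallel Write operations for every table in the library like this (the path and host are the same illustrative values used earlier):

```sas
/* Request parallel Writes for all tables created through this libref. */
LIBNAME myspde SPDE '/user/sasss1' HDFSHOST=DEFAULT PARALLELWRITE=YES;
```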
PARTSIZE= is the size of the data partition file in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default is megabytes; that is, PARTSIZE=64 is the same as PARTSIZE=64M. Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but it is referenced as one logical file. To specify the partition size, use the PARTSIZE= LIBNAME statement option or the PARTSIZE= data set option. Note that the data set option overrides the LIBNAME statement option.
This is an example of creating an SPD Engine table using the data set options PARALLELWRITE= and PARTSIZE=.
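A sketch of such a DATA step follows; the source table work.cars_large is an illustrative assumption, while the table name testdata and the 4 GB partition size match the results discussed below:

```sas
/* Create an SPD Engine table on HDFS, writing in parallel   */
/* with a 4 GB partition size. Source table is hypothetical. */
DATA myspde.testdata (PARALLELWRITE=YES PARTSIZE=4G);
   SET work.cars_large;
RUN;
```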
When we look on HDFS (see Figure 1), we see that two items were created: an SPD Engine metadata file, testdata.mdf.0.0.0.spds9, and a directory whose name is the table name with “_spde” appended, testdata_spde.
Figure 1. Results of the Hadoop command: hadoop fs –ls /user/sasss1
When we look in the directory /user/sasss1/testdata_spde (see Figure 2), we see the data partition files for our table.
Figure 2. Results of the Hadoop command: hadoop fs –ls /user/sasss1/testdata_spde
The target table we created is 20GB in size and we set the partition size to 4GB. You will notice in Figure 2 that four partition files are 4GB each, while four other partition files are less than 1GB. This is because we told the SPD Engine to write the data in parallel. The SAS server I ran this code on has four cores, so four threads write to HDFS in parallel. The first set of four threads wrote data in parallel until each hit the 4GB partition size, at which point 16GB of the 20GB table had been written. The second set of four threads had only 4GB left to write among them, which is why four data partitions are close to 1GB each.
When indexing an SPD Engine table stored on HDFS, remember to use the LIBNAME statement option or data set option PARALLELWRITE=NO. PARALLELWRITE=NO specifies that parallel processing occurs only if a Read operation includes WHERE processing; this is the default behavior for the SPD Engine. Note that to enable asynchronous parallel index creation, use the ASYNCINDEX=YES data set option when you create the SPD Engine table.
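A sketch of an indexed table created this way might look like the following; the index columns and the source table are illustrative assumptions:

```sas
/* Create the table with two simple indexes built asynchronously.  */
/* PARALLELWRITE=NO is required because indexing and parallel      */
/* Writes cannot be combined. Column names are hypothetical.       */
DATA myspde.testdata (PARALLELWRITE=NO ASYNCINDEX=YES INDEX=(custid saledate));
   SET work.cars_large;
RUN;
```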
To improve I/O performance, consider setting a different SPD Engine I/O block size. Keep in mind that the larger the block size, the fewer I/O operations. For example, when reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance; however, when retrieving a subset of the data, such as with WHERE processing, a smaller block size performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option or the IOBLOCKSIZE= data set option when you create the SPD Engine table.
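As a sketch, a table intended mostly for full-table reads could be created with a larger block size; the specific value and source table here are illustrative assumptions:

```sas
/* Larger I/O block size (in bytes) to reduce the number of I/O    */
/* operations for large sequential reads. 1 MB is hypothetical.    */
DATA myspde.testdata (IOBLOCKSIZE=1048576 PARTSIZE=4G);
   SET work.cars_large;
RUN;
```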
A note about compression and encryption: both are supported using data set options. Encryption covers data in transit as well as at rest. Note that you cannot both compress and encrypt a table; you can only compress or encrypt. If you compress or encrypt, the SPD Engine WHERE processing cannot be pushed down to MapReduce.
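Two separate sketches, since compression and encryption cannot be combined; the table names, source table, and encryption key are illustrative assumptions:

```sas
/* Compressed SPD Engine table (no encryption possible here).      */
DATA myspde.testdata_comp (COMPRESS=YES);
   SET work.cars_large;
RUN;

/* AES-encrypted SPD Engine table (no compression possible here).  */
/* The key value is hypothetical.                                  */
DATA myspde.testdata_enc (ENCRYPT=AES ENCRYPTKEY=secretkey);
   SET work.cars_large;
RUN;
```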
Speaking of reading SPD Engine data stored on HDFS, stay tuned, in my next post we will explore how to read a SPD Engine table on HDFS.
Here are links to other posts in the SAS Data Management for Hadoop series for reference: