How to create SAS Scalable Performance Data Engine tables on the Hadoop Distributed File System

by SAS Employee SteveSober on 11-30-2015 01:43 PM - edited on 01-19-2016 04:48 PM by Community Manager

In a recent article, we learned which Hadoop distributions are supported as well as how the SAS Scalable Performance Data (SPD) Engine stores data. In this post, part of the SAS Data Management for Hadoop article series, we will explore how to create SPD Engine tables on the Hadoop Distributed File System (HDFS).

 

Let’s start by reviewing the options in the following SPD Engine LIBNAME statement:

 

LIBNAME myspde SPDE '/user/sasss1' HDFSHOST=DEFAULT;
  1. '/user/sasss1' is the HDFS path that our SPD Engine data will be written to.
  2. HDFSHOST=DEFAULT connects to the Hadoop cluster that the following SAS/ACCESS Interface to Hadoop options point to:
options set=SAS_HADOOP_CONFIG_PATH=
options set=SAS_HADOOP_JAR_PATH=
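Putting these pieces together, here is a minimal sketch of a working setup. The two option values below are hypothetical placeholders; substitute your site's actual Hadoop client configuration and JAR directories:

```sas
/* Hypothetical paths -- substitute your site's Hadoop config and JAR locations */
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoopcfg";
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";

/* Assign an SPD Engine library that stores its tables on HDFS */
libname myspde spde '/user/sasss1' hdfshost=default;
```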

In the third maintenance release of SAS 9.4, you can now request parallel processing for all Write operations in HDFS. One thread is used for each CPU on the SAS client machine. For example, if the SAS client machine has eight CPUs, then eight threads are used to write data. To request parallel processing for Write operations, use the PARALLELWRITE= LIBNAME statement option or the PARALLELWRITE= data set option. Note that the data set option overrides the LIBNAME statement option. Also note that there are restrictions when writing data in parallel: you cannot use parallel processing for a Write operation and also request to create an index, nor can you use parallel processing for a Write operation and also request BY-group processing or sorting.
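Both forms can be sketched as follows (SASHELP.CLASS is a standard SAS sample table; the libref matches the earlier LIBNAME example):

```sas
/* Library-level: parallel writes for every table written through this libref */
libname myspde spde '/user/sasss1' hdfshost=default parallelwrite=yes;

/* Table-level: the data set option overrides the LIBNAME option */
data myspde.class (parallelwrite=yes);
   set sashelp.class;
run;
```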

 

PARTSIZE= sets the size of each data partition file in megabytes, gigabytes, or terabytes. If n is specified without M, G, or T, the default unit is megabytes; that is, PARTSIZE=64 is the same as PARTSIZE=64M. Each partition is stored as a separate file with the file extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but it is referenced as one logical file. To set the partition size, use the PARTSIZE= LIBNAME statement option or the PARTSIZE= data set option. Note that the data set option overrides the LIBNAME statement option.
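Both forms look like this in practice (work.source is a hypothetical input table; the sizes are illustrative):

```sas
/* Library-level default: 4 GB partitions for all tables in this libref */
libname myspde spde '/user/sasss1' hdfshost=default partsize=4g;

/* Table-level override: 512 MB partitions for this table only */
data myspde.smallparts (partsize=512m);
   set work.source;   /* hypothetical input table */
run;
```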

 

This is an example of creating an SPD Engine table using the data set options PARALLELWRITE= and PARTSIZE=.

 

[Image: SPDE.Blog3.image1.png]
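The image content is not reproduced here; judging from the log excerpt in the comments below, the DATA step was presumably along these lines. This is a reconstruction, not the author's exact code, and the macro variable value is an assumption chosen to match the 20 GB table described later:

```sas
%let num_gb=20;   /* target table size in GB (assumed) */

/* Write in parallel, splitting the table into 4 GB partition files */
data myspde.testdata (parallelwrite=yes partsize=4g);
   array x{100};
   array c{100};
   do i=1 to 625000 * &num_gb;   /* roughly 625,000 rows per GB at this row width */
      do j=1 to 100;
         x{j}=ranuni(2);
         c{j}=int(ranuni(1)*4);
      end;
      y=int(ranuni(1)*2);
      joinvalue=int(ranuni(1)*20);
      output;
   end;
run;
```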

 

 

When we look on HDFS (see figure 1), we see that two items were created: an SPD Engine metadata file, i.e., testdata.mdf.0.0.0.spds9, and a directory whose name is the table name with “_spde” appended, i.e., testdata_spde.

 

[Image: SPDE.Blog3.image2.png]

Figure 1. Results of the Hadoop command: hadoop fs -ls /user/sasss1

 

When we look in the directory /user/sasss1/testdata_spde (see figure 2), we see the data partitions for our table.

 

[Image: SPDE.Blog3.image3.png]

Figure 2. Results of the Hadoop command: hadoop fs -ls /user/sasss1/testdata_spde

 

The target table we created is 20 GB and we set the partition size to 4 GB. Notice in figure 2 that four partition files are 4 GB each, while four other partition files are less than 1 GB. This is because we told the SPD Engine to write the data in parallel. The SAS server I ran this code on has 4 cores, so 4 threads write to HDFS in parallel. The first set of 4 threads wrote data in parallel until each hit the 4 GB partition size; at that point, 16 GB of the 20 GB table had been created. The second set of 4 threads had only 4 GB left to write, which is why there are 4 data partitions close to 1 GB each.

 

When indexing an SPD Engine table stored on HDFS, remember to use the PARALLELWRITE=NO LIBNAME statement option or data set option. PARALLELWRITE=NO specifies that parallel processing occurs only if a Read operation includes WHERE processing; this is the default behavior for the SPD Engine. Note that to enable asynchronous parallel index creation, use the ASYNCINDEX=YES data set option when you create the SPD Engine table.
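A sketch of an indexed table creation, combining these options (work.source is a hypothetical input table):

```sas
/* Create the table with two simple indexes built asynchronously in parallel.
   PARALLELWRITE=NO, the default, is required whenever indexes are created. */
data myspde.indexed (parallelwrite=no asyncindex=yes index=(i j));
   set work.source;   /* hypothetical input table */
run;
```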

 

To improve I/O operation performance, consider setting a different SPD Engine I/O block size. Keep in mind that the larger the block size, the less I/O. For example, when reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance. However, when retrieving a subset of the data such as with WHERE processing, a smaller block size performs better. You can specify a different block size with the IOBLOCKSIZE= LIBNAME statement option, or the IOBLOCKSIZE= data set option when you create the SPD Engine table.
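Both placements of the option can be sketched as below; the block sizes shown are illustrative values (in bytes), not recommendations for your workload:

```sas
/* Larger blocks: fewer I/O operations, better when reading most of the table */
libname spdebig spde '/user/sasss1' hdfshost=default ioblocksize=1048576;

/* Smaller blocks for a table that is mostly read via WHERE subsetting */
data myspde.subset (ioblocksize=65536);
   set work.source;   /* hypothetical input table */
run;
```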

 

A note about compression and encryption: both are supported via data set options, and encryption protects data in transit as well as at rest. Note that you cannot both compress and encrypt a table; you can only compress or encrypt. Also, if you compress or encrypt, WHERE processing on the SPD Engine table cannot be pushed down to MapReduce.
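As a sketch, one table compressed and a separate table encrypted (never both on the same table; work.source and the key value are hypothetical):

```sas
/* Compressed table -- a compressed table cannot also be encrypted */
data myspde.packed (compress=yes);
   set work.source;   /* hypothetical input table */
run;

/* Encrypted table -- data is protected in transit and at rest */
data myspde.secure (encrypt=aes encryptkey="0urSecret");
   set work.source;
run;
```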

 

Speaking of reading SPD Engine data stored on HDFS: stay tuned, because in my next post we will explore how to read an SPD Engine table on HDFS.

 

Here are links to other posts in the SAS Data Management for Hadoop series for reference:

 

 

Comments
by Super User
on ‎12-04-2015 10:18 AM

Great sum up, and some good news.

But I didn't really follow the PARALLELWRITE and index creation relationship.

What will happen if I specify PARALLELWRITE=YES and ASYNCINDEX=YES?

Can I first create the table using ASYNCINDEX=YES, and do subsequent inserts using PARALLELWRITE=YES...?

Oh, I see, multiple parallel threads cannot update the same index file - right?

by SAS Employee SteveSober
‎12-04-2015 01:57 PM - edited ‎12-04-2015 04:25 PM

Great questions, LinusH.  

 

When you specify PARALLELWRITE=YES and ASYNCINDEX=YES in your code, it will run successfully provided you do not try to create an index:

 

%let partsize=4G;

DATA SPDE.MYTABLE (PARALLELWRITE=YES
                   ASYNCINDEX=YES
                   PARTSIZE=&partsize
                  );
   array x{100};
   array c{100};
   do i=1 to 625000 * &Num_GB;
      do j=1 to 100;
         x{j}=ranuni(2);
         c{j}=int(ranuni(1)*4);
      end;
      y=int(ranuni(1)*2);
      joinvalue=int(ranuni(1)*20);
      output;
   end;
run;

INFO: Parallel write processing is being performed using 4 threads.
NOTE: The data set SPDE.MYTABLE has 625000 observations and 204 variables.
NOTE: DATA statement used (Total process time):
      real time           5.58 seconds
      cpu time            4.66 seconds

If you try to index (review line 48) you will get an error:

 

43        +%let partsize=4G;
44        +
45        +DATA SPDE.MYTABLE (PARALLELWRITE=YES
46        +                    ASYNCINDEX=YES
47        +                    PARTSIZE=&partsize
48        +                    INDEX=(i j)
49        +                   );
50        +        array x{100};
51        +        array c{100};
52        +        do i=1 to 625000 * &Num_GB;
53        +          do j=1 to 100;
54        +                x{j}=ranuni(2);
55        +                c{j}=int(ranuni(1)*4);
56        +          end;
57        +          y=int(ranuni(1)*2);
58        +          joinvalue=int(ranuni(1)*20);
59        +          output;
60        +        end;
61        +run;

ERROR: You cannot process data set SPDE.MYTABLE with the PARALLELWRITE= option and also request to create an index.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: SAS set option OBS=0 and will continue to check statements. This might cause NOTE: No observations in data set.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds 

 

A couple of points about creating indexes on an SPD Engine table: always create simple indexes, and never create composite indexes. The reasons are that the SPD Engine will build a composite index on the fly from the simple indexes, and that if you build a composite index but your WHERE clause does not reference all of the composite index's variables, no index is used when subsetting the data.
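For instance, the two index styles look like this (a sketch; the table names are hypothetical and the input is the table created in the log above):

```sas
/* Preferred: simple indexes, which the SPD Engine can combine on the fly */
data spde.mytable2 (index=(i j));
   set spde.mytable;
run;

/* Avoid: a composite index is used only when the WHERE clause
   references all of its variables */
data spde.mytable3 (index=(ij=(i j)));
   set spde.mytable;
run;
```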
