Hi all,
I have a SAS programme that accesses the datasets stored in Hadoop using the SPD Engine.
I have read in SAS documentation that when the Hadoop engine is used with Hive tables, some SAS functions can be passed through to Hadoop for processing.
Are there any functions that can be used when the SPD engine is used?
Many thanks all for your time
The SPDE for HDFS documentation is available here --> SAS(R) 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System
The big take-away is that the SPD Engine for HDFS does not use MapReduce, and WHERE processing isn't fully implemented yet. It is part of SAS Foundation, which means you don't need SAS/ACCESS. Keep in mind that it is a LIBNAME engine, but not a SAS/ACCESS engine.
Many, perhaps all, of the example SPDE LIBNAME statements look like magic. For example, they don't mention any of the connection information for HDFS. In order to connect, you will need to set two environment variables (a sketch follows below).
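As a minimal sketch, assuming the two environment variables in question are SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH (paths and library names below are placeholders for your site's values), the setup could look something like this:

/* Placeholder paths: point these at your site's Hadoop client JARs and config files. */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoopcfg";

/* SPD Engine LIBNAME pointing at an HDFS directory.                       */
/* HDFSHOST=DEFAULT picks up the connection details from the config files. */
libname myspde spde '/user/sasdata' hdfshost=default;

The environment variables can also be set at the operating-system level or in the SAS configuration file instead of with OPTIONS SET=.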
I don't have any truth to offer here, and I'm also curious to hear some more details from the Institute, since the documentation about this configuration is quite scant.
In the meantime, I'll take the liberty of elaborating on the subject.
What the SPD Engine does is store and retrieve the data in HDFS. Any other calculations, such as grouping, are not passed down; they are done in the SAS process. And SPDE is not a database engine that you could execute pass-through queries against.
So I wouldn't expect any calculations to be passed to Hadoop. The only possible candidates are the functions that are supported for parallel WHERE evaluation.
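To make that concrete, here is a minimal sketch (library, data set, and variable names are made up) of the kind of WHERE subsetting being discussed; as described above, the filtering is evaluated by the SPD Engine in the SAS session rather than inside Hadoop:

/* Hypothetical SPDE library over HDFS. */
libname myspde spde '/user/sasdata' hdfshost=default;

/* The WHERE clause subsets the rows, but in this release the evaluation */
/* happens on the SAS side, not in the cluster.                          */
data work.subset;
   set myspde.transactions;
   where trans_date >= '01JAN2015'd;
run;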
Hi LinusH,
Thanks for the explanation, and for confirming that SPDE cannot be used for pass-through queries.
I will need to use the Hadoop engine for that.
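For reference, this is roughly what explicit pass-through with the SAS/ACCESS Hadoop engine looks like (server name, port, and table are placeholders; the SQL inside the parentheses is executed by Hive):

/* Hypothetical connection values; adjust for your Hive server. */
proc sql;
   connect to hadoop (server="hive-node.example.com" port=10000);
   create table work.top_accounts as
   select * from connection to hadoop
      (select account_id, sum(amount) as total_amount
         from transactions
        group by account_id);
   disconnect from hadoop;
quit;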
If you have a license for SAS/ACCESS to Hadoop and a need for a full-blown MPP implementation, this would be the best choice.
I haven't tried either of them in a live production environment yet, so this is my best guess.
If you have a use case, it would be fairly easy to see the difference in load/query performance.
I have got a license and it is something I will be exploring. Thanks again.
I've read it, and as I said, it does not clarify which features are supported (apart from the "hard" ones: the support for different data set options and file types).
Can you elaborate on WHERE clauses? What does "isn't fully implemented" mean?
Parallel update of indexes must be supported, right? In my world that would mean it works MPP-wise.
And what about implicit sorting: does that take place in HDFS, or in the SAS client?
With regard to WHERE clauses not being fully implemented: it means that WHERE clauses do not currently get pushed down to HDFS. This functionality is planned for the next release of SAS (sometime in Q3). To answer your last two questions, I will need to do some research; I don't know the answers.
Please, do, I would appreciate it much!
If neither WHERE clauses nor sorting is pushed down to HDFS, I can't really see the point of this feature. But Q3 isn't far away...
Thanks all for the useful information, very helpful.
In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. Using the Base SAS SPD Engine with Hadoop, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of data is returned to the SAS client.
By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify the ACCELWHERE= LIBNAME statement option or the ACCELWHERE= data set option.
For the supported syntax and the complete documentation about WHERE processing optimization, including the data set and SAS code requirements, see WHERE Processing Optimization with MapReduce.
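As a rough illustration (library path, data set, and variable names are placeholders; see the documentation above for the exact requirements), requesting in-cluster subsetting looks something like this:

/* Request that WHERE subsetting run in the Hadoop cluster via MapReduce */
/* for every data set in the library.                                    */
libname myspde spde '/user/sasdata' hdfshost=default accelwhere=yes;

/* Or request it for a single step with the data set option. */
data work.subset;
   set myspde.transactions (accelwhere=yes);
   where region = 'EMEA';
run;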