SPD Engine - Hadoop

Contributor
Posts: 23

Hi all,

I have a SAS programme that accesses the datasets stored in Hadoop using the SPD Engine.

I have read in SAS documentation that when the Hadoop engine is used with Hive tables, some SAS functions can be passed through to Hadoop for processing.

Are there any functions that can be passed through in the same way when the SPD Engine is used?

Many thanks all for your time


All Replies
Super User
Posts: 5,257

Re: SPD Engine - Hadoop

I don't have a definitive answer to offer here, and I'm also curious to hear more details from the Institute, since the documentation about this configuration is quite scant.

In the meantime, I'll take the liberty of elaborating on the subject.

What SPDE does is offer:

  • parallelism for I/O reads and index updates
  • parallel WHERE evaluation
  • implicit parallel sorting

So any other calculations, such as grouping, are not passed to SPDE; they are done in the SAS process. And SPDE is not a database engine that you could execute pass-through queries against.

So I wouldn't expect any calculation to be passed to Hadoop. The only possible functions are those that are supported for parallel WHERE evaluation.

Data never sleeps
Contributor
Posts: 23

Re: SPD Engine - Hadoop

Hi LinusH,

Thanks for the explanation, and for confirming that SPDE cannot be used for pass-through queries.

I will need to use the Hadoop engine for that.

Super User
Posts: 5,257

Re: SPD Engine - Hadoop

If you have a license for SAS/ACCESS to Hadoop and a need for a full-blown MPP implementation, this would be the best choice.

I haven't tried either of them in a live production environment yet, so this is my best guess.

If you have a use case, it would be fairly easy to see the difference in load/query performance.

Contributor
Posts: 23

Re: SPD Engine - Hadoop

I have got a license and it is something I will be exploring. Thanks again.

Solution
‎05-06-2014 02:30 PM
SAS Employee
Posts: 203

Re: SPD Engine - Hadoop

The SPDE for HDFS documentation is available here --> SAS(R) 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System

The big take-away is that the SPD Engine for HDFS does not use MapReduce, and WHERE processing isn't fully implemented yet. It is part of SAS Foundation, which means you don't need SAS/ACCESS. Keep in mind it is a LIBNAME engine, but not a SAS/ACCESS engine.

Many, perhaps all, of the example SPDE LIBNAME statements look like magic. For example, they don't mention any of the connection information for HDFS. To connect, you will need to set two environment variables.

  1. SAS_HADOOP_JAR_PATH= This is the location of the JAR files that SAS needs in order to connect to Hadoop. This is covered in the SAS/ACCESS manual. You will likely need to get these files from your Hadoop administrator.
  2. SAS_HADOOP_CONFIG_PATH= You will need to copy the core-site.xml, hdfs-site.xml and mapred-site.xml files onto the machine where SAS is running. Put them in a directory and point this environment variable to that directory. You will likely need to get these files from your Hadoop administrator.
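
As a rough illustration of the two steps above, the following sketch sets both environment variables from inside a SAS session and then assigns an SPD Engine libref. The directory paths, the libref name and the HDFS directory are placeholders invented for this example, not values from this thread; the environment variables can also be set at the operating system level instead.

```sas
/* Hypothetical local directories: replace them with the locations of   */
/* the Hadoop JAR files and the copied *-site.xml configuration files.  */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* Hypothetical libref and HDFS directory. HDFSHOST=DEFAULT asks the    */
/* SPD Engine to take the connection details from the configuration     */
/* files found through SAS_HADOOP_CONFIG_PATH.                          */
libname myspde spde '/user/sasdemo/spde' hdfshost=default;
```

This is why the documented LIBNAME examples carry no host or port: the connection information comes from the configuration files rather than from the statement itself.
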
Super User
Posts: 5,257

Re: SPD Engine - Hadoop

I've read it, and as I said, it does not clarify which features are supported (apart from the "hard" ones: the support for different data set options and file types).

Can you elaborate on WHERE clauses? What does "isn't fully implemented" mean?

Parallel update of indexes: that must be supported, right? Which, in my world, would mean MPP-wise.

And what about implicit sorting: does that take place in HDFS, or in the SAS client?

SAS Employee
Posts: 203

Re: SPD Engine - Hadoop

With regard to WHERE clauses not being fully implemented: it means that WHERE clauses do not currently pass down to HDFS. This functionality is planned for the next release of SAS (sometime in Q3). To answer your last two questions, I will need to do some research. I don't know the answers.

Super User
Posts: 5,257

Re: SPD Engine - Hadoop

Please do, I would appreciate it very much!

If neither the WHERE clause nor sorting is pushed down to HDFS, I can't really see the point of this feature. But Q3 isn't far away...

Contributor
Posts: 23

Re: SPD Engine - Hadoop

Thanks all for the useful information, very helpful.

SAS Employee
Posts: 5

Re: SPD Engine - Hadoop

In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. Using the Base SAS SPD Engine with Hadoop, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of data is returned to the SAS client.

By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify either the ACCELWHERE= LIBNAME statement option or the ACCELWHERE= data set option.

WHERE processing optimization supports the following syntax:

  • comparison operators such as EQ (=), NE (^=), GT (>), LT (<), GE (>=), LE (<=)
  • IN operator
  • full bounded range condition, such as where 500 <= empnum <= 1000;
  • BETWEEN-AND operator, such as where empnum between 500 and 1000;
  • compound expressions using the logical operators AND, OR, and NOT, such as where skill = 'java' or years = 4;
  • parentheses to control the order of evaluation, such as where (product='GRAPH' or product='STAT') and country='Canada';
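
A minimal sketch of how such a request might look, assuming the two Hadoop environment variables are already set. The libref, HDFS directory and data set name are hypothetical; the WHERE expressions are taken from the list above. Whether the subsetting actually runs in the cluster also depends on the data set and SAS code requirements described in the documentation.

```sas
/* Hypothetical libref and HDFS directory. ACCELWHERE=YES on the        */
/* LIBNAME statement requests cluster-side subsetting for the library.  */
libname myspde spde '/user/sasdemo/spde' hdfshost=default accelwhere=yes;

/* Hypothetical data set EMPLOYEES, subset with a BETWEEN-AND range. */
data work.midrange;
   set myspde.employees;
   where empnum between 500 and 1000;
run;

/* The same request made per data set with the data set option,  */
/* here with a compound expression using OR.                     */
proc print data=myspde.employees(accelwhere=yes);
   where skill = 'java' or years = 4;
run;
```

The data set option is useful when only some tables or steps should request cluster-side subsetting, while the LIBNAME option applies the request to everything read through the libref.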

For the complete documentation about WHERE processing optimization and the data set and SAS code requirements, see WHERE Processing Optimization with MapReduce.

☑ This topic is SOLVED.
