msd83
Calcite | Level 5

Hi all,

I have a SAS programme that accesses the datasets stored in Hadoop using the SPD Engine.

I have read in SAS documentation that when the Hadoop engine is used with Hive tables, some SAS functions can be passed through to Hadoop for processing.

Are any functions passed through for processing when the SPD Engine is used?

Many thanks all for your time


10 REPLIES
LinusH
Tourmaline | Level 20

I don't have any hard facts to offer here, and I'm also curious to hear more details from SAS Institute, since the documentation about this configuration is quite scant.

In the meantime, I'll take the liberty of elaborating on the subject.

What the SPD Engine does is:

  • offer parallelism for I/O reads and index updates
  • evaluate WHERE clauses in parallel
  • perform implicit parallel sorting

So any other calculation, such as grouping, is not passed to the SPD Engine; it is done in the SAS process. And the SPD Engine is not a database engine that you can execute pass-through queries against.

So I wouldn't expect any calculation to be passed to Hadoop. The only possible functions are those that are supported for parallel WHERE evaluation (see the sketch below).
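
To make that division of labour concrete, here is a minimal sketch of an ordinary SPD Engine library and a query against it. The paths and the mylib.sales table are made up for illustration; only the WHERE filtering is handled by the engine's parallel threads, while the summarization still runs in the SAS session.

/* SPD Engine library on local/SAN storage (hypothetical paths)  */
libname mylib spde '/data/spde/meta'
   datapath=('/data/spde/data1' '/data/spde/data2')
   indexpath=('/data/spde/idx');

/* The WHERE clause is evaluated by the engine's parallel WHERE  */
/* processing; PROC MEANS itself still runs in the SAS process.  */
proc means data=mylib.sales;
   where region = 'EMEA' and year = 2015;
   var amount;
run;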

Data never sleeps
msd83
Calcite | Level 5

Hi LinusH,

Thanks for the explanation, and for confirming that the SPD Engine cannot be used for pass-through queries.

I will need to use the Hadoop engine for that.

LinusH
Tourmaline | Level 20

If you have a license for SAS/ACCESS to Hadoop and a need for a full-blown MPP implementation, that would be the best choice.

I haven't tried either of them in a real production environment yet, so this is my best guess.

If you have a use case, it would be fairly easy to see the difference in load/query performance.

Data never sleeps
msd83
Calcite | Level 5

I have got a license and it is something I will be exploring. Thanks again.

JBailey
Barite | Level 11

The SPDE for HDFS documentation is available here --> SAS(R) 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System

The big takeaway is that the SPD Engine for HDFS does not use MapReduce, and WHERE processing isn't fully implemented yet. It is part of SAS Foundation, which means you don't need SAS/ACCESS. Keep in mind it is a LIBNAME engine, but not a SAS/ACCESS engine.

Many, perhaps all, of the example SPDE LIBNAME statements look like magic: they don't mention any of the connection information for HDFS. In order to connect, you will need to set two environment variables (there is a sketch after the list below).

  1. SAS_HADOOP_JAR_PATH= the location of the JAR files that SAS needs in order to connect to Hadoop. This is covered in the SAS/ACCESS documentation. You will likely need to get these files from your Hadoop administrator.
  2. SAS_HADOOP_CONFIG_PATH= copy the core-site.xml, hdfs-site.xml, and mapred-site.xml files onto the machine where SAS is running, put them in a directory, and point this environment variable to that directory. You will likely need to get these files from your Hadoop administrator.
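
Here is a minimal sketch, assuming the JAR and configuration files have already been copied to the (made-up) directories below. The environment variables can also be set in sasv9.cfg or at the operating-system level instead of with OPTIONS SET=.

/* Point SAS at the Hadoop client JARs and configuration files   */
/* (hypothetical paths).                                         */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* SPD Engine library pointing at an HDFS path. HDFSHOST=DEFAULT */
/* tells the engine to pick up the connection details from the   */
/* configuration files above; /user/myid/spde is hypothetical.   */
libname hdplib spde '/user/myid/spde' hdfshost=default;

/* Copy a SAS data set into HDFS through the SPD Engine.         */
data hdplib.class;
   set sashelp.class;
run;
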
LinusH
Tourmaline | Level 20

I've read it, and as I said, it does not clarify which features are supported (apart from the "hard" ones: the support for different data set options and file types).

Can you elaborate more on WHERE clauses? What does "isn't fully implemented" mean?

Parallel update of indexes: that must be supported, right? Which, in my world, would mean MPP-wise.

And what about implicit sorting: does that take place in HDFS, or in the SAS client?

Data never sleeps
JBailey
Barite | Level 11

With regard to WHERE clauses not being fully implemented: it means that WHERE clauses do not currently get pushed down to HDFS. This functionality is planned for the next release of SAS (sometime in Q3). To answer your last two questions, I will need to do some research; I don't know the answers.

LinusH
Tourmaline | Level 20

Please do, I would appreciate it very much!

Because if neither WHERE clauses nor sorting is pushed down to HDFS, I can't really see the point of this feature. But Q3 isn't far away...

Data never sleeps
msd83
Calcite | Level 5

Thanks all for the useful information, very helpful.

DWarner
SAS Employee

In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. Using the Base SAS SPD Engine with Hadoop, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of data is returned to the SAS client.

 

By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify the ACCELWHERE= LIBNAME statement option or the ACCELWHERE= data set option, as in the sketch below.
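
A minimal sketch of both ways to request the pushdown; the library path, host setting, and table names are hypothetical.

/* Library level: eligible WHERE clauses against this library are */
/* pushed to the Hadoop cluster as MapReduce.                     */
libname hdplib spde '/user/myid/spde' hdfshost=default accelwhere=yes;

/* Table level: request pushdown for a single step only.          */
data work.subset;
   set hdplib.sales (accelwhere=yes);
   where year = 2015;
run;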

 

WHERE processing optimization supports the following syntax (illustrated after the list):

 

  • comparison operators such as EQ (=), NE (^=), GT (>), LT (<), GE (>=), LE (<=)
  • IN operator
  • full bounded range condition, such as where 500 <= empnum <= 1000;
  • BETWEEN-AND operator, such as where empnum between 500 and 1000;
  • compound expressions using the logical operators AND, OR, and NOT, such as where skill = 'java' or years = 4;
  • parentheses to control the order of evaluation, such as where (product='GRAPH' or product='STAT') and country='Canada';
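
A couple of hedged examples of WHERE clauses that fit this syntax, reusing the hypothetical hdplib library from the sketch above and a made-up hdplib.emp table:

/* Fully bounded range condition */
data work.range;
   set hdplib.emp (accelwhere=yes);
   where 500 <= empnum <= 1000;
run;

/* IN operator combined with a compound expression in parentheses */
data work.skills;
   set hdplib.emp (accelwhere=yes);
   where (skill = 'java' or skill = 'sas')
         and country in ('Canada', 'Sweden');
run;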

For the complete documentation about WHERE processing optimization and the data set and SAS code requirements, see WHERE Processing Optimization with MapReduce.

