Hi all,
I have a SAS programme that accesses the datasets stored in Hadoop using the SPD Engine.
I have read in SAS documentation that when the Hadoop engine is used with Hive tables, some SAS functions can be passed through to Hadoop for processing.
Are there any functions that can be used when the SPD engine is used?
Many thanks all for your time
The SPDE for HDFS documentation is available here --> SAS(R) 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System
The big take-away is that the SPD Engine for HDFS does not use MapReduce, and WHERE processing isn't fully implemented yet. It is part of SAS Foundation, which means you don't need SAS/ACCESS. Keep in mind that it is a LIBNAME engine, but not a SAS/ACCESS engine.
Many, perhaps all, of the example SPDE LIBNAME statements look like magic. For example, they don't mention any of the connection information for HDFS. In order to connect, you will need to set two environment variables (a sketch follows below).
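As a minimal sketch, assuming the two environment variables in question are SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH (paths and library names below are placeholders for your site's values), the setup could look something like this:

/* Placeholder paths: point these at your site's Hadoop client JARs and config files. */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoopcfg";

/* SPD Engine LIBNAME pointing at an HDFS directory.                       */
/* HDFSHOST=DEFAULT picks up the connection details from the config files. */
libname myspde spde '/user/sasdata' hdfshost=default;

The environment variables can also be set at the operating-system level or in the SAS configuration file instead of with OPTIONS SET=.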
I don't have any truth to offer here, and I'm also curious to hear some more details from the Institute, since the documentation about this configuration is quite scant.
In the meantime, I'll take the liberty of elaborating on the subject.
What the SPD Engine does is store and retrieve the data in HDFS. Any other calculations, such as grouping, are not passed down; they are done in the SAS process. And SPDE is not a database engine that you could execute pass-through queries against.
So I wouldn't expect any calculations to be passed to Hadoop. The only possible candidates are the functions that are supported for parallel WHERE evaluation.
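To make that concrete, here is a minimal sketch (library, data set, and variable names are made up) of the kind of WHERE subsetting being discussed; as described above, the filtering is evaluated by the SPD Engine in the SAS session rather than inside Hadoop:

/* Hypothetical SPDE library over HDFS. */
libname myspde spde '/user/sasdata' hdfshost=default;

/* The WHERE clause subsets the rows, but in this release the evaluation */
/* happens on the SAS side, not in the cluster.                          */
data work.subset;
   set myspde.transactions;
   where trans_date >= '01JAN2015'd;
run;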
Hi LinusH,
Thanks for the explanation, and for confirming that SPDE cannot be used for pass-through queries.
I will need to use the Hadoop engine for that.
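For reference, this is roughly what explicit pass-through with the SAS/ACCESS Hadoop engine looks like (server name, port, and table are placeholders; the SQL inside the parentheses is executed by Hive):

/* Hypothetical connection values; adjust for your Hive server. */
proc sql;
   connect to hadoop (server="hive-node.example.com" port=10000);
   create table work.top_accounts as
   select * from connection to hadoop
      (select account_id, sum(amount) as total_amount
         from transactions
        group by account_id);
   disconnect from hadoop;
quit;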
If you have a license for SAS/ACCESS to Hadoop and a need for a full-blown MPP implementation, this would be the best choice.
I haven't tried either of them in a live production environment yet, so this is my best guess.
If you have a use case, it would be fairly easy to see the difference in load/query performance.
I have got a license and it is something I will be exploring. Thanks again.
I've read it, and as I said, it does not clarify which features are supported (apart from the "hard" ones: the support for different data set options and file types).
Can you elaborate on WHERE clauses? What does "isn't fully implemented" mean?
Parallel update of indexes must be supported, right? In my world that would mean it works MPP-wise.
And what about implicit sorting: does that take place in HDFS, or in the SAS client?
With regard to WHERE clauses not being fully implemented: it means that WHERE clauses do not currently get pushed down to HDFS. This functionality is planned for the next release of SAS (sometime in Q3). To answer your last two questions, I will need to do some research; I don't know the answers.
Please, do, I would appreciate it much!
If neither WHERE clauses nor sorting is pushed down to HDFS, I can't really see the point of this feature. But Q3 isn't far away...
Thanks all for the useful information, very helpful.
In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. Using the Base SAS SPD Engine with Hadoop, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of data is returned to the SAS client.
By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify the ACCELWHERE= LIBNAME statement option or the ACCELWHERE= data set option.
For the supported syntax and the complete documentation about WHERE processing optimization, including the data set and SAS code requirements, see WHERE Processing Optimization with MapReduce.
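As a rough illustration (library path, data set, and variable names are placeholders; see the documentation above for the exact requirements), requesting in-cluster subsetting looks something like this:

/* Request that WHERE subsetting run in the Hadoop cluster via MapReduce */
/* for every data set in the library.                                    */
libname myspde spde '/user/sasdata' hdfshost=default accelwhere=yes;

/* Or request it for a single step with the data set option. */
data work.subset;
   set myspde.transactions (accelwhere=yes);
   where region = 'EMEA';
run;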