Solved: Working with SAS in Hadoop and Spark

juanvg1972 · Posted 06-01-2017 11:29 AM

Hi,

I have to work with SAS datasets and SAS processes in a Hadoop environment, and I have some questions.

- If I want to use Hive from a LIBNAME statement, I need a SAS/ACCESS Interface to Hadoop license, is that correct?

- If I want to use a FILENAME statement against HDFS, no additional license is necessary, is it?

- What license do I need to use PROC HADOOP? And PROC DS2?

- Last question: is there anything similar for working in a Spark environment?

If there is a more appropriate forum for this question, let me know.

Thanks in advance,

Juan

LinusH · Posted 06-01-2017 01:36 PM

Yes, using Hive through a LIBNAME requires SAS/ACCESS Interface to Hadoop.
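As a rough sketch of what the LIBNAME route looks like, assuming SAS/ACCESS Interface to Hadoop is licensed and the Hadoop client JARs and configuration files are already on the SAS machine. The paths, server name, port, schema and user below are all made-up placeholders, not values from this thread:

```sas
/* Assumed locations of the Hadoop JAR and config files (hypothetical paths). */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* Assign a libref to a Hive schema; server, port, schema and user are placeholders. */
libname hdp hadoop server="hive.example.com" port=10000 schema=default user=myuser;

/* List the Hive tables that are visible through the libref. */
proc datasets lib=hdp;
quit;
```

Once the libref is assigned, Hive tables can be referenced as hdp.tablename like any other SAS library member.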

Yes, the FILENAME statement for HDFS (the Hadoop access method) is covered by Base SAS.
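A minimal sketch of the FILENAME approach, assuming the SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH environment variables are already set for the session. The HDFS path and user name are hypothetical:

```sas
/* Fileref pointing at a file in HDFS (path and user are placeholders). */
filename hfile hadoop "/user/myuser/demo.txt" user="myuser";

/* Write one line to the HDFS file. */
data _null_;
   file hfile;
   put "written from a SAS DATA step";
run;

/* Read the file back and echo each line to the SAS log. */
data _null_;
   infile hfile;
   input;
   put _infile_;
run;
```

Note that this moves the data through the SAS session; nothing here runs in parallel on the cluster.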

PROC HADOOP is also in Base SAS.
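For illustration, a small PROC HADOOP sketch that issues HDFS commands. It assumes the Hadoop JAR and configuration environment variables are already set; the user name and both paths are placeholders:

```sas
/* Submit HDFS commands from SAS; VERBOSE adds detail to the log. */
proc hadoop username="myuser" verbose;
   /* Create a staging directory in HDFS (hypothetical path). */
   hdfs mkdir="/user/myuser/staging";
   /* Copy a local file into that directory (hypothetical file). */
   hdfs copyfromlocal="/tmp/sales.csv" out="/user/myuser/staging/sales.csv";
run;
```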

PROC DS2 runs in Base SAS, but if you want it to execute within Hadoop, you need the Embedded Process, which is an add-on to ACCESS if I recall correctly.
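A sketch of the usual thread/data program pattern for running DS2 inside Hadoop, assuming the Embedded Process is installed on the cluster and a Hive libref is available. The libref hdp, the connection values, the table sales and its columns price and qty are all invented for the example:

```sas
/* Placeholder Hive connection. */
libname hdp hadoop server="hive.example.com" port=10000 schema=default user=myuser;

/* DS2ACCEL=YES asks SAS to run the thread program in-database where possible. */
proc ds2 ds2accel=yes;
   /* Thread program: this is the part that can run in parallel on the data nodes. */
   thread work.compute / overwrite=yes;
      dcl double revenue;
      method run();
         set hdp.sales;              /* hypothetical input table */
         revenue = price * qty;      /* hypothetical numeric columns */
      end;
   endthread;
   run;

   /* Data program: collects the thread output into a new Hive table. */
   data hdp.sales_rev (overwrite=yes);
      dcl thread work.compute t;
      method run();
         set from t;
      end;
   enddata;
   run;
quit;
```

If the in-database requirements are not met, the same program should still run, just inside the SAS session rather than on the cluster, so check the log to see where it actually executed.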

Spark can be utilized in Data Loader for Hadoop. This is only mentioned once at support.sas.com, so I presume it's still in its early days.

Data never sleeps

juanvg1972 · Posted 06-01-2017 04:58 PM

Thank you for your help, LinusH. Very useful.

I would like to know the best way to execute a SAS program in Hadoop if I want to take advantage of Hadoop's parallel execution. For example, for a SAS program that runs some data steps and procs on data in Hadoop, what is the best way to benefit from the parallelization of the Hadoop cluster?

Thanks

LinusH · Posted 06-02-2017 02:48 AM

If you have a data step that can't be rewritten to SQL, I think currently rewriting it to PROC DS2 and using the embedded process is your only option.

For SQL (given your data is stored in Hive or another relational format), you can make use of implicit or explicit SQL pass-through.

Other procedures that implicitly generate pass-through SQL:
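To make the two pass-through styles concrete, here is a sketch under assumed names: a Hive table sales with columns region and amount, and placeholder connection values. None of these come from the thread:

```sas
/* Placeholder Hive connection for the implicit case. */
libname hdp hadoop server="hive.example.com" port=10000 schema=default user=myuser;

/* Explicit pass-through: the inner query is HiveQL and is sent to Hive as written. */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000 user=myuser);
   create table work.region_totals as
   select * from connection to hadoop
      ( select region, sum(amount) as total
        from sales
        group by region );
   disconnect from hadoop;
quit;

/* Implicit pass-through: ordinary PROC SQL against the libref; SAS attempts to
   translate the query to HiveQL. SASTRACE writes the generated SQL to the log. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
proc sql;
   select region, sum(amount) as total
   from hdp.sales
   group by region;
quit;
```

With implicit pass-through it is worth reading the trace output, because a query that uses a SAS function Hive cannot handle may end up pulling the rows back to SAS instead.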

FREQ

MEANS

RANK [Hadoop with Hive .13 and later]

REPORT

SORT [Hadoop with Hive .13 and later]

SUMMARY

TABULATE

TRANSPOSE [Hadoop and Teradata only]
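As a hedged illustration of the procedures listed above, the sketch below runs PROC FREQ and PROC MEANS against a Hive libref with in-database SQL generation requested. The connection values, the table sales and its columns are assumptions made for the example:

```sas
/* Ask eligible procedures to generate SQL for the DBMS, and trace what is sent. */
options sqlgeneration=dbms sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Placeholder Hive connection. */
libname hdp hadoop server="hive.example.com" port=10000 schema=default user=myuser;

/* Frequency counts: the aggregation can be pushed down to Hive. */
proc freq data=hdp.sales;
   tables region;
run;

/* Summary statistics by group, also eligible for push-down. */
proc means data=hdp.sales sum mean;
   class region;
   var amount;
run;
```

Whether a given step really ran in Hive depends on the options and statements used, so the trace lines in the log are the thing to check.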

You will probably find this matrix useful:

http://support.sas.com/documentation/cdl/en/acreldb/69580/HTML/default/viewer.htm#p13td0l6w0329rn15u...

Data never sleeps
