Solved: Re: Is proc means syntax different when working with hadoop data?

Elle · Posted 08-02-2017 04:28 AM

Hello,

I'm in a situation where I need to know why EG performs very slow.

The other day I ran a proc means on a 10mil table (100 numeric vars) that took about 6 minutes.

I'm accessing EG remotely through Citrix and the data is stored in Hadoop.

I have experience coding in SAS, but this is the first time I'm using Hadoop data.

I defined the libname:

libname testlib SPDE "/sas/data/...." hdfshost=default;

I ran a simple proc means:

ods output summary=work.outstats;
proc means data=testlib.table skewness kurtosis;
run;
ods output close;

And that took me almost 6 minutes. I can perform the same task on my workstation EG in under 1 minute.

I took the class Introduction to SAS and Hadoop some time ago and I do remember some of the stats were done using slightly different code.

Is this the case? Do I need to change the syntax of proc means to take advantage of the Hadoop platform?

I thought the SAS Workspace server would automatically connect to the HadoopNode and proc means would be performed in parallel by multiple computers. Am I wrong?

An additional question: What other explanations could there be for the slow performance? I'm asking what things would I need to check or look at to understand why the server version of SAS EG runs so slow?

Thank you.

DavidGhan · Posted 08-02-2017 09:00 AM

You are using the SPDE engine on your LIBNAME statement and your are accessing data stored in HDFS (the Hadoop file system) in SPDE format. When you use the SPDE engine with Hadoop the data can be read from or written to HDFS but the data is processed by SAS wherever SAS is running. So in your case all the data is being sent from HDFS back to SAS. This method does allow you to take advantage of distributed storage in HDFS and parallel reads and writes of the data but it does not allow you to take advantage of parallel processing in Hadoop. Your process is probably slower because of the time it takes to bring the data to where SAS is executing.

If you want to take advantage of moving some of the processing into Hadoop then you should use SAS ACCESS to Hadoop and store your data as a Hive table rather than a SPDE table.

However the way you are using PROC MEANS will be an issue as well even if you use a Hive table and SAS ACCESS to Hadoop. You are requesting the statistics KURTOSIS and SKEWNESS and the last timie I checked those are examples of the types of summary statistics that cannot be calculated in hadoop via the Hive engine. If you used PROC MEANS with a Hive table and requested statistics like SUM, MEAN and others that can be calculated in Hadoop then you should see the benefit of in-database processing. The PROC MEANS code will be converted inito a HiveQL summary query to generate the summary calculations in parallel.

The other suggestion of using High Performance Analytics and HP PROCS to calculate summary statistics is also a possibility. The procedure HPSUMMARY does calculate KURTOSIS and SKEWNESS, has a syntax similar to PROC MEANS and will execute in parallel on distributed data stored in HDFS. This does require a SAS High Performance Analytics license.

http://support.sas.com/documentation/onlinedoc/hp-analytics-server/122/hpaug.pdf

View solution in original post

Kurt_Bremser · Posted 08-02-2017 04:59 AM

Could it be that they used the high performance procedures (ie proc hpsummary instead of proc means/proc summary)?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

DavidGhan · Posted 08-02-2017 09:00 AM

You are using the SPDE engine on your LIBNAME statement and your are accessing data stored in HDFS (the Hadoop file system) in SPDE format. When you use the SPDE engine with Hadoop the data can be read from or written to HDFS but the data is processed by SAS wherever SAS is running. So in your case all the data is being sent from HDFS back to SAS. This method does allow you to take advantage of distributed storage in HDFS and parallel reads and writes of the data but it does not allow you to take advantage of parallel processing in Hadoop. Your process is probably slower because of the time it takes to bring the data to where SAS is executing.

If you want to take advantage of moving some of the processing into Hadoop then you should use SAS ACCESS to Hadoop and store your data as a Hive table rather than a SPDE table.

However the way you are using PROC MEANS will be an issue as well even if you use a Hive table and SAS ACCESS to Hadoop. You are requesting the statistics KURTOSIS and SKEWNESS and the last timie I checked those are examples of the types of summary statistics that cannot be calculated in hadoop via the Hive engine. If you used PROC MEANS with a Hive table and requested statistics like SUM, MEAN and others that can be calculated in Hadoop then you should see the benefit of in-database processing. The PROC MEANS code will be converted inito a HiveQL summary query to generate the summary calculations in parallel.

The other suggestion of using High Performance Analytics and HP PROCS to calculate summary statistics is also a possibility. The procedure HPSUMMARY does calculate KURTOSIS and SKEWNESS, has a syntax similar to PROC MEANS and will execute in parallel on distributed data stored in HDFS. This does require a SAS High Performance Analytics license.

http://support.sas.com/documentation/onlinedoc/hp-analytics-server/122/hpaug.pdf

Is proc means syntax different when working with hadoop data?

Re: Is proc means syntax different when working with hadoop data?

Re: Is proc means syntax different when working with hadoop data?

Re: Is proc means syntax different when working with hadoop data?

Catch up on SAS Innovate 2026