Hello,
I'm in a situation where I need to know why EG performs very slow.
The other day I ran a proc means on a 10mil table (100 numeric vars) that took about 6 minutes.
I'm accessing EG remotely through Citrix and the data is stored in Hadoop.
I have experience coding in SAS, but this is the first time I'm using Hadoop data.
I defined the libname:
libname testlib SPDE "/sas/data/...." hdfshost=default;
I ran a simple proc means:
ods output summary=work.outstats;
proc means data=testlib.table skewness kurtosis;
run;
ods output close;
And that took me almost 6 minutes. I can perform the same task on my workstation EG in under 1 minute.
I took the class Introduction to SAS and Hadoop some time ago and I do remember some of the stats were done using slightly different code.
Is this the case? Do I need to change the syntax of proc means to take advantage of the Hadoop platform?
I thought the SAS Workspace server would automatically connect to the HadoopNode and proc means would be performed in parallel by multiple computers. Am I wrong?
An additional question: What other explanations could there be for the slow performance? I'm asking what things would I need to check or look at to understand why the server version of SAS EG runs so slow?
Thank you.
You are using the SPDE engine on your LIBNAME statement and your are accessing data stored in HDFS (the Hadoop file system) in SPDE format. When you use the SPDE engine with Hadoop the data can be read from or written to HDFS but the data is processed by SAS wherever SAS is running. So in your case all the data is being sent from HDFS back to SAS. This method does allow you to take advantage of distributed storage in HDFS and parallel reads and writes of the data but it does not allow you to take advantage of parallel processing in Hadoop. Your process is probably slower because of the time it takes to bring the data to where SAS is executing.
If you want to take advantage of moving some of the processing into Hadoop then you should use SAS ACCESS to Hadoop and store your data as a Hive table rather than a SPDE table.
However the way you are using PROC MEANS will be an issue as well even if you use a Hive table and SAS ACCESS to Hadoop. You are requesting the statistics KURTOSIS and SKEWNESS and the last timie I checked those are examples of the types of summary statistics that cannot be calculated in hadoop via the Hive engine. If you used PROC MEANS with a Hive table and requested statistics like SUM, MEAN and others that can be calculated in Hadoop then you should see the benefit of in-database processing. The PROC MEANS code will be converted inito a HiveQL summary query to generate the summary calculations in parallel.
The other suggestion of using High Performance Analytics and HP PROCS to calculate summary statistics is also a possibility. The procedure HPSUMMARY does calculate KURTOSIS and SKEWNESS, has a syntax similar to PROC MEANS and will execute in parallel on distributed data stored in HDFS. This does require a SAS High Performance Analytics license.
http://support.sas.com/documentation/onlinedoc/hp-analytics-server/122/hpaug.pdf
Could it be that they used the high performance procedures (ie proc hpsummary instead of proc means/proc summary)?
You are using the SPDE engine on your LIBNAME statement and your are accessing data stored in HDFS (the Hadoop file system) in SPDE format. When you use the SPDE engine with Hadoop the data can be read from or written to HDFS but the data is processed by SAS wherever SAS is running. So in your case all the data is being sent from HDFS back to SAS. This method does allow you to take advantage of distributed storage in HDFS and parallel reads and writes of the data but it does not allow you to take advantage of parallel processing in Hadoop. Your process is probably slower because of the time it takes to bring the data to where SAS is executing.
If you want to take advantage of moving some of the processing into Hadoop then you should use SAS ACCESS to Hadoop and store your data as a Hive table rather than a SPDE table.
However the way you are using PROC MEANS will be an issue as well even if you use a Hive table and SAS ACCESS to Hadoop. You are requesting the statistics KURTOSIS and SKEWNESS and the last timie I checked those are examples of the types of summary statistics that cannot be calculated in hadoop via the Hive engine. If you used PROC MEANS with a Hive table and requested statistics like SUM, MEAN and others that can be calculated in Hadoop then you should see the benefit of in-database processing. The PROC MEANS code will be converted inito a HiveQL summary query to generate the summary calculations in parallel.
The other suggestion of using High Performance Analytics and HP PROCS to calculate summary statistics is also a possibility. The procedure HPSUMMARY does calculate KURTOSIS and SKEWNESS, has a syntax similar to PROC MEANS and will execute in parallel on distributed data stored in HDFS. This does require a SAS High Performance Analytics license.
http://support.sas.com/documentation/onlinedoc/hp-analytics-server/122/hpaug.pdf
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.