Writing data to Hadoop by using SAS LIBNAME (performance issue)

Contributor
Posts: 34

Writing data to Hadoop by using SAS LIBNAME (performance issue)

Hello All,

 

I am connecting to Hadoop and writing a SAS data set to it using a LIBNAME statement.

libname hdptgt hadoop server=&server port=10000 schema=sample config="&hadoop_config_file"; /* parameters passed from UNIX */

 

/** SAS code **/

data hdptgt.main_table;
   merge main_table sub_table;
   by rec_id;
run;

 

Log excerpt:

NOTE: There were 290000000 observations read from the data set WORK.MAIN_TABLE.
NOTE: There were 10000000 observations read from the data set WORK.SUB_TABLE.
NOTE: The data set HDPTGT.MAIN_TABLE has 290000000 observations and 50 variables.

 

real time           8:30:04.19

cpu time            34:31.04

 

This takes around 8 hours 30 minutes. Is there anything I could do to make this run faster? Any help would be appreciated.

Super User
Posts: 5,389

Re: Writing data to Hadoop by using SAS LIBNAME (performance issue)

Typically, one would use BULKLOAD to speed up RDBMS write operations.

Unfortunately, for Hive this is just syntax support; the same underlying process is used either way.
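
For reference, this is roughly where BULKLOAD would go, either on the LIBNAME statement or as a data set option on the output table. The values below are only an illustration against the original libref, not tested settings:

/* Sketch only: BULKLOAD specified on the LIBNAME statement... */
libname hdptgt hadoop server=&server port=10000 schema=sample
        config="&hadoop_config_file" bulkload=yes;

/* ...or as a data set option on the output table */
data hdptgt.main_table (bulkload=yes);
   merge main_table sub_table;
   by rec_id;
run;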

I would start with adding 

options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;

to your program to better analyze what's going on on the Hive side.

Other than that, I think this is a matter of HDFS/Hive optimization (given that you can rule out network bottlenecks, or local SAS session ones during the read/merge operation).
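
One thing you could try along those lines is to finish the merge in WORK first, so the Hive write is a plain sequential load, and then experiment with data set options such as DBCREATE_TABLE_OPTS on the upload step. A rough sketch; the 'stored as orc' clause is only an assumed example of a storage format to test, not a recommendation for your cluster:

/* Step 1: do the merge locally in WORK */
data work.merged;
   merge main_table sub_table;
   by rec_id;
run;

/* Step 2: load the finished table to Hive in a single pass.             */
/* The STORED AS ORC clause is an example only; adjust for your cluster. */
data hdptgt.main_table (dbcreate_table_opts='stored as orc');
   set work.merged;
run;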

Data never sleeps
Contributor
Posts: 34

Re: Writing data to Hadoop by using SAS LIBNAME (performance issue)

Unfortunately, that doesn't make much of a difference. I found this as well:

http://support.sas.com/documentation/cdl/en/acreldb/65247/HTML/default/viewer.htm#n0mnrn0q9n41atn194...

 
