GunnerEP
Obsidian | Level 7

Hello All,

 

I am connecting to hadoop & writing a SAS dataset to hadoop using a libname statement.

 

libname hdptgt hadoop server=&server port=10000 schema=sample config="&hadoop_config_file"; /*parameters passed from unix*/

 

/** sas code **/

data hdptgt.main_table;
  merge main_table sub_table;
  by rec_id;
run;
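For context, a MERGE with a BY statement requires both input data sets to be sorted on the key. A minimal sketch of the preceding sorts (data set names from the post; assuming they are not already ordered by rec_id):

```sas
/* Sketch only: sort both inputs on the merge key before the MERGE/BY step. */
proc sort data=main_table; by rec_id; run;
proc sort data=sub_table;  by rec_id; run;
```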

 

Log excerpt:

NOTE: There were 290000000 observations read from the data set WORK.MAIN_TABLE.
NOTE: There were 10000000 observations read from the data set WORK.SUB_TABLE.
NOTE: The data set HDP.MAIN_TABLE has 290000000 observations and 50 variables.

 

real time           8:30:04.19

cpu time            34:31.04

 

This takes around 8 hours 30 minutes. Is there anything I could do to make this run faster? Any help would be appreciated.

2 REPLIES
LinusH
Tourmaline | Level 20

Typically, one would use BULKLOAD to speed up RDBMS write operations.

Unfortunately, for Hive this is only syntax support; the same underlying load process is used either way.
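As a hedged sketch of where that option would go (assuming SAS/ACCESS Interface to Hadoop; the libref and macro variables are the poster's, and as noted it is not expected to change the underlying Hive load):

```sas
/* Sketch: BULKLOAD= as a data set option on the Hadoop target table.
   Accepted syntactically by the engine, but for Hive the same load
   mechanism is used with or without it. */
data hdptgt.main_table (bulkload=yes);
  merge main_table sub_table;
  by rec_id;
run;
```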

I would start with adding 

options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;

to your program to better analyze what's going on on the Hive side.

Other than that, I think this is a matter of HDFS/Hive optimization (given that you can rule out network bottlenecks, or local SAS session ones during the read/merge operation).

Data never sleeps


Discussion stats
  • 2 replies
  • 1622 views
  • 0 likes
  • 2 in conversation