Aarya
Calcite | Level 5

Hello All,

I am new to SAS and trying to understand how to load data (a table) from SAS into HDFS. The code was written by someone else and I am finding it difficult to understand. Can someone please help me understand the statements?

 

Snippet:

 

libname testingres hadoop server=servername hdfs_tempdir='some location'
  user='abc' password='abcd' schema='abc' port=10000
  DBCREATE_TABLE_OPTS='STORED AS PARQUET'
  subprotocol=hive2;

 

data testingres.a100;
  set sashelp.cars;
run;

 

The above snippet creates a table in Hive with the name a100.

 

What I want to understand is:

 

1. Where is the CREATE TABLE script mentioned?

2. Where will the libname 'testingres' be stored on the UNIX machine?

3. How does it know where to pick up the table that has to be ingested into Hive?

4. In the above snippet, the "sashelp.cars" data will be loaded into the a100 table, right?

5. At present, loading 10 GB of data takes around 18 minutes. Is there any 'parallel' option to increase the speed, or how else can I speed up the process?

 

NOTE: I am using SAS version 8.

alexal
SAS Employee

@Aarya,

Where is the CREATE TABLE script mentioned?

The DATA step creates a table in your Hadoop library. There is no table script.
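For comparison, an equivalent PROC SQL step (a sketch reusing the library and table names from your snippet) creates the same Hive table with no explicit DDL; SAS/ACCESS generates the CREATE TABLE behind the scenes:

proc sql;
  /* SAS/ACCESS issues the Hive CREATE TABLE and loads the rows itself */
  create table testingres.a100 as
    select * from sashelp.cars;
quit;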

Where will the libname 'testingres' be stored on the UNIX machine?

That libname uses the SAS/ACCESS Interface to Hadoop, so the data will be stored in Hadoop.

How does it know where to pick up the table that has to be ingested into Hive?

set sashelp.cars;

In this case, the CARS table from the SASHELP library will be loaded into Hadoop.
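If you want to peek at that source table first (SASHELP.CARS ships with SAS):

/* Preview the first five rows of the source table */
proc print data=sashelp.cars(obs=5);
run;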

In the above snippet, the "sashelp.cars" data will be loaded into the a100 table, right?

Yes.

At present, loading 10 GB of data takes around 18 minutes. Is there any 'parallel' option to increase the speed, or how else can I speed up the process?

Yes, that is available in SAS® High-Performance Analytics.

Aarya
Calcite | Level 5

Thank you for your response, Alexal. However, I am still not clear on the following.

 

Where is the CREATE TABLE script mentioned?

The DATA step creates a table in your Hadoop library. There is no table script. :--> The snippet I posted creates a Hive table and is working in the DEV environment. Creating a Hive table needs syntax, so how it creates the table is still not clear to me.

  

Where will the libname 'testingres' be stored on the UNIX machine?

That libname uses the SAS/ACCESS Interface to Hadoop, so the data will be stored in Hadoop. :--> I understand that the data gets stored in Hadoop, but how do I create the "testingres" libname, and where does it have to be present? On the machine that has SAS, or in the environment where Hadoop is installed? I have SAS on server A and Hadoop on server B.

 

How can I add a parallel option to the above script?

 

 

alexal
SAS Employee

@Aarya,

 

The snippet I posted creates a Hive table and is working in the DEV environment. Creating a Hive table needs syntax, so how it creates the table is still not clear to me.

The DATA step, along with the SAS/ACCESS Interface to Hadoop, creates that table. You do not need to pass any table scripts; just use a DATA step. You can read more about the DATA step here:

https://go.documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n1kh25to5o0wmvn1o4n4hsl3yyww.htm&...
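In its simplest form, a load is just a two-statement DATA step (a sketch with hypothetical librefs inlib and outlib):

/* Read every row from inlib.sourcetable and write it to outlib.newtable */
data outlib.newtable;
  set inlib.sourcetable;
run;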

How can I add a parallel option to the above script?

First of all, you need to have a SAS® High-Performance Analytics license and all of the related products installed. Do you have a license for that?

Aarya
Calcite | Level 5
Thank you, Alexal.
I will go through the post and check.
Aarya
Calcite | Level 5

Hi Alexal,

I just started working with SAS two days ago, and I went through the links you provided, but I did not understand anything.

 

Below is the script that I am using

 

options set=SAS_HADOOP_RESTFUL=1;
options set=SAS_HADOOP_JAR_PATH=<jar path>;
options set=SAS_HADOOP_CONFIG_PATH=<config path>;

options nofmterr;

 

%let svr = %NRSTR('server.abc.com');
%let stng = %NRSTR('stored as parquet');

 

libname aaa hadoop server=&svr hdfs_tempdir='/tmp/sastmp' user=<username>
password=<pxxx> schema=<schema name> port=10000
DBCREATE_TABLE_OPTS=&stng
subprotocol=hive2;

 

libname xxx '/workspace/abc/xyz';

data aaa.test;
set xxx.abc.test; 
run;

 

LOCATION: libname xxx '/workspace/abc/xyz';

This is the place where they drop ".sas7bdat" files, and this location keeps changing. Based on that information, we make modifications.

 

I ran this script on a 0.3 GB sas7bdat file and it created a test table successfully in Hive. But when I query the Hive table,

select count(1) from test;

I do not see any records. How can a 0.3 GB file not contain any data? Am I missing something here? How can I check the contents of a sas7bdat file?

 

Any help will be greatly appreciated.

Tom
Super User Tom
Super User

@Aarya wrote:

Hi Alexal,

I just started working with SAS two days ago, and I went through the links you provided, but I did not understand anything.

 

...

libname aaa hadoop server=&svr hdfs_tempdir='/tmp/sastmp' user=<username>
password=<pxxx> schema=<schema name> port=10000
DBCREATE_TABLE_OPTS=&stng
subprotocol=hive2;

 

libname xxx '/workspace/abc/xyz';

data aaa.test;
set xxx.abc.test; 
run;

 

LOCATION: libname xxx '/workspace/abc/xyz';

This is the place where they drop ".sas7bdat" files, and this location keeps changing. Based on that information, we make modifications.

 

I ran this script on a 0.3 GB sas7bdat file and it created a test table successfully in Hive. But when I query the Hive table,

select count(1) from test;

I do not see any records. How can a 0.3 GB file not contain any data? Am I missing something here? How can I check the contents of a sas7bdat file?

 

Any help will be greatly appreciated.


First understand how SAS uses librefs to find datasets.  You use a LIBNAME statement to define a LIBREF that points to a specific location where datasets are stored.  You then use this libref (essentially an alias for the location) by prefixing it to the name of a specific dataset in that library.  Some examples from your code: SASHELP.CARS or AAA.TEST.  (Notice that your code above is invalid since you tried to use two periods in the reference in the SET statement.)
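For example (hypothetical libref and dataset names):

libname mylib '/some/path';   /* the libref MYLIB is now an alias for that directory */

data work.copy;
  set mylib.mydata;           /* libref.dataset-name: exactly one period */
run;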

 

In this block of code you are defining two librefs: one to point to a physical directory on your SAS machine where the SAS datasets are stored, and one to point to a Hadoop environment.  One advantage of this structure is that you can create a program to copy the data that is independent of the physical locations; just change the LIBNAME statements and the same code will read and/or write to the locations they point to.

 

So make sure that the location your libref points to in the SAS code is the same place you are trying to read when using other tools to look at the Hadoop files.

 

Also make sure that the code you ran worked and actually created the files in the Hadoop cluster.  Check the SAS log from the step that creates the file.

 

So let's look at your LIBNAME statement that defines the connection to Hadoop.  First let's give it a more meaningful libref.  Remember that librefs are limited to 8 characters.

libname myhive hadoop
  user=<username> password=<pxxx>
  server=&svr port=10000
  hdfs_tempdir='/tmp/sastmp'
  schema=<schema name>
  DBCREATE_TABLE_OPTS=&stng
  subprotocol=hive2
;

I would assume that the SCHEMA= setting is the one telling the Hadoop server where you want to find/write tables.

 

Then your LIBNAME statement to point to the SAS datasets.  Again, let's use a slightly more meaningful libref.

libname mysas '/workspace/abc/xyz';

So this is saying: on the machine where SAS is running, use the specified path to find/create SAS datasets.  With a path like that it looks like SAS is running on Unix, so the physical files in that directory should be all lower case with an extension of .sas7bdat.  There might be some other files there as well if you are using other SAS features.  For example, if any of your datasets have indexes defined, there will be a separate file to store the index.  You could also create catalogs, usually for things like format catalogs.

 

If the folder where the SAS files live changes frequently, then this is the statement that needs to change.  I noticed that your program was using some macro variables to help you create your libref to Hadoop.  You could do something similar to make it easier to change the location for the SAS datasets: set the path in a macro variable and then use that macro variable to define the libref.

%let saspath = /workspace/abc/xyz ;
libname mysas "&saspath.";

Notice the use of double quotes instead of single quotes: macro variable references are not resolved inside single quotes.
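A quick illustration (hypothetical path; the single-quoted version fails because the macro reference is taken literally):

%let saspath = /workspace/abc/xyz;
libname bad  '&saspath.';   /* tries to open a directory literally named &saspath. */
libname good "&saspath.";   /* resolves to /workspace/abc/xyz */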

 

Now you can use these two librefs to reference datasets (a.k.a. "tables").  For example, here is a data step that creates a new table in Hadoop from an existing SAS dataset using the two librefs we defined above.

 

 

data myhive.myhivetable ;
  set mysas.mysasdataset ; 
run;

 

In a future SAS session, to reference that same table you would just need to make sure to define a libref using the same settings so that SAS can find it.  The actual libref name (a.k.a. alias) you use could be different; it is the details about the server, schema, etc. that SAS needs in order to find it.  For example, tomorrow you might want to pull out of Hadoop just the records where MYVAR='T'.

libname oldhive hadoop ....... ;

data work.test ;
  set oldhive.myhivetable ;
  where myvar = 'T' ;
run;

When you run a simple data step like this you are telling SAS to make a NEW dataset.  That is why there is no "table definition": it is all done as part of making a new dataset, so there is no need to separate the table creation and the table loading into two steps.  If you want to append data to an existing table in Hive, then you might want to use PROC APPEND instead.

proc append base=myhive.myhivetable data=mysas.mysasdataset ; 
run;

 

 

Aarya
Calcite | Level 5

Oh Tom, not sure how to thank you; you explained it in such a simple and clear way. I was banging my head trying to understand what I was doing here, and your explanation made me see it differently. A big thank you from the bottom of my heart. You just made my day.

 

Sorry to bug you, but I have a few clarifications.

 

" So the physical files in that directory should be in all lower case with an extension of .sas7bdat." :--> I just need to specify the table name without ".sas7bdat" extension right?

 

1. If I need to create multiple Hive tables, can I do that? Something like:

data myhive.myhivetable ;
  set mysas.mysasdataset ;
run;

data myhive1.myhivetable1 ;
  set mysas.mysasdataset ;
run;

 

2. Based on your experience, which is the better approach: the approach being followed now, or converting the SAS dataset to .csv and then loading it into Hive?

 

3. SAS is on server A and Hadoop is on server B. Can I call the script from server B (Hadoop) rather than server A? If CSV is the better approach, I would call it from server B, create an external Hive table, and point it at the location where I am creating the CSV.

 

4. Finally, I may need to load a SAS dataset around 500 GB in size. Is there any way to speed that up? I have access only to the UNIX box, and I am not sure what options are possible.

 

Once again, thanks Tom for taking your time and replying in detail. Good day, sir.

 

 

 

 

LinusH
Tourmaline | Level 20
To better audit SAS/ACCESS behaviour, issue:

options msglevel = I sastrace = ",,,d" sastraceloc = saslog nostsuffix;

Verify both the data load and the query.
Also query using Hue to rule out any SAS connection issues/translations.
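On the SAS side, a quick verification sketch (assuming the source dataset is xxx.test and the Hive libref is aaa, as in the earlier script):

/* Source side: does the SAS dataset actually contain rows? */
proc sql;
  select count(*) as src_rows from xxx.test;
quit;

/* Target side: does the Hive table contain rows when read through SAS? */
proc sql;
  select count(*) as hive_rows from aaa.test;
quit;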
Data never sleeps
Aarya
Calcite | Level 5

Thanks LinusH.

I am able to create the table, but the data is not getting populated. I found out in the SAS community that it could be because of a missing property:

 

<property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
</property>

I do not have permission to add that property; I will ask the admin to make those changes. Is there any way to speed up the process?

Aarya
Calcite | Level 5

Thanks Tom, LinusH, and alexal.

I was able to load the data after making the suggested changes and it worked. The next step is: how can I speed up the process? It took 1 minute 25 seconds for 0.3 GB.

I have to load datasets of up to 500 GB, and I checked with the team; they do not have a "SAS® High-Performance Analytics" license.

 

Thanks for all your help.

LinusH
Tourmaline | Level 20

Oh, where to start. So many tuning options, and we don't know which are the actual bottlenecks in your situation.

 

I suggest that you read some papers, like this one:

https://support.sas.com/content/dam/SAS/support/en/documentation/third-party-reference/sas-hadoop-pe...

 

Then try to narrow down your problem and get back to us (in a new post), describing the behaviour in as much detail as possible and attaching logs to illustrate. You probably need to involve Hadoop and network admins to help you with auditing/monitoring.

 

Also, describe how the users are going to work with the data, and from what tools/applications. That can affect the choice of file format, partitioning, etc.

Data never sleeps
Aarya
Calcite | Level 5

Hi LinusH,

Thank you for sharing the link. Datasets from SAS will be created as tables in Hive, and users are going to access them directly. The only criterion is: if a 10 GB SAS dataset takes 15 minutes to create a Hive table using the script mentioned in my post, I want to bring that down to 7-8 minutes; any method that does that will work for me.


