BStone
Obsidian | Level 7

Hi,

Hope someone can clarify the following for me:

I would like to know what the process involves in moving from HDFS to Cloudera Hadoop from a VA point of view.

For example, if we have VA set up to use the HDFS file system and later decide to use the organisation's Cloudera Hadoop system, how is that accomplished, and is it a fairly straightforward process?

Would it require us to set up VA from scratch to use the organisation's CDH?

Thanks.


9 REPLIES
LinusH
Tourmaline | Level 20

I doubt that there are many (any?) use cases for this scenario.

You'll probably get more feedback if you discuss this with a SAS VA product manager and/or VA/Hadoop specialists at SAS Professional Services.

Data never sleeps
justin_sas
SAS Employee

Hey BStone,

There are a number of ways to do this; please feel free to call Tech Support, as they can definitely assist you.

The question I have for you is: do you want CDH to be the co-located data provider, or are you just interested in pulling data from CDH into VA?

If it's the first, you can reload the data into CDH's HDFS, and then you would have to change some configuration settings to point to CDH. Depending on what version of VA you have, it may be easier to upgrade and reconfigure.

If it's the second, then you just need SAS/ACCESS to Hadoop.
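
Roughly, that second option looks like this. This is only a minimal sketch: the host, port, schema, credentials, and table name are placeholders for your own CDH environment, not values from this thread.

/* Placeholder connection details -- substitute your own CDH Hive node */
libname cdh hadoop server="cdh-hive.example.com" port=10000
        schema=default user=myuser password=mypass;

/* CDH Hive tables now behave like SAS data sets */
proc print data=cdh.sales_raw(obs=5);  /* sales_raw is a hypothetical table */
run;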

Again I suggest contacting Tech Support.

Cheers,

Justin

BStone
Obsidian | Level 7

Thanks for your answers.

It is not confirmed yet, but it will most likely be querying data from CDH into VA. I guess SAS/ACCESS to Hadoop will be the answer for achieving this.

Cheers.

JBailey
Barite | Level 11

Hi BStone,

You are correct. SAS/ACCESS to Hadoop allows you to move data from a non-VA cluster into your VA environment.

Best wishes,

Jeff

jakarman
Barite | Level 11

Suppose you wanted to buy/use a car....

French cars give the best comfort

German cars are technically better

Italians have the nicest design

Americans for the muscle

Japanese for the manufacturing quality

Would you compose something of your own, or choose and use one according to best fit?

Why are we still thinking in terms of building by components in IT?

IMHO it is a signal of not having reached a certain maturity level.

---->-- ja karman --<-----
DavidHenderson
SAS Employee
(Accepted Solution)

SAS LASR Analytic Server's use of HDFS deserves a bit of explanation here...

If your data is sitting in your corporate deployment of Hadoop, say Cloudera, and is used for non-LASR activities (such as Pig or Hive jobs), it isn't in a form that can be used directly by LASR.  When you use SAS to move data onto the co-located HDFS that is used by LASR, the resulting data file (.SASHDAT) is very specific to what LASR needs.  Once in that format, it isn't usable by other Hadoop processes, but it is very efficient to load those files into LASR because the data is laid out on disk exactly the same as it is in memory.

LASR uses a Linux mechanism called memory mapping to map the locations on disk into RAM.  The additional benefit is that if the OS needs to page that memory out, it doesn't have to write it to disk -- it's already on disk, and the OS knows where.  So the OS can just drop it from memory during the page swap.  When it comes time to swap back into memory, the OS picks it up from the original location on disk.
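
For illustration, creating one of those .SASHDAT files from SAS looks roughly like this. It is only a sketch: the server, install location, and HDFS path are placeholder values for a LASR deployment, and sashelp.cars stands in for your own table.

/* Placeholders: point these at your LASR root node and TKGrid install */
libname hdat sashdat server="lasr-head.example.com"
        install="/opt/TKGrid" path="/user/lasr/data";

data hdat.cars;      /* writes /user/lasr/data/cars.sashdat across the nodes */
   set sashelp.cars; /* any SAS data set; sashelp.cars is just a sample */
run;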

If you want to, you can install LASR directly on your corporate deployment of Cloudera (or Apache Hadoop, Hortonworks, BigInsights, etc.).  However, even if you do, you'll need to decide whether you want to create SASHDAT files in addition to keeping your data in its traditional form for use by Hadoop processes.  The main benefits are listed above.  The drawbacks are additional disk usage (more than double in many cases) and the need to somehow keep the two data sources in sync.

BStone
Obsidian | Level 7

Thanks for that insight.

However, installing LASR on top of CDH will not be an option in this case.

Just to clarify, are you saying that data stored in CDH may not be in the required format for LASR, even when using SAS/ACCESS to Hadoop?

LinusH
Tourmaline | Level 20

These are two different things.

- SAS/ACCESS to Hadoop makes your CDH data available to SAS and to VA data builder steps, etc.

- Then SAS stores the selected data into .SASHDAT files in the co-located HDFS. At that point the data is in the format LASR requires.

So the original data stored in CDH is not in the required format; you import the data and transform it in SAS into the required format.
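
As a rough sketch of those two steps together (all host names, paths, and table names here are placeholders, not values from this thread):

/* Step 1: reach the CDH data via SAS/ACCESS to Hadoop (placeholder host) */
libname cdh hadoop server="cdh-hive.example.com" port=10000 schema=default;

/* Step 2: land the selected rows as a .SASHDAT file in the co-located HDFS */
libname hdat sashdat server="lasr-head.example.com"
        install="/opt/TKGrid" path="/user/lasr/data";

data hdat.sales;       /* now in the on-disk format that LASR memory-maps */
   set cdh.sales_raw;  /* hypothetical CDH table read through SAS/ACCESS */
run;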

Data never sleeps
DavidHenderson
SAS Employee

No, I am saying that data stored in a typical Hadoop format isn't the format needed by LASR to map it directly into memory.  SAS/ACCESS to Hadoop becomes the method through which that data is read from its form in CDH and then put into LASR (either directly into memory, using a parallel load or streamed in from the workspace server, or by writing it into LASR's HDFS as a SASHDAT file that is then memory-mapped into memory).
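
As a sketch of that first path, loading a CDH table straight into a running LASR server looks something like this. The port, host, and libref are placeholders, and whether this runs as a true parallel load or streams through the workspace server depends on how your environment is configured.

/* cdh is a hypothetical SAS/ACCESS to Hadoop libref; 10010 is a sample port */
proc lasr add data=cdh.sales_raw port=10010;
   performance host="lasr-head.example.com" nodes=all;
run;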

