BStone
Obsidian | Level 7

Hi,

Hope someone can clarify the following for me:

I would like to know what the process involves in moving from HDFS to Cloudera Hadoop from a VA point of view.

For example, if we have VA set up to use the HDFS file system and later decide to use the organisation's Cloudera Hadoop system, how is that accomplished, and is it a fairly straightforward process?

Would it require us to set up VA from scratch to use the organisation's CDH?

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
DavidHenderson
SAS Employee

SAS LASR Analytic Server's use of HDFS deserves a bit of explanation here...

If your data is sitting in your corporate deployment of Hadoop, say Cloudera, and is used for non-LASR activities (such as Pig or Hive jobs), it isn't in a form that can be used directly by LASR. When you use SAS to move data onto the co-located HDFS that is used by LASR, the resulting data file (.SASHDAT) is very specific to what LASR needs. Once in that format, it isn't usable by other Hadoop processes, but it is very efficient to load those files into LASR because the data is laid out on disk exactly as it is in memory. LASR uses a Linux mechanism called memory mapping to map the locations on disk into RAM. The additional benefit is that if the OS needs to page that memory out, it doesn't have to write it to disk; it's already on disk and the OS knows where. So the OS can just drop it from memory during the page swap. When it comes time to swap back into memory, the OS picks it up from the original location on disk.
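To make the memory-mapping idea concrete, here is a minimal Python sketch. It does not read or write SASHDAT files and is not how LASR is implemented; the 16-byte row layout and the names `ROW`, `write_table`, `sum_values` and `demo` are made up for illustration. It only shows the general technique: when the on-disk layout matches the in-memory layout, a file can be mapped read-only and used in place, and because the mapped pages are never modified the OS can drop them and re-read them from the file later.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width row: one 64-bit integer id and one 64-bit float value.
ROW = struct.Struct("<qd")


def write_table(path, rows):
    """Write rows to disk in the same binary layout that is later used in memory."""
    with open(path, "wb") as f:
        for row_id, value in rows:
            f.write(ROW.pack(row_id, value))


def sum_values(path):
    """Map the file read-only and sum the value column straight from the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            total = 0.0
            # Walk the mapping one row at a time; pages are faulted in on demand.
            for offset in range(0, len(mapped), ROW.size):
                _, value = ROW.unpack_from(mapped, offset)
                total += value
            return total


def demo():
    """Write a tiny table to a temporary file, map it, and return the column sum."""
    rows = [(1, 10.0), (2, 20.5), (3, 30.25)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_table(path, rows)
        return sum_values(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The mapping is opened with `ACCESS_READ`, so its pages stay identical to the file contents, which is the property that lets the OS discard them under memory pressure instead of writing them to swap.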

If you want to, you can install LASR directly on your corporate deployment of Cloudera (or Apache Hadoop or Hortonworks or BigInsights, etc.). However, even if you do, you'll need to decide whether you want to create SASHDAT files in addition to keeping your data in its traditional form for use by Hadoop processes. The main benefits are listed above. The drawbacks are additional disk usage (more than double in many cases) and the need to somehow keep the two data sources in sync.


9 REPLIES 9
LinusH
Tourmaline | Level 20

I doubt that there are many (any?) use cases for this scenario.

You will probably get more feedback if you discuss this with a SAS VA product manager and/or VA/Hadoop specialists at SAS Professional Services.

Data never sleeps
justin_sas
SAS Employee

Hey BStone,

There are a number of ways to do this. Please feel free to call Tech Support, as they can definitely assist you.

The question I have for you is: do you want CDH to be the co-located data provider? Or are you just interested in pulling data from CDH into VA?

If it's the first, you can reload the data into CDH's HDFS; then you would have to change some config settings to point to CDH. Depending on which version of VA you have, it may be easier to upgrade and reconfigure.

If it's the second, then you just need SAS/ACCESS to Hadoop.

Again I suggest contacting Tech Support.

Cheers,

Justin

BStone
Obsidian | Level 7

Thanks for your answers.

It is not confirmed yet, but it will most likely be querying data from CDH into VA, so I guess SAS/ACCESS to Hadoop will be the answer for achieving this.

Cheers.

JBailey
Barite | Level 11

Hi BStone,

You are correct. SAS/ACCESS to Hadoop allows you to move data from a non-VA cluster into your VA environment.

Best wishes,

Jeff

jakarman
Barite | Level 11

Suppose you would want to buy/use a car....

French cars give the best comfort

German cars are technically better

Italian cars have the nicest design

American cars have the muscle

Japanese cars have the best manufacturing

Would you compose something on your own, or choose and use the one that fits best?

Why are we still thinking about building from components in IT?

IMHO it is a sign of not having reached a certain maturity level.

---->-- ja karman --<-----
BStone
Obsidian | Level 7

Thanks for that insight.

However, installing LASR on top of CDH will not be an option in this case.

Just to clarify, are you saying that data stored in CDH may not be in the required format for LASR - even when using SAS/ACCESS to Hadoop?

LinusH
Tourmaline | Level 20

Two different things.

- SAS/ACCESS to Hadoop makes your CDH data available to SAS, VA data builder steps, etc.

- Then SAS stores the selected data in .SASHDAT files in the co-located HDFS. At this point the data is in the format LASR requires.

So the original data stored in CDH is not in the required format; you import the data and transform it in SAS to the required format.

Data never sleeps
DavidHenderson
SAS Employee

No, I am saying that data stored in a typical Hadoop format isn't in the format LASR needs to map it directly into memory. SAS/ACCESS to Hadoop is the method through which that data is read from its form in CDH and then put into LASR, either directly into memory (using a parallel load, or streamed in from the workspace server) or by writing it into LASR's HDFS as a SASHDAT file that is then memory-mapped.
