Hi,
Hope someone can clarify the following for me:
I would like to know what the process involves in moving from HDFS to Cloudera Hadoop from a VA point of view.
For example, if we have VA set up to use the HDFS file system and later decide to use the organisation's Cloudera Hadoop system, how is that accomplished, and is it a fairly straightforward process?
Would it require us to set up VA from scratch to use the organisation's CDH?
Thanks.
SAS LASR Analytic Server's use of HDFS deserves a bit of explanation here...
If your data is sitting in your corporate deployment of Hadoop, say Cloudera, and is used for non-LASR activities (such as Pig or Hive jobs), it isn't in a form that can be used directly by LASR. When you use SAS to move data onto the co-located HDFS that is used by LASR, the resulting data file (.SASHDAT) is very specific to what LASR needs. Once in that format, it isn't usable by other Hadoop processes, but it is very efficient to load those files into LASR because the data is laid out on disk exactly the same as it is in memory. LASR uses a Linux feature called memory mapping to map the locations on disk into RAM. The additional benefit is that if the OS needs to page that memory out, it doesn't have to write it to disk: it's already on disk and the OS knows where. So the OS can just drop it from memory during the page swap. When it comes time to swap back into memory, the OS picks it up from the original location on disk.
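To make the memory-mapping idea a bit more concrete, here is a minimal Python sketch. It is not LASR code and it does not use the SASHDAT format; the record layout (`RECORD`), the file name `table.bin`, and the helper functions are all made up for illustration. It only shows the general technique: if a file is written in the same binary layout the program uses in memory, the program can map the file and read values in place, with no parsing or copy step, and the OS is free to drop and re-read those pages from the file as needed.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: a 64-bit integer id and a 64-bit float value.
# "<" means little-endian with no padding, so every record is exactly 16 bytes.
RECORD = struct.Struct("<qd")


def write_table(path, rows):
    """Write rows to disk in the same binary layout they are read back in."""
    with open(path, "wb") as f:
        for row_id, value in rows:
            f.write(RECORD.pack(row_id, value))


def sum_values(path):
    """Map the file read-only and read each record directly from the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            total = 0.0
            for offset in range(0, len(mapped), RECORD.size):
                # No load or parse step: the bytes are interpreted where they sit.
                _, value = RECORD.unpack_from(mapped, offset)
                total += value
            return total


def demo():
    """Write a tiny three-row table to a temp file and sum it via the mapping."""
    path = os.path.join(tempfile.mkdtemp(), "table.bin")
    write_table(path, [(1, 1.5), (2, 2.5), (3, 4.0)])
    return sum_values(path)


if __name__ == "__main__":
    print(demo())
```

Because the mapping is read-only and backed by the file, the OS never has to write these pages out to swap; it can discard them and fault them back in from the original file, which is the benefit described above.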
If you want to, you can install LASR directly on your corporate deployment of Cloudera (or Apache Hadoop or HortonWorks or BigInsights, etc.). However, even if you do, you'll need to decide if you want to create SASHDAT files in addition to keeping your data in its traditional form for use by Hadoop processes. The main benefits are listed above. The drawbacks are additional disk usage (more than double in many cases) and the need to somehow keep the two data sources in sync.
I doubt that there are many (any?) use cases for this scenario.
You will probably get more feedback if you discuss this with a SAS VA product manager and/or VA/Hadoop specialists at SAS Professional Services.
Hey BStone,
There are a number of ways to do this. Please feel free to call Tech Support, as they can definitely assist you.
The question I have for you is: do you want CDH to be the co-located data provider? Or are you just interested in pulling data from CDH into VA?
If it's the first, you can reload the data into CDH's HDFS, and then you would have to change some config settings to point to CDH. Depending on which version of VA you have, it may be easier to upgrade and reconfigure.
If it's the second, then you just need SAS/ACCESS to Hadoop.
Again I suggest contacting Tech Support.
Cheers,
Justin
Hi BStone,
You are correct. SAS/ACCESS to Hadoop allows you to move data from a non-VA cluster into your VA environment.
Best wishes,
Jeff
Suppose you want to buy or use a car...
French cars give the best comfort
German cars are technically better
Italian cars have the nicest design
American cars have the muscle
Japanese cars have the best manufacturing
Would you assemble something on your own, or choose and use the one that is the best fit?
Why are we still thinking in terms of building from components in IT?
IMHO that is a sign that a certain maturity level has not been reached.
Two different things.
- SAS/ACCESS to Hadoop makes your CDH data available for SAS and VA data builder steps, etc.
- Then SAS stores the selected data into .SASHDAT files in the co-located HDFS. At this point the data is in the format LASR requires.
So the original data stored in CDH is not in the required format, and you import the data and transform it in SAS to the required format?
No, I am saying that data stored in a typical Hadoop format isn't in the format needed by LASR to map it directly into memory. SAS/ACCESS to Hadoop becomes the method through which that data is read from its form in CDH and then put into LASR (either directly into memory, using a parallel load or streamed in from the workspace server, or by writing it into LASR's HDFS as a SASHDAT file, which is then memory-mapped into memory).