Solved: Re: Non Distributed Environment - SAS VA

sat_lr · Posted 03-11-2014 01:33 AM

What is the USP to pitch when setting up a non-distributed version of SAS VA? My customer is most interested in seeing the benefits of SAS VA over its competitors in terms of speed, accuracy, file storage, and the distribution system. I am sure most of you have come across the same situation many times. As the non-distributed version doesn't support Hadoop, I am looking for key things that can distinguish VA from its competitors.

LinusH · Posted 03-11-2014 10:39 AM

Since the infrastructure part of the distributed server environment is left out, only the in-memory server remains. Performance is still good, but that is not necessarily a USP: you could optimize traditional DBs/cubes for fast processing as well.

So, I would say the UI (functionality, look and feel) is the main USP here. How unique it is compared to the competition I can't say, but my guess is that the more advanced analysis/statistics features stand out.

Data never sleeps

sat_lr · Posted 03-11-2014 11:14 AM

Surely it is...

Is it the case that in the Non-distributed (non-co-located data provider) version of SAS VA, data files are stored in Hadoop but the HDFS and MapReduce features are not available?

In contrast, are all these features available in the Distributed (co-located data provider) version of VA?

LinusH · Posted 03-11-2014 11:18 AM

No, HDFS is not available here (unless you create your own separate Hadoop install and make it available to VA using SAS/ACCESS to Hadoop).

MapReduce is not used by VA to my knowledge - HDFS is just used to stream and bulk load data to the LASR server.

sat_lr · Posted 03-13-2014 03:44 AM

OK, so in this case (a non-co-located data provider, i.e. the version of VA without HDFS), how do files get stored in memory? In what format? With what kind of DB structure?

LinusH · Posted 03-13-2014 08:08 AM

I'm starting my first hands-on project in a couple of weeks, with a non-distributed edition, and will be able to answer these kinds of questions more precisely then.

But, I think that you read from any available data source via a SAS libname engine.

DavidHenderson · Posted 03-13-2014 11:31 AM

LinusH is correct.

For complete clarity...

With both Distributed and Non-Distributed, any data source that is available to SAS can be put directly into LASR. We frequently call that "streamed-in", as the data comes from a remote source, through the SAS workspace server, and into LASR. The result is an in-memory LASR table that was never written to disk.

The other option, with both Distributed and Non-Distributed, is to copy the data from any source that is available to SAS onto the LASR server prior to the load. On a distributed LASR server, you have the option of using the SASHDAT library engine to distribute the data across the cluster. The analogous approach on a non-distributed LASR server is to pull the data from your remote data source and write it into a SAS dataset on the LASR server. (SAS is always installed on that server as well, so this is no problem.) Once the data is local (whether on the Distributed system in HDFS or on the Non-Distributed system's disk as a SAS dataset), it is loaded into LASR memory.

The resulting table in LASR is the same and once loaded, LASR is capable of performing the same operations against it.
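
To make the two paths concrete, here is a toy Python model. It is not SAS code and says nothing about how LASR is implemented. The names `remote_rows`, `stream_in`, and `stage_then_load` are hypothetical, a CSV file stands in for the SAS dataset on the server's disk, and a plain list of dicts stands in for the in-memory table. It only illustrates that both paths end with the same table in memory.

```python
import csv
import os
import tempfile


def remote_rows():
    """Stand-in for any data source SAS can read (hypothetical sample data)."""
    yield {"id": "1", "region": "EMEA"}
    yield {"id": "2", "region": "APAC"}


def stream_in(source):
    """Path 1 (streamed-in): rows go straight into memory, never to disk."""
    return list(source)


def stage_then_load(source, path):
    """Path 2 (staged): copy rows to local disk first, then load that copy."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "region"])
        writer.writeheader()
        writer.writerows(source)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    streamed = stream_in(remote_rows())
    staged = stage_then_load(remote_rows(), os.path.join(tmp, "staged.csv"))

# Both paths produce the same in-memory rows.
print(streamed == staged)
```

The only difference between the two functions is the intermediate copy on disk; what ends up in memory is identical.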

sat_lr · Posted 03-14-2014 02:07 AM

Hi David,

Thanks for your brief reply; it was very useful. In a distributed environment, data can be distributed across the cluster using the SASHDAT library. I went through some of the documentation on the Hadoop website and, of course, the SAS website. I can see that the data is replicated across nodes when stored, which helps in many ways, such as load balancing and reducing query processing time (too many to mention here). Today, analytics vendors from small to large are moving towards in-memory architectures. In the non-distributed case, are there any key benefits the customer would get? I am looking for key differences that I can point to between SAS and other in-memory analytics vendors.

satlr

DavidHenderson · Posted 03-14-2014 10:40 AM

@sat_lr, as an R&D person, my expertise is in the technical details and implementation of our product; unfortunately, I am not that familiar with what our competitors offer. I have sent an email to a group of people who will be able to provide more information. Stay tuned.

sat_lr · Posted 03-18-2014 02:04 AM

Thanks David...appreciate it!

LinusH · Posted 03-18-2014 05:39 AM

Just to make it even clearer...

When describing the non-streamed-in option in a non-distributed environment, you mention SAS data sets. Are those standard Base engine data sets? Can you use any other engines here? Suppose we have SPDE/SPD Server installed on this node; it would make sense to use their multi-threaded I/O capabilities. The same could be true for other SAS/ACCESS engines as well.

DavidHenderson · Posted 03-18-2014 01:22 PM

Yes, I am talking about standard SAS datasets accessed via the base engine.

I discourage you from putting SPDE/SPD Server on the same machine as LASR. The two are likely to compete heavily for resources, so you are likely to have performance issues if they share a machine. As for other SAS/ACCESS engines... again, I would recommend that the data sources are NOT placed on the same machine.

Of course you will likely still need to use the SAS/ACCESS engines to pull the data to the SMP server. Once that is done, the choice is now yours... As I said, you can place it into datasets on that machine, or stream it directly into memory. Streaming it directly into memory clearly is faster than writing it to disk, just to read it back and put it into LASR memory. But if pulling data from the remote source is slow for whatever reason, it is sometimes good to have a local copy.
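
The trade-off can be put into rough numbers with a small Python sketch. This is a back-of-the-envelope model, not a measurement: the function names and all timings are hypothetical, and it assumes the same unchanged table has to be loaded into memory more than once (for example after a server restart), which the post does not state.

```python
def streamed_cost(loads, remote_pull):
    """Every load pulls from the remote source straight into memory."""
    return loads * remote_pull


def staged_cost(loads, remote_pull, disk_write, disk_read):
    """One remote pull and one disk write, then every load reads local disk."""
    return remote_pull + disk_write + loads * disk_read


# Hypothetical timings in seconds, chosen only for illustration.
REMOTE_PULL, DISK_WRITE, DISK_READ = 600.0, 120.0, 60.0

for loads in (1, 2, 5):
    s = streamed_cost(loads, REMOTE_PULL)
    t = staged_cost(loads, REMOTE_PULL, DISK_WRITE, DISK_READ)
    print(f"loads={loads}: streamed={s:.0f}s staged={t:.0f}s")
```

With these made-up numbers, a single load favors streaming, while the local copy pulls ahead from the second load onward because the slow remote pull is paid only once.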
