Effective Use of the Data Builder for Data Preparation - Opinions Appr...

Scott_Mitchell · Posted 08-03-2017 05:47 AM

Good Afternoon Community Members,

I am seeking opinions on the use of the Data Builder for data preparation.

There seems to be 2 schools of thought on the matter, based on developers I have discussed this with recently. Below are their positions as well as the pros and cons as they see it.

Data preparation should always be undertaken prior to data being loaded into the LASR server.

Pros

Trimming datasets of extraneous columns and data means that the table is taking up less memory on the LASR server.
Fewer processes undertaken on the LASR server means better performance for all end users and allows us to "get more done on the same budget".

Cons

VA is a self-serve, business focused reporting and analytic tool which needs to remain responsive to the businesses needs. Having to request I.T add a variable to a table(not to mention the lead time, testing and documentation required) output to the autoloader detracts from the businesses ability to remain "agile" and "responsive". I need to make the change now not in a fortnight.
Resources with the appropriate skillset need to be available to perform transformations, subsets and aggregations.

Data preparation should be undertaken within the data preparation tool unless entirely necessary.

Pros

Making changes to the existing dataset is simple using the intuitive interface. If I need to add a variable to my query I can do so in a short period of time, make the relevant change to my report and the business can consume the data within the same day (obviously an over simplification, but you get the point).
During the VA training course, the SAS trainer encouraged us to make use of the Data Query to perform our data preparation, that is what it is there for.
Memory is relatively cheap compared to processing power. Although we are using a finite amount of memory, the benefits of being able to return the end users report in a timely manner outweighs the cost associated with purchasing additional resources to meet existing needs.

Cons

The business users don't have the technical expertise to work with a (perceived) I.T. tool. They will fail to document the changes, test their outcomes or build sufficient rigour into their processes.
The resulting queries cannot be scheduled to occur in line with when the autoload process of the underlying table takes place. I can schedule for the query to be execute every day at Xam/pm or multiple times per day, but the user could access the report outside of these windows and make decisions based on old data. Read More...
Using the data query creates a second table with the LASR server with some duplicated information. This will cause us to run out of memory.

JuanS_OCS · Posted 08-03-2017 08:44 AM

Hello @Scott_Mitchell,

both points of view are correct 🙂 And if you give a closer look, it is the same question as with most tools which intend to provide high performance to the users (SAS and non-SAS). The first one can be related to technical teams (administrators, installers, architects) and the other one is closer to users and marketing.

The solution to your question is both and none, but "managing expectations" and "risk management". Let me explain.

SAS VA has come to the market as a quite innovative tool in the SAS scenario, because it provides high-performance but also easiness of usage. But of course, careful with it, because as in any other similar tool, what you will need is what I mentioned above by:

- Making a separation between what you can be in control (batch jobs/queries and interative sessions from power users), and what you cannot be in control that easily (interactive sessions from non-SAS-expert users). From the first group you can be incontrol of usage of resources and performance. From the second group, not that much and you would need to consider the risks of any self-service service.

- This separation can be done by splitting the SAS Application Servers (SASApps) and the LASR servers, and granting/denying permissions on the SAS metadata. On each of them you can separate who can access, how many resources, even logging options.

If you ask me, I am more into the first group. As much data preparation to be done upfront, filtering all the data-lake, DWH, all data origins, and loading the necessary data for the users and reports into LASR/memory. No more, no less. With this, which we can consider what is loaded into LASR as the VA datamart, just a view of all the data from your company, a small fraction.

Then, the users will want to load their own data, or create their own queries with self-service. That is OK, but my advice for you is to do it on a different LASR server, allocating and limiting resources, and communicating to all your users about the architecture and capabilities.

As long as all have what they want, you will have happy users 🙂

LinusH · Posted 08-04-2017 10:00 AM

Since both these alternatives are declared as absolute, I don't agree with any of them.
VA itself hosts several types of use cases. Primarily enterprise wide reporting (good ol' BI), and explorative analytics.
For the first use case, data should be fed from a trusted and quality assured source - a data warehouse. Then it makes sense to use a standard/the same ETL for the whole data lifecycle. Like DI Studio if you are a 100% SAS shop.
But for other use cases when there are more of ad hoc analysis, you don't want to force users to specify their requirements on beforehand. In this scenario, i can't see why a self service approach shouldn't be acceptable. But of course, the environment needs a dialogue between users and the maintenance team, and guidelines so that it can eveolve in a cotrolled manner.

Data never sleeps

Effective Use of the Data Builder for Data Preparation - Opinions Appreciated

Re: Effective Use of the Data Builder for Data Preparation - Opinions Appreciated

Re: Effective Use of the Data Builder for Data Preparation - Opinions Appreciated

Effective Use of the Data Builder for Data Preparation - Opinions Appreciated

Re: Effective Use of the Data Builder for Data Preparation - Opinions Appreciated

Re: Effective Use of the Data Builder for Data Preparation - Opinions Appreciated

Registration is open