
Give analysts 80% of their time back


A young character named Ferris Bueller once said, “Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.”  That's true of life, and it's just as true of what's happening in the data management space for Hadoop.  You don’t need to skip your Hadoop certification class to appreciate how extraordinarily fast this industry is moving, but if you slow down just a bit, you can get a glimpse of what’s happening.

 

Let’s talk about three key elements that drive data management for Hadoop. First, I apologize in advance, but I have to say the phrase that is all too familiar… anyone, anyone…  “big data.”  New data paradigms are exploding and driving changes in data management practices. For most companies, big data is simply the reality of doing business: the proliferation of structured and unstructured data that floods organizations daily and that, if managed well, can deliver powerful insights.

 

Second, new ways of thinking about analytic design are emerging. Driven in part by the millennial generation and a gaming mentality, design means using every available tool to experiment, innovate, and create new techniques and approaches to data and analytics, and to refine the art of data-driven decision making.

 

Third, analytic deployment is the mature analytic framework that places significant value on putting the analytic process into production. Design is cool and necessary for innovation, but creative concepts need to be turned into cost savings, profit, or risk mitigation before they add real value to the organization.

When you combine the art of analytic design with the discipline of deployment, and fuel both with massive amounts of complex data, you get the new analytics culture.  As the analytic needs of this culture grow and change, so do its data management needs.

 

Analytic data prep vs. data warehousing

My colleagues and I see data preparation for the new analytics culture as distinctly different from traditional data warehousing. Data warehousing techniques, and many of the tools that support them, are designed to conform data into standard schemas that are well organized and optimized for building efficient queries. The tools and processes are designed for the back office, used by data management specialists, for the purpose of handing a finished dataset to analytic and reporting users.

Unfortunately, this process falls short of providing what the end user really wants, and it ultimately forces a scarce resource to perform all kinds of pre-analytic data management magic just to do their job. In fact, it’s commonly understood that 80% of a statistician’s time is spent preparing the data, and then reworking it as they move through the analytic lifecycle. This disconnect between the people and the technology is worth a look. In particular, it comes with the following challenges:

    • Wide table = good, star schema = bad. Analytic work requires very wide, very detailed tables, often with hundreds or even thousands of variables. Transposition is the statistician’s friend, and pre-aggregation equals pre-determined statistics. Data doesn’t usually come out of the warehouse this way (see the sketch after this list).
    • Do over. Analytic work is iterative. When data management tools are the exclusive domain of IT, and cumbersome business processes stand between analysts and modified datasets, analytic resources are forced to take matters into their own hands.
    • Not all quality is the same. De-duplicating data or matching addresses can be important for general data quality. But analytic teams spend enormous amounts of time developing their own algorithms for analytic data preparation: gender matching, parsing, match coding, imputation, and pattern matching techniques that enrich data for analytics (imputation also appears in the sketch below).
    • The final step. Feeding data into high-performance analytic systems is work often left to the analytic people, and it can be one of the more difficult tasks when the data management work isn’t tightly coupled with the analytic platforms, either physically or through common metadata.
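To make the first and third points concrete, here is a minimal sketch of transposing and imputing data before modeling. It uses Python and pandas purely for illustration; the column names and sample values are hypothetical, and this is not how any SAS tool implements these steps.

    # A minimal sketch of two common pre-analytic steps, using pandas
    # for illustration. Column names and values below are hypothetical.
    import pandas as pd

    # Long, warehouse-style data: one row per (customer, attribute) pair.
    long_df = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 3],
        "attribute":   ["age", "income", "age", "income", "age"],
        "value":       [34.0, 52000.0, 41.0, None, 29.0],
    })

    # Transpose to the wide, one-row-per-customer shape that modeling
    # tools expect (hundreds of columns in real projects).
    wide_df = long_df.pivot(index="customer_id",
                            columns="attribute",
                            values="value")

    # Simple mean imputation for a missing value -- one of the analytic
    # data-quality steps described above.
    wide_df["income"] = wide_df["income"].fillna(wide_df["income"].mean())

    print(wide_df)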

 

How to simplify data management in Hadoop

One SAS technology that can help give back some of this 80% of lost time to the new analytics culture is SAS® Data Loader for Hadoop. This easy-to-use, self-service tool works inside the Hadoop platform to enable:

    • data movement to and from source systems
    • data quality
    • data profiling
    • data transformation
    • data loading into our in-memory analytic platform

With sophisticated data management capabilities available to both the design and deployment cultures, analytic people can spend more time developing innovative models and less time wrangling their data, all inside the Hadoop platform.
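As a rough illustration of what one of these capabilities, data profiling, produces, here is a small Python sketch that summarizes each column of a table. Pandas stands in for what Data Loader does at Hadoop scale; the function name and sample data are my own, not part of the product.

    # A hypothetical profiling helper: per-column type, non-null count,
    # percentage missing, and distinct-value count -- the kind of
    # summary a data profiling step reports.
    import pandas as pd

    def profile(df: pd.DataFrame) -> pd.DataFrame:
        """Return one profiling row per column of df."""
        return pd.DataFrame({
            "dtype":       df.dtypes.astype(str),
            "non_null":    df.count(),
            "missing_pct": (df.isna().mean() * 100).round(1),
            "distinct":    df.nunique(),
        })

    # Tiny example table (hypothetical values).
    df = pd.DataFrame({"gender": ["F", "M", None, "F"],
                       "income": [52000, 61000, 48000, None]})
    print(profile(df))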

 

The world of analytic data management is moving pretty fast.  I can’t promise you’ll earn a day off by giving analytic teams more time to focus on modeling (by simplifying their data management processes), but it will certainly make you a hero!

 

Take this Ferrari for a spin by visiting the SAS® Data Loader for Hadoop web page.

 

Also, follow the Data Management section of the SAS Communities Library (click Subscribe in the pink-shaded bar of the section) for more articles on how SAS Data Management works with Hadoop.

 
