
Your data is in Hadoop. Now what?

Started 12-08-2015 | Modified 01-19-2016

Where did your data originate? Was it in a flat file, a SAS dataset or a DBMS? SAS Data Loader for Hadoop uses both SAS and Hadoop technologies to integrate with the source repository, interrogate the source file layout, allow for optional configuration and transport the data to a target Hive table in Hadoop. This article, one of many on SAS Data Management for Hadoop, explains how SAS Data Loader for Hadoop transforms and cleanses your data, as well as its transport capabilities to and from the Hadoop cluster.
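Under the covers, SAS Data Loader drives this copy step through its point-and-click directives. For readers who prefer to see it in code, a rough equivalent using the SAS/ACCESS Interface to Hadoop might look like the sketch below; the server name, credentials, schema and table names are placeholders for illustration, not values from this article.

/* Assign a libref that points at Hive through the SAS/ACCESS          */
/* Interface to Hadoop. Connection values here are placeholders.       */
libname hivelib hadoop server="mycluster.example.com" port=10000
        user="sasdemo" password="XXXXXXXX" schema="sales";

/* Copy a SAS dataset from the local WORK library into a Hive table.   */
data hivelib.customer_orders;
   set work.customer_orders;
run;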

Once the data is in Hadoop, you can use the transformation capabilities of SAS Data Loader for Hadoop. The source data, though it may come from a mature and structured DBMS warehouse, will typically mingle with other types of data that may not be as mature or sophisticated. Maybe it didn’t attend the best boarding schools, or it got transported from the wrong side of the tracks. That unstructured or semi-structured data may still be relevant when enriching data for a customer’s use case; however, its current structure may not lend itself to easy use or to merging with other data.

Data transformation need not be limited to HiveQL or canned directives. With SAS Code Accelerator, you can write procedural code in DS2 (DATA Step 2) syntax. DS2 provides a rich environment for array processing and advanced expressions, and it contains over 290 functions: aggregate, character, date, mathematical, quantile and trigonometric, to name a few groupings. Need to parse a string of JSON data contained within a Hive table (or HDFS file)? No problem. Transposing a table for analytic exploration? No sweat.
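As a flavor of what that DS2 looks like, here is a minimal sketch that pulls two fields out of a small JSON string stored in a Hive table column. The hivelib libref, the web_events table and its payload column are assumptions made for illustration, and the ds2accel=yes option simply requests in-database execution through SAS Code Accelerator where it is licensed and configured.

/* Assume payload holds a small JSON string such as                    */
/*   {"state":"NC","amount":"125.50"}                                  */
proc ds2 ds2accel=yes;
   data hivelib.web_events_parsed / overwrite=yes;
      dcl char(2) state;
      dcl double  amount;
      method run();
         set hivelib.web_events;
         /* Crude field extraction with DS2 character functions:       */
         /* treat braces, quotes, colons and commas as delimiters,     */
         /* then pick tokens 2 and 4 (the state and amount values).    */
         state  = scan(payload, 2, '"{}:,');
         amount = inputn(scan(payload, 4, '"{}:,'), 'comma16.');
      end;
   enddata;
run;
quit;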

SAS Data Loader for Hadoop can help by applying quality directives through the SAS Data Quality Accelerator that will standardize the data, identify semantic types, parse a string of data, or apply a matchcode across one or more columns of data, to list a few features. Clean data is the difference between filtering on North Carolina and filtering on NC, nc, north Carolina, North Carolina, and nORTH cAROLINA. When data integrity goes unchecked, key observations can be missed and bad joins can occur. The data profiling functionality in SAS Data Loader for Hadoop can be a tremendous advantage in this instance, allowing users to check themselves before they wreck themselves.
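To make the North Carolina example concrete, here is a deliberately simplified DS2 stand-in for what a standardization step accomplishes: collapsing the many spellings of a state into one value so that filters and joins behave. In SAS Data Loader the real work is done in-cluster by the SAS Data Quality Accelerator against definitions in the SAS Quality Knowledge Base; the table and column names below are illustrative only.

proc ds2 ds2accel=yes;
   data hivelib.customers_std / overwrite=yes;
      dcl char(2) state_std;
      method run();
         set hivelib.customers;
         /* Normalize case and spacing, then map spellings to one code. */
         /* A QKB standardization definition handles far more variants; */
         /* this sketch only illustrates the idea.                      */
         if upcase(strip(state)) in ('NC', 'NORTH CAROLINA')
            then state_std = 'NC';
         else if upcase(strip(state)) in ('SC', 'SOUTH CAROLINA')
            then state_std = 'SC';
         else state_std = '';
      end;
   enddata;
run;
quit;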

Finally, once the data is transformed and cleansed, it’s ready for analytic exploration.

Follow the Data Management section of the SAS Communities Library for more articles on how SAS Data Management works with Hadoop.

 
