Where did your data originate? Was it in a flat file, a SAS data set or a DBMS? SAS Data Loader for Hadoop uses both SAS and Hadoop technologies to integrate with the source repository, interrogate the source file layout, allow for optional configuration and transport the data to a target Hive table in Hadoop. This article – one of many on SAS Data Management for Hadoop – explains how SAS Data Loader for Hadoop transforms and cleanses your data, as well as its capabilities for transporting data to and from the Hadoop cluster.
Once the data is in Hadoop, you can use the transformation capabilities of SAS Data Loader for Hadoop. The source data, though it may come from a mature and structured DBMS warehouse, will typically mingle with other types of data. This data may not be as mature or sophisticated. Maybe it didn’t attend the best boarding schools or got transported from the wrong side of the tracks. That unstructured or semi-structured data may still be relevant to a customer’s use case for enriching data; however, its current structure may not lend itself to easy use or to merging with other data.
Data transformation need not be limited to HiveQL or canned directives. With SAS Code Accelerator, you can write procedural code in SAS DATA Step 2 syntax (DS2 for short). DS2 provides a rich environment for array processing and advanced expressions. DS2 contains over 290 different functions: aggregate, character, date, mathematical, quantile and trigonometric, to name a few groupings. Need to parse a string of JSON data contained within a Hive table (or HDFS file)? No problem. Transposing a table for analytic exploration? No sweat.
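As a flavor of what that procedural code looks like, here is a minimal DS2 sketch of a transformation that Code Accelerator can push down into the cluster. The libref setup and the table and column names (hive.customers_raw, first_name, signup_date and so on) are hypothetical placeholders for your own Hive schema:

```sas
proc ds2;
   data hive.customers_clean (overwrite=yes);
      dcl varchar(80) full_name;
      dcl double      signup_yr;
      method run();
         set hive.customers_raw;    /* assumed source Hive table */
         /* character and date functions from the DS2 function library */
         full_name = catx(' ', strip(first_name), strip(last_name));
         signup_yr = year(signup_date);
         output;
      end;
   enddata;
   run;
quit;
```

Because the logic lives in a DS2 data program, the same code can run in parallel inside Hadoop when the Code Accelerator is available, rather than pulling rows back to the SAS server.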
SAS Data Loader for Hadoop can help by applying quality directives through the SAS Data Quality Accelerator that will standardize the data, identify semantic types, parse a string of data, or apply a matchcode across one or more columns of data, to list a few features. Clean data is the difference between filtering on North Carolina and filtering on NC, nc, north Carolina, North Carolina, and nORTH cAROLINA. With dirty data, key observations can be missed and bad joins can occur if data integrity is left unchecked. The data profiling functionality in SAS Data Loader for Hadoop can be a tremendous advantage in this instance, allowing users to check themselves before they wreck themselves.
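To make that concrete, the sketch below shows the style of DS2 call the Data Quality Accelerator exposes. dqStandardize and dqMatch are SAS data quality functions, but the table and column names, the definition names ('State/Province (Abbreviation)', 'Name'), the sensitivity value of 85 and the ENUSA locale are all assumptions here; the definitions actually available depend on the Quality Knowledge Base deployed on your cluster:

```sas
proc ds2;
   data hive.customers_std (overwrite=yes);
      dcl varchar(20)  state_std;
      dcl varchar(255) name_mc;
      method run();
         set hive.customers_clean;    /* assumed input Hive table */
         /* collapse NC, nc, nORTH cAROLINA, ... into one standard value */
         state_std = dqStandardize(state, 'State/Province (Abbreviation)', 'ENUSA');
         /* matchcode supporting fuzzy joins and de-duplication */
         name_mc = dqMatch(full_name, 'Name', 85, 'ENUSA');
         output;
      end;
   enddata;
   run;
quit;
```

The matchcode column can then drive joins or cluster-based de-duplication, so that near-duplicate names land on the same key even when the raw strings differ.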
Finally, once the data is transformed and cleansed, it’s ready for analytic exploration.
Follow the Data Management section of the SAS Communities Library for more articles on how SAS Data Management works with Hadoop. Here are links to other posts in the series for reference: