09-23-2016 11:37 AM - edited 09-23-2016 11:44 AM
As part of the curriculum for my MS in Analytics program, there was an assignment to discuss different aspects of data preprocessing. Since data preprocessing consumes a considerable amount of time in the analytics life cycle, I thought of sharing it with the SAS Community.
Different Aspects of Data preprocessing include:
Best Practices of data preprocessing:
Analysts work through "dirty data" quality issues in data mining projects, be it noisy (inaccurate), missing, incomplete, or inconsistent data. Before embarking on the data mining process, it is prudent to verify that the data is clean enough to meet organizational processes and clients' data quality expectations.
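As a concrete illustration of that verification step, here is a minimal sketch of a pre-mining data-quality audit in Python that counts the three kinds of "dirty data" mentioned above. The field names, validity rules, and records are entirely hypothetical.

```python
# Hypothetical sketch: a quick data-quality audit before mining.
# Counts missing, noisy (out-of-range), and inconsistent values.
records = [
    {"customer_id": 1, "age": 34,   "state": "MA"},
    {"customer_id": 2, "age": None, "state": "MA"},     # missing age
    {"customer_id": 3, "age": -5,   "state": "Mass."},  # noisy age, inconsistent state code
]

VALID_STATES = {"MA", "NY", "CA"}  # illustrative domain constraint

def audit(rows):
    issues = {"missing": 0, "noisy": 0, "inconsistent": 0}
    for row in rows:
        if row["age"] is None:
            issues["missing"] += 1
        elif not (0 <= row["age"] <= 120):
            issues["noisy"] += 1
        if row["state"] not in VALID_STATES:
            issues["inconsistent"] += 1
    return issues

print(audit(records))  # {'missing': 1, 'noisy': 1, 'inconsistent': 1}
```

An audit like this gives a quick baseline of how much cleaning lies ahead before any modeling begins.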
Kandel et al. (2011), in their research paper, discuss methodologies that could revolutionize how data wrangling is performed, calling data quality issues the "elephant in the room." Per Geiger (2004), poor data quality at the time cost American companies $600 billion annually, as quoted by The Data Warehousing Institute. Some authors estimate that the time and effort consumed by poor data quality accounts for 60-80% of data mining efforts.
Data Quality Improvement Practices:
The practices discussed in the Kandel et al. (2011) paper address:
Per Kandel et al. (2011), data transformations such as reformatting, extraction, outlier correction, and schema mapping are guided by visualization, much like the visualization of data quality issues themselves. Because data and schemas are constantly updated, transformations need continual editing and auditing. The output, the authors argue, is therefore not just transformed data but an editable, auditable description of the transformations applied, which improves repeatability and modification.
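The idea of an editable, auditable transformation description can be sketched as code that records each cleaning step in a log, so the pipeline can be inspected and replayed on refreshed data. This is only an illustration of the concept, not the authors' tool; the step names and sample rows are made up.

```python
# Sketch: record each transformation step so the pipeline is auditable.
transform_log = []

def logged(step_name):
    """Wrap a transformation so its application is recorded."""
    def decorator(fn):
        def wrapper(rows):
            transform_log.append(step_name)
            return fn(rows)
        return wrapper
    return decorator

@logged("trim_whitespace")
def trim_whitespace(rows):
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

@logged("cap_outliers")
def cap_outliers(rows, limit=100):
    # Illustrative outlier correction: cap 'amount' at a fixed limit.
    return [dict(r, amount=min(r["amount"], limit)) for r in rows]

data = [{"name": " Ann ", "amount": 250}, {"name": "Bob", "amount": 40}]
for step in (trim_whitespace, cap_outliers):
    data = step(data)

print(transform_log)  # ['trim_whitespace', 'cap_outliers']
print(data[0])        # {'name': 'Ann', 'amount': 100}
```

When the source data changes, rerunning the same logged steps reproduces the cleaning, and the log itself documents what was done.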
Corrected, or "clean," data from different sources and formats is then integrated using graphical schema-matching tools that combine algorithms with interactive tooling.
Per Han et al. (2012), data reduction is applied to gain efficiency and deliver faster analytical results. Strategies include dimensionality reduction (reducing the number of random variables), numerosity reduction (replacing the data with alternative, smaller representations), and data compression (transformations that produce a "compressed" version of the original data without losing information). The authors caution that the effort spent on data reduction should not exceed the benefit gained from it.
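Two of the numerosity-reduction ideas above, sampling and histograms, can be sketched with only the Python standard library. The dataset and bin width here are hypothetical.

```python
# Sketch of numerosity reduction on a hypothetical numeric attribute.
import random

random.seed(42)
population = [random.gauss(50, 10) for _ in range(10_000)]

# Numerosity reduction by sampling: keep a representative subset
# instead of every row.
sample = random.sample(population, 500)

# Numerosity reduction by histograms: replace raw values with bin
# counts, a much smaller representation at the cost of granularity.
def bin_counts(values, width=10):
    counts = {}
    for v in values:
        b = int(v // width) * width
        counts[b] = counts.get(b, 0) + 1
    return counts

print(len(sample))                 # 500 rows instead of 10,000
print(len(bin_counts(population))) # a handful of bins instead of 10,000 values
```

Per Han et al.'s caution, the choice of sample size or bin width is itself a trade-off: too aggressive a reduction can cost more in lost fidelity than it saves in compute time.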
Examples from my experience processing data in different databases
I have used the following databases to process data:
Microsoft Excel: I have noticed that when Excel detects a discrepancy between a cell and its adjacent columns or rows (a change in formula, for example), it highlights the cell with a green triangle; clicking the cell offers options to address the discrepancy, including correcting the error.
MS Access: In MS Access tables, data reduction was a key improvement for gaining efficiency and helped provide timely reports to leadership without losing accuracy. Incomplete data was subsequently updated, and reports were republished once the data was complete and accurate.
When the organization I was working for was acquired by a global conglomerate, storage capacity became a constraint with MS Access, and the corporation switched to SAP to address data quality issues arising from satellite databases and to establish a "one version of the truth" concept throughout the company of 68,000 employees.
Importance of descriptive statistics when preparing data for the data mining process
Per Han et al. (2012), knowledge about data descriptions, attributes, types (continuous or discrete), sources, quality, and their contexts provides a robust foundation for further data excavation. Since descriptive statistics, including visualization of the current data, are the first major step in data preparation, it is paramount to begin data mining with a sufficient and necessary understanding of the data.
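As a small illustration of that first step, a quick numeric summary often reveals quality problems before any mining begins. The sales figures below are hypothetical sample data.

```python
# First-look descriptive statistics on a hypothetical numeric attribute.
import statistics

sales = [120, 135, 150, 110, 980, 140, 125]  # note the suspicious 980

summary = {
    "n": len(sales),
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
    "min": min(sales),
    "max": max(sales),
}
print(summary)
```

Here the mean (about 251) sits far above the median (135), a quick signal of skew or a likely outlier worth investigating before modeling.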
Issues to consider during data integration.
Per Han et al. (2012), some of the major constraints in data integration are the entity identification problem, redundancy across sources, and the detection and resolution of data value conflicts.
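A data value conflict, one of the integration issues just mentioned, arises when the same entity carries different attribute values in two sources. A minimal sketch, with entirely hypothetical CRM and ERP records:

```python
# Hypothetical sketch: flag conflicting attribute values when
# integrating records for the same entity from two sources.
crm = {"C001": {"name": "Ann Lee", "city": "Boston"}}
erp = {"C001": {"name": "Ann Lee", "city": "Cambridge"}}

def find_conflicts(a, b):
    conflicts = []
    for key in a.keys() & b.keys():            # entities present in both sources
        for attr in a[key].keys() & b[key].keys():
            if a[key][attr] != b[key][attr]:
                conflicts.append((key, attr, a[key][attr], b[key][attr]))
    return conflicts

print(find_conflicts(crm, erp))  # [('C001', 'city', 'Boston', 'Cambridge')]
```

In practice, such conflicts still have to be resolved by a business rule (trusting one source, taking the most recent value, or escalating for review); the sketch only detects them.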
Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed., pp. 44-56). Waltham, MA: Morgan Kaufmann.
Kandel, S., Heer, J., Plaisant, C., Kennedy, J., van Ham, F., Riche, N. H., Weaver, C., Lee, B., Brodbeck, D., & Buono, P. (2011). Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization, 0(0), 1-18.