
Data Preprocessing

Occasional Contributor
Posts: 16


As part of my coursework for an MS in Analytics program, I had an assignment to discuss different aspects of data preprocessing. Because data preprocessing consumes a considerable amount of time in the analytics life cycle, I thought I would share it with the SAS Community.


Different Aspects of Data preprocessing include:


Best Practices of data preprocessing:


Analysts work through “dirty data” quality issues in data mining projects, whether the data is noisy (inaccurate), missing, incomplete, or inconsistent. Before embarking on the data mining process, it is prudent to verify that the data is clean enough to meet organizational processes and clients’ data quality expectations.


Kandel et al. (2011) discuss methodologies that could revolutionize how data wrangling is performed, and they call data quality issues the “elephant in the room”. Per Geiger (2004), poor data quality at the time cost American companies $600 billion annually, a figure attributed to the Data Warehousing Institute. Some authors estimate that the time and effort consumed by poor data quality accounts for 60-80% of the total data mining effort.


Data Quality Improvement Practices:


The practices discussed in Kandel et al. (2011) address:


  • Effective data wrangling with new interactive systems that combine data verification, data transformation, and data visualization;
  • Visual encodings that represent missing data, for example by showing gaps in a line chart;
  • The role of interactive visualization in devising data transformation specifications;
  • Visualization of uncertainty in geospatial analysis with a statistical model;
  • Recording data sources and sharing results socially, which leads to widespread reuse, checking, and revision of data transformations.
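To make the missing-data bullet concrete, here is a minimal Python sketch of the idea behind drawing gaps in a line chart: the series is split into runs of consecutive non-missing points, and each run would be drawn as its own line segment. The function name `split_at_gaps` and the sample readings are my own illustrations, not something taken from the Kandel et al. paper.

```python
from typing import List, Optional, Tuple


def split_at_gaps(values: List[Optional[float]]) -> List[List[Tuple[int, float]]]:
    """Split a series into runs of consecutive non-missing (index, value) points."""
    segments: List[List[Tuple[int, float]]] = []
    current: List[Tuple[int, float]] = []
    for index, value in enumerate(values):
        if value is None:
            # A missing value closes the current run, leaving a visible gap.
            if current:
                segments.append(current)
                current = []
        else:
            current.append((index, value))
    if current:
        segments.append(current)
    return segments


# Made-up sample readings; None marks a missing observation.
readings = [3.0, 3.5, None, None, 4.1, 4.0, None, 5.2]
segments = split_at_gaps(readings)
```

A plotting tool would then draw one line per segment instead of connecting across the missing points, so the reader can see where data is absent rather than being shown an interpolated value.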

Data Transformation:


Per Kandel et al. (2011), data transformations such as reformatting, extraction, outlier correction, and schema mapping go hand in hand with visualizations of data quality issues. Because data and schemas are constantly updated, data transformations need constant editing and auditing. The output, the authors argue, is therefore not just the transformed data but an editable and auditable description of the transformations applied, which improves repeatability and makes modification easier.
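One way to picture an “editable and auditable description” is to keep the transformation steps as plain data and replay them over the raw rows. The sketch below is only an illustration under my own assumptions: the operation names (`strip`, `upper`), the `OPERATIONS` registry, and the sample rows are hypothetical and are not the format used by the authors.

```python
import json
from typing import Callable, Dict, List

Row = Dict[str, str]


def op_strip(row: Row, column: str) -> Row:
    """Return a copy of the row with surrounding whitespace removed from one column."""
    return {**row, column: row[column].strip()}


def op_upper(row: Row, column: str) -> Row:
    """Return a copy of the row with one column converted to upper case."""
    return {**row, column: row[column].upper()}


# Hypothetical registry that maps step names in the script to functions.
OPERATIONS: Dict[str, Callable[[Row, str], Row]] = {
    "strip": op_strip,
    "upper": op_upper,
}


def apply_script(rows: List[Row], script: List[Dict[str, str]]) -> List[Row]:
    """Replay every step of the script, in order, without changing the input rows."""
    for step in script:
        operation = OPERATIONS[step["op"]]
        rows = [operation(row, step["column"]) for row in rows]
    return rows


# The script is plain data, so it can be reviewed, edited, and stored with the output.
script = [
    {"op": "strip", "column": "state"},
    {"op": "upper", "column": "state"},
]
raw_rows = [{"state": " nc "}, {"state": "Tx"}]  # made-up sample rows
clean_rows = apply_script(raw_rows, script)
script_text = json.dumps(script)  # the audit trail that travels with the data
```

Because the raw rows are left untouched and the script is stored as text, the same cleaning can be rerun when the data is refreshed, and a reviewer can see exactly which steps were applied.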


Data Integration:

Corrected or “clean” data from different sources and of different types is integrated using graphical tools for schema matching, supported by algorithms and interactive tool development.


Data Reduction:

Per Han et al. (2012), data reduction is applied to gain efficiency and deliver analytical results faster. Strategies include dimensionality reduction (reducing the number of random variables under consideration), numerosity reduction (replacing the data with alternative, smaller representations), and data compression (applying transformations to obtain a “compressed” version of the original data without losing information from the original data). The authors caution that the effort spent on data reduction should not exceed the benefits gained from it.
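As a small illustration of numerosity reduction, the Python sketch below replaces a list of raw values with an equal-width histogram, that is, a handful of bin counts instead of every observation. The function name, the bin width of 10, and the price values are assumptions I made for the example; they are not taken from Han et al.

```python
from collections import Counter
from typing import Dict, List, Tuple


def equal_width_histogram(values: List[int], bin_width: int) -> Dict[Tuple[int, int], int]:
    """Count how many values fall into each half-open bin [lower, lower + bin_width)."""
    counts: Counter = Counter()
    for value in values:
        lower = (value // bin_width) * bin_width
        counts[(lower, lower + bin_width)] += 1
    return dict(sorted(counts.items()))


# Made-up integer prices; integers keep the bin arithmetic exact.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 25, 28, 30]
histogram = equal_width_histogram(prices, bin_width=10)
```

Here 16 raw values shrink to four bin counts. The individual values can no longer be recovered, which is the trade-off the caution above is about: the reduction is only worthwhile if the smaller form still answers the analytical question.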


Examples of my experience in processing data from different databases.


I have used the following databases to process data:


Microsoft Excel: When data in adjacent columns or rows is inconsistent (for example, a formula that differs from its neighbors), Excel flags the cell with a green triangle. Clicking the flag offers options for addressing the discrepancy, including correcting the error.


MS Access: In MS Access tables, data reduction was a key improvement. It made processing more efficient and helped deliver timely reports to leadership without losing accuracy. Incomplete data was later updated, and the reports were republished once the data was complete and accurate.


After the organization I worked for was acquired by a global conglomerate, data storage capacity became a constraint with MS Access. The corporate organization switched to SAP to address data quality issues from satellite databases and to establish a “one version of the truth” concept throughout the company of 68,000 employees.


Importance of descriptive statistics when preparing data for the data mining process

Per Han et al. (2012), knowledge of data descriptions, data attributes, data types (continuous or discrete), data sources, data quality, and their contexts provides a robust foundation for further data exploration. Because descriptive statistics, including visualization of the current data, are the first major step in data preparation, it is essential to begin data mining with a sufficient understanding of the data.
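For readers who want to try this first step, here is a minimal Python sketch that summarizes one numeric attribute with the standard `statistics` module. The `ages` values are made up for illustration, and the choice of measures (central tendency plus spread) is my own selection rather than a list prescribed by Han et al.

```python
import statistics

# Made-up sample of one numeric attribute.
ages = [23, 25, 25, 31, 38, 41, 44, 52, 67]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),      # central tendency
    "median": statistics.median(ages),  # robust to extreme values
    "mode": statistics.mode(ages),      # most frequent value
    "stdev": statistics.stdev(ages),    # sample standard deviation (spread)
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median is a quick check for skew, and the minimum and maximum often expose impossible values (such as a negative age) before any modeling starts.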


Issues to consider during data integration.

Per Han et al. (2012), some of the major issues in data integration are:

  • Schemas from different sources may differ in the meaning and interpretation of data values, which leads to a difficult entity identification problem. For example, a person’s social security number could be named SSN in one database and Social Security Number in another. Without a clear data dictionary and a shared understanding, matching attributes can be quite challenging, especially when there are many variables.
  • Data structure can be another issue. When attributes from one database are matched with attributes of another, the dependencies and constraints in the source and the target should agree. For example, if one system classifies a customer as a “special” customer with privileges (e.g., product promotion alerts, discounts) and another system classifies the same customer differently, the two records conflict.
  • Duplicate observations lead to data redundancy, which, if undetected, leads to incorrect computations and incorrect use of the data.
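The first and third issues above can be sketched in a few lines of Python: a small data dictionary maps each source’s column names to one canonical name, and records that share the same identifier are kept only once. Everything here is hypothetical and for illustration only, including the `CANONICAL_NAMES` mapping, the function names, and the sample records (the identifiers are dummy values).

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical data dictionary: source column name -> canonical attribute name.
CANONICAL_NAMES = {
    "SSN": "ssn",
    "Social Security Number": "ssn",
    "Name": "name",
    "Full Name": "name",
}


def to_canonical(record: Record) -> Record:
    """Rename known columns to their canonical names; unknown columns are kept as-is."""
    return {CANONICAL_NAMES.get(key, key): value for key, value in record.items()}


def merge_without_duplicates(*sources: List[Record]) -> List[Record]:
    """Combine sources, keeping the first record seen for each canonical ssn."""
    seen = set()
    merged: List[Record] = []
    for source in sources:
        for record in source:
            canonical = to_canonical(record)
            if canonical["ssn"] in seen:
                continue  # duplicate observation, skip it
            seen.add(canonical["ssn"])
            merged.append(canonical)
    return merged


# Made-up records from two systems that name the same attributes differently.
system_a = [
    {"SSN": "000-00-0001", "Name": "Ann Lee"},
    {"SSN": "000-00-0002", "Name": "Raj Rao"},
]
system_b = [
    {"Social Security Number": "000-00-0002", "Full Name": "Raj Rao"},
    {"Social Security Number": "000-00-0003", "Full Name": "Mia Chen"},
]
merged = merge_without_duplicates(system_a, system_b)
```

Keeping the first record seen is a simplification. In practice the duplicates may disagree (the “special customer” case above), so a real integration would need a rule for resolving conflicts rather than silently dropping the later record.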


Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed., pp. 44-56). Waltham, MA: Morgan Kaufmann.

Kandel, S., Heer, J., Plaisant, C., Kennedy, J., van Ham, F., Riche, N. H., Weaver, C., Lee, B., Brodbeck, D., & Buono, P. (2011). Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization, 0(0), 1-18.


Re: Data Preprocessing

Hi Chris Hemedinger,


Thank you for your review and feedback on this article. I appreciate it.


Murali Sastry
