09-30-2016 04:17 PM
As part of the coursework for my MS in Analytics program, I had an opportunity to discuss data quality strategy for improving data quality at the source. I wanted to share the discussion post with the SAS community.
As data scientists and analysts, we are widely reported to spend roughly 60%-80% of our time understanding data quality and detecting and correcting data quality issues.
Per McCafferty's (2015) survey of companies, those without a centralized data quality plan and strategy saw a negative impact on business performance compared to companies with a well-laid-out data quality strategy. Today, software tools exist to detect and profile data quality at all data touch points and to report on data quality performance. In addition, organizations need sustained sponsorship, monitoring, and improvement of data quality metrics. According to Geiger (2004), The Data Warehousing Institute estimates that data quality issues cost US businesses $600 billion annually.
Source Data Quality Issues: Data quality is suspect at best as data flows from disparate sources (relational databases, flat files, social media, video and audio files, third-party sources, demographic data, etc.) into the organization’s data warehouse. The most frequent data errors are missing data (51%), outdated information (48%), and inaccurate data (44%), per McCafferty (2015).
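To make the three error categories concrete, here is a minimal profiling sketch in Python. Everything in it is an assumption for illustration: the record layout, the field names (email, age, updated), the two-year staleness cutoff, and the 0-120 age range are hypothetical and do not come from the McCafferty survey.

```python
from datetime import date

# Hypothetical customer records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "email": "a@example.com", "age": 34, "updated": date(2016, 8, 1)},
    {"id": 2, "email": None, "age": 41, "updated": date(2016, 9, 1)},
    {"id": 3, "email": "c@example.com", "age": 212, "updated": date(2013, 1, 15)},
    {"id": 4, "email": "", "age": 29, "updated": date(2012, 6, 30)},
]


def profile(records, as_of, max_age_days=730):
    """Return the percentage of records that are missing, outdated, or inaccurate.

    Assumed rules: an empty or None email counts as missing, a record not
    updated within max_age_days counts as outdated, and an age outside
    0-120 counts as inaccurate.
    """
    total = len(records)
    if total == 0:
        return {"missing_pct": 0.0, "outdated_pct": 0.0, "inaccurate_pct": 0.0}
    missing = sum(1 for r in records if not r["email"])
    outdated = sum(1 for r in records if (as_of - r["updated"]).days > max_age_days)
    inaccurate = sum(1 for r in records if not 0 <= r["age"] <= 120)
    return {
        "missing_pct": 100.0 * missing / total,
        "outdated_pct": 100.0 * outdated / total,
        "inaccurate_pct": 100.0 * inaccurate / total,
    }


report = profile(RECORDS, as_of=date(2016, 9, 30))
print(report)
```

A report like this, run at each data touch point, gives a baseline that can be tracked over time; the actual rules would have to come from the business owners of each field.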
Poor Data Quality Causes: Causes include:
Data Quality Improvement Strategy: (Cantin, 2011; Couture, 2013)
Summary: The keys to a successful data quality improvement strategy are balancing organizational priorities, getting buy-in from stakeholders and the leadership team, securing access to needed resources (including software tools and people), measuring and monitoring the current state, establishing a future state with timelines, and holding organizational personnel accountable for the strategy. I understand that this is easier said than done. As data scientists, we need to gain the trust and engagement of sponsors to move data quality improvement efforts in the right direction.
Please feel free to share your feedback, comments, and suggestions on how to improve this article. Thank you.
Cantin, M. (2011). Making your organization care about data quality. Business Intelligence Journal, 16(4), 43–52.
Couture, N. (2013). Implementing an enterprise data quality strategy. Business Intelligence Journal,
Geiger, J. (2004). Data quality management: The most critical initiative you can implement. SUGI 29 Conference Proceedings, 1–14. http://www2.sas.com/proceedings/sugi29/098-29.pdf
Johnson, T. (2004). Data quality and data cleaning: An overview. AT&T Labs Research.
McCafferty, D. (2015, February 10). Why poor data quality is impacting profits. CIO Insight, p. 2.
10-06-2016 11:07 AM
Thank you very much for your question. I would love to present it at a SAS conference if there is sufficient audience interest in the content. A data quality strategy could also involve poka yoke (mistake proofing) to avoid having to deal with data quality issues in the first place, so the design of data collection and data management would be key.
Poka yoke is a popular technique used in manufacturing to prevent defects from occurring in the first place and to keep any that do occur from passing through.
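As a rough illustration of what mistake proofing could look like at the point of data collection, here is a small Python sketch that refuses to store an entry until it passes basic checks. The field names (name, email, birth_date), the simple email pattern, and the in-memory table are all hypothetical; a real system would apply rules agreed with the data owners.

```python
import re
from datetime import date

# Deliberately simple pattern, assumed for illustration; not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_entry(entry, today):
    """Return a list of problems; an empty list means the entry may be stored."""
    problems = []
    if not (entry.get("name") or "").strip():
        problems.append("name is required")
    if not EMAIL_PATTERN.match(entry.get("email") or ""):
        problems.append("email is not well formed")
    birth = entry.get("birth_date")
    if birth is None or birth > today:
        problems.append("birth_date is missing or in the future")
    return problems


def store(entry, table, today):
    """Append the entry to the table only if it passes every check."""
    problems = validate_entry(entry, today)
    if problems:
        # Reject at the source so the defect never reaches the warehouse.
        raise ValueError("; ".join(problems))
    table.append(entry)


table = []
today = date(2016, 10, 6)
store({"name": "Ann", "email": "ann@example.com", "birth_date": date(1980, 5, 2)}, table, today)

try:
    store({"name": " ", "email": "not-an-email", "birth_date": date(2020, 1, 1)}, table, today)
except ValueError as err:
    print("Rejected:", err)
```

The idea is the same as a fixture on a production line: the bad entry cannot be stored, so nothing has to be cleaned up downstream.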
10-06-2016 11:15 AM
Yes!! I'm a big fan of Poka Yoke. I would be very interested to read more on applying it (and other lean principles) to data science.
I hope you do present at a SAS conference. I've been enjoying reading papers and presentations from conferences on sasCommunity.org, and hope to present someday myself.
10-07-2016 07:07 AM - edited 10-07-2016 07:35 AM
I will have an abstract and the paper on this topic ready by the next conference in September 2017 (hopefully).