Murali11
Obsidian | Level 7

Introduction:

As part of my coursework for an MS in Analytics program, I had an opportunity to discuss a data quality strategy for addressing quality issues in source data. I wanted to share that discussion post with the SAS community.

As data scientists and analysts, we commonly hear that roughly 60%-80% of our time is spent on understanding data quality and on detecting and correcting data quality issues.

Per McCafferty's (2015) survey of companies, those without a centralized data quality plan and strategy saw a negative impact on business performance compared with companies that had a well-laid-out data quality strategy. Today, software tools exist to detect and profile data quality at all data touch points and to report on data quality performance. In addition, organizations need sustained sponsorship, monitoring, and improvement of data quality metrics. According to Geiger (2004), The Data Warehousing Institute estimates that data quality issues cost US businesses $600 billion annually.

Source Data Quality Issues: Data quality is suspect at best as data moves from disparate sources (relational databases, flat files, social media, video and audio files, 3rd party sources, demographic data, etc.) into the organization’s data warehouse. Per McCafferty (2015), the most frequent data errors are missing data (51%), outdated information (48%), and inaccurate data (44%).

Poor Data Quality Causes:

  • Inconsistencies, typos, non-standard formats, unverified duplicate entries, and measurement errors.
  • Data transmission issues between data sources and data targets.
  • Data retrieval issues, including not understanding the source data, the intent of null values, or the details of missing data, and incompatible source data formats (Johnson, 2004).

Data Quality Improvement Strategy: (Cantin, 2011; Couture, 2013)

  1. Establish Current State: Profile data to understand its current state, and visualize data statistics with graphs and charts for clarity, e.g., lost revenue, profitability impact, sales and manufacturing impact, and regulatory impact.
  2. Request Sponsorship: Request an audience with the organization's senior leadership team, including IT and business process stakeholders, to gain sponsorship and support. Present current state statistics and data quality graphs with an improvement proposal. The proposal should:
    1. define the data quality improvement organization structure and resources, including a Responsible, Accountable, Consulted, and Informed (RACI) matrix with role descriptions (data quality stewards, data administrators, data entry personnel, etc.);
    2. provide cost-benefit-risk analysis scenarios and propose a timeline for data quality improvement; and
    3. be revised based on leadership feedback.
  3. Communicate: Communicate the current state of data quality at touch points, e.g., data sources, data retrieval, data entry, IT, and suppliers (if 2nd and 3rd party sources are involved), and provide a road map for the Data Quality Improvement Strategy.
  4. Develop Organizational Accountabilities: Establish business process owner accountabilities and rationale (providing measurable requirements and metrics to data sources) and data source accountabilities (demonstrating compliance with those requirements and metrics), so that communication channels exist to improve data quality.
  5. Measure and Monitor: Review industry benchmarks and survey results (such as McCafferty, 2015) on profitability and cost savings, and on reliability improvements in new customer and prospect data (e.g., profit improvement from better sales and prospect data, and cost avoidance from an accurate SKU inventory), and communicate the benefits of the data quality improvement strategy.
  6. Corrective and Preventive Strategy: The Data Quality Improvement Strategy should include both corrective and preventive measures, and should demonstrate the improvement in data quality (consistency, completeness, timeliness, accuracy, etc.) and its trajectory.
  7. Data Quality Metrics Dashboard: Establish data quality metrics for the target condition after improvement, covering accuracy, consistency, completeness, and timeliness, across the stages of data touch points from data definition through data interpretation.
  8. Ongoing Communication: Provide periodic updates (weekly, bi-weekly, and monthly) to sponsors and to the teams that are correcting and preventing data quality issues. Communicate target condition vs. actual metrics during these meetings with stakeholders, data sources, end users, and business process owners to build a common understanding.
  9. Continuous Engagement: Entertain questions and dialogue from across the organization to engage people in the improvement strategy.
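
To make steps 1 and 7 a little more concrete, here is a minimal Python sketch of profiling a data set and comparing actual metrics against target conditions. Everything in it is an assumption for illustration only: the sample records, the field names (id, email, updated), the 365-day freshness window, and the target values are hypothetical and do not come from any of the cited sources.

```python
from datetime import date

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2016, 9, 1)},
    {"id": 2, "email": None, "updated": date(2014, 1, 15)},
    {"id": 3, "email": "c@example.com", "updated": date(2016, 8, 20)},
    {"id": 3, "email": "c@example.com", "updated": date(2016, 8, 20)},
]

def profile(rows, as_of, max_age_days=365):
    """Return simple data quality metrics as fractions between 0 and 1."""
    total = len(rows)
    # Completeness: share of rows with no missing (None) values.
    complete = sum(1 for r in rows if all(v is not None for v in r.values()))
    # Uniqueness: share of distinct ids among all rows (duplicates lower it).
    unique_ids = len({r["id"] for r in rows})
    # Timeliness: share of rows updated within the allowed age window.
    fresh = sum(1 for r in rows if (as_of - r["updated"]).days <= max_age_days)
    return {
        "completeness": complete / total,
        "uniqueness": unique_ids / total,
        "timeliness": fresh / total,
    }

# Hypothetical target conditions agreed with sponsors.
targets = {"completeness": 0.95, "uniqueness": 1.0, "timeliness": 0.90}

def dashboard(metrics, targets):
    """Pair each metric with its target and a met / not-met flag."""
    return {
        name: (metrics[name], targets[name], metrics[name] >= targets[name])
        for name in targets
    }

metrics = profile(records, as_of=date(2016, 10, 1))
for name, (actual, target, met) in dashboard(metrics, targets).items():
    status = "OK" if met else "BELOW TARGET"
    print(f"{name:<13} actual={actual:.2f} target={target:.2f} {status}")
```

The same target vs. actual pairs could feed the periodic updates in step 8. A real profile would of course cover accuracy and consistency checks as well, which depend on business rules specific to each data source.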

Summary: The key to a successful data quality improvement strategy is balancing organizational priorities, getting buy-in from stakeholders and the leadership team, securing access to needed resources (both software tools and people), measuring and monitoring the current state, establishing the future state with timelines, and ensuring that organizational personnel are accountable for the strategy. I understand that this is easier said than done. As data scientists, we need to gain the trust and engagement of sponsors to steer data quality improvement efforts in the right direction.

Please feel free to share feedback or comments on how to improve this article. Thank you.

References:

Cantin, M. (2011). Making your organization care about data quality. Business Intelligence Journal, 16(4), 43–52.

Couture, N. (2013). Implementing an enterprise data quality strategy. Business Intelligence Journal, 18(4), 46–51.

Geiger, J. (2004). Data quality management: The most critical initiative you can implement. SUGI 29 Conference Proceedings, 1–14. http://www2.sas.com/proceedings/sugi29/098-29.pdf

Johnson, T. (2004). Data quality and data cleaning: An overview. AT&T Labs Research.

McCafferty, D. (2015, February 10). Why poor data quality is impacting profits. CIO Insight, p. 2.

5 REPLIES
paulkaefer
Lapis Lazuli | Level 10

Thanks for sharing your work. Do you plan to publish this somewhere? Perhaps a conference?

Murali1
Calcite | Level 5

Hi paulkaefer,

 

Thank you very much for your question. I would love to present it at a SAS conference if there is sufficient interest in the content. Also, a data quality strategy could involve poka yoke (mistake proofing) to prevent data quality issues from arising in the first place, so the design of data collection and data management would be key.

 

Poka Yoke is a popular technique used in manufacturing to keep defects from passing through and to prevent them from happening in the first place.
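
As a rough illustration of mistake proofing at the point of data entry, here is a minimal Python sketch that rejects a record before it is stored rather than cleaning it up afterwards. The field names (customer_id, email, order_date), the simple email pattern, and the rules themselves are hypothetical and only meant to show the idea.

```python
import re
from datetime import date

# Deliberately simple pattern for illustration; not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_entry(entry, today):
    """Return a list of problems found in one data entry record."""
    errors = []
    if not entry.get("customer_id"):
        errors.append("customer_id is required")
    email = entry.get("email") or ""
    if not EMAIL_PATTERN.match(email):
        errors.append("email is missing or malformed")
    order_date = entry.get("order_date")
    if order_date is None or order_date > today:
        errors.append("order_date is missing or in the future")
    return errors

def accept(entry, store, today):
    """Store the entry only if it passes every check; otherwise return the errors."""
    errors = validate_entry(entry, today)
    if errors:
        return errors
    store.append(entry)
    return []

store = []
today = date(2016, 10, 1)
good = {"customer_id": "C1", "email": "a@example.com", "order_date": date(2016, 9, 30)}
bad = {"customer_id": "", "email": "not-an-email", "order_date": date(2017, 1, 1)}
print(accept(good, store, today))  # no errors, record is stored
print(accept(bad, store, today))   # errors returned, record is not stored
```

The point of the design is that the bad record never reaches the store, so there is nothing to detect and correct downstream.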

paulkaefer
Lapis Lazuli | Level 10

Yes!! I'm a big fan of Poka Yoke. I would be very interested to read more on applying it (and other lean principles) to data science.

 

I hope you do present at a SAS conference. I've been enjoying reading papers and presentations from conferences on sasCommunity.org, and hope to present someday myself.

Peter_C
Rhodochrosite | Level 12
Pity the Call for Papers for SAS Global Forum 2017 closed at the beginning of this week...
Murali1
Calcite | Level 5

Mr. Peter,

 

I will have an abstract and the paper on this topic ready by the next conference in September 2017 (hopefully).

 

Thank you,

 

Murali
