There is an old saying among analysts: garbage in, garbage out. It means that you can have the best model in the world, and the most perfect analytical process—but if the quality of your data is poor, your results are likely to be rubbish.
This is widely known. It is even widely accepted. The chief analyst of a major financial organization once told me they spend up to 90% of their time preparing data and solving quality issues! Think about what that means for analyst productivity. We also often find that many people simply assume that their data quality is good, without ever asking the right questions. This is not wholly unreasonable, since non-analysts simply want to get the right data and go. They don’t want to worry about data quality, or about any of the other data preparation steps that aren’t part of their motivation or skill set.
Fortunately, with SAS Viya - like many other SAS products - it is possible to take data quality almost for granted. There are several reasons for that, and you really need to know about them.
1. SAS Viya incorporates and enhances many of the great data quality features that were available in SAS 9
About 20 years ago, SAS acquired DataFlux, which at that time was a market leader in data quality. DataFlux’s quality algorithms were gradually incorporated into SAS 9 Data Management solutions, making the platform very capable from a data quality perspective. When we started envisioning the Viya platform, we knew that we couldn’t simply import the data quality framework from SAS 9 as is, because the architecture was fundamentally different. However, we kept our focus on embedding the data quality know-how, reliable algorithms, and proven ways of working with data quality in SAS Viya.
2. There are lots of different ways to manage data quality in SAS Viya
SAS Viya has a number of tools and options for managing data quality. For example, SAS Data Studio has many data quality transformations. These include standardization, parsing, and duplicate record removal. There are even AI-based data quality suggestions that can help you detect where quality could be improved. SAS is always transparent, so data quality actions can also be called programmatically from SAS code. All these data quality features are based on the SAS Quality Knowledge Base: an agreed repository of definitions for data quality. Unlike in SAS 9, all language locales are included in Viya, so there is no need to pick and choose between them.
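To make standardization and duplicate record removal concrete, here is a minimal sketch in plain Python. It does not use SAS Data Studio or the SAS Quality Knowledge Base; the abbreviation table, function names, and sample records are all hypothetical and only illustrate the general idea of standardizing values before matching them.

```python
# Hypothetical lookup table; a real knowledge base holds far richer, locale-aware rules.
STREET_ABBREVIATIONS = {
    "st": "Street", "st.": "Street",
    "rd": "Road", "rd.": "Road",
    "ave": "Avenue", "ave.": "Avenue",
}


def standardize_name(raw):
    """Collapse repeated whitespace and apply title case to a personal name."""
    return " ".join(raw.split()).title()


def standardize_address(raw):
    """Collapse repeated whitespace and expand common street abbreviations."""
    return " ".join(STREET_ABBREVIATIONS.get(word.lower(), word) for word in raw.split())


def remove_duplicates(records):
    """Keep the first record for each standardized (name, address) pair."""
    seen = set()
    unique = []
    for record in records:
        clean = {
            "name": standardize_name(record["name"]),
            "address": standardize_address(record["address"]),
        }
        # Compare case-insensitively so "12 main street" matches "12 Main Street".
        key = (clean["name"].lower(), clean["address"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(clean)
    return unique


customers = [
    {"name": "  john   SMITH ", "address": "12 Main St."},
    {"name": "John Smith", "address": "12 main street"},
    {"name": "Ana Diaz", "address": "4 Oak Rd"},
]
deduplicated = remove_duplicates(customers)
```

The first two records only match once both have been standardized, which is why standardization usually comes before duplicate removal.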
3. Data quality is now being introduced into SAS Studio Flow via data quality custom steps
Now that SAS Viya is moving towards SAS Studio for developers, the data quality features are gradually being introduced into Studio as well. This is happening via data quality custom steps, which are add-on capabilities that are sourced either directly from SAS or via the SAS community. The first custom steps add enrichment capabilities. They include Verify & Geocode Addresses, Verify Phone Numbers, and Verify Email Addresses. Survivorship and Entity Resolution are in the pipeline.
4. SAS Information Catalog makes it possible to search for data across sources
One of the issues with data quality is being able to obtain and bring together data from multiple sources, especially heterogeneous data sources such as databases and cloud data storage. The challenge is to find the data you want and need, especially if you don’t know exactly what data you are looking for, or where it is stored. The SAS Information Catalog, embedded into SAS Viya 4, makes this easy. It is based on data agents that extract metadata at regular intervals, and it provides an overview of data content and quality with descriptive measures and content-driven graphs. It behaves a bit like a Google search for your own data and enables you to easily find all tables related to your search string, in prioritized order by ‘best match’.
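The two ideas in this section, descriptive measures per column and a ranked search over metadata, can be sketched in a few lines of Python. This is only an illustration and not how SAS Information Catalog is implemented; the catalog structure, the scoring rule (count of matching search terms), and all table and column names are assumptions made for the example.

```python
def profile_column(values):
    """Return simple descriptive measures for one column of values."""
    non_missing = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": len(set(non_missing)),
    }


def search_catalog(catalog, query):
    """Rank table names by how many query terms occur in their metadata text."""
    terms = query.lower().split()
    scored = []
    for table, meta in catalog.items():
        text = " ".join([table, *meta["columns"]]).lower()
        score = sum(term in text for term in terms)
        if score:
            scored.append((score, table))
    # Highest score first; ties are broken alphabetically for a stable order.
    return [table for _, table in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical metadata, as an agent might have collected it.
catalog = {
    "customer_orders": {"columns": ["customer_id", "order_date", "amount"]},
    "customer_master": {"columns": ["customer_id", "email", "country"]},
    "web_events": {"columns": ["session_id", "page", "timestamp"]},
}
results = search_catalog(catalog, "customer email")
email_profile = profile_column(["a@x.com", "", None, "a@x.com", "b@y.org"])
```

Here `customer_master` ranks first because it matches both search terms, while `web_events` is left out because it matches neither.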
5. SAS Lineage provides an overview of data items to show you connections between them
Lineage helps you understand the origins of data, and the relationships with other data objects. When you open up the tool from one data item, be it a table, data job or model, you can see all its connections. This helps you to understand the interconnections and dependencies in your data. This is important, because it shows you how changes in one piece of data or table can affect others, both upstream and downstream. It therefore helps you to understand the data ‘big picture’ more clearly.
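Upstream and downstream dependencies can be pictured as a directed graph. The sketch below is a generic Python illustration of that idea, not the SAS Lineage API; the edge list and every item name in it are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each pair means "the first item feeds the second".
EDGES = [
    ("crm_customers", "clean_customers_job"),
    ("clean_customers_job", "customers_clean"),
    ("customers_clean", "churn_model"),
    ("web_events", "churn_model"),
]


def build_graph(edges, reverse=False):
    """Build an adjacency mapping; reverse=True flips the edges to walk upstream."""
    graph = {}
    for source, target in edges:
        if reverse:
            source, target = target, source
        graph.setdefault(source, []).append(target)
    return graph


def reachable(start, edges, reverse=False):
    """Return every item reachable from start using a breadth-first search."""
    graph = build_graph(edges, reverse)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Items affected if customers_clean changes, and items it depends on.
downstream = reachable("customers_clean", EDGES)
upstream = reachable("customers_clean", EDGES, reverse=True)
```

In this example a change to `customers_clean` affects only `churn_model` downstream, while the table itself depends on `clean_customers_job` and, through it, on `crm_customers`.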
This was a very brief introduction to the SAS data quality tools available in SAS Viya. You can find out more about the SAS approach to data quality here.