Standardization Schemes versus Standardization Definitions: What's the Difference?

Last year, I posted a series on creating, applying, and converting standardization schemes using SAS Data Quality programming and the Quality Knowledge Base. If you’re familiar with SAS Data Quality tools, you’ve probably seen or used standardization definitions before. Every QKB locale includes a set of definitions for data cleansing tasks like standardization, entity resolution, casing, parsing, extraction, and more.

You may be wondering: what exactly are standardization schemes and standardization definitions? What’s the difference? Is there a difference? How do you use them? Is a scheme better than a definition (or vice versa)?

In this article, I’ll help you differentiate between schemes and definitions. As the name suggests, both tools are used for standardizing data. Though a scheme and a similar definition can produce the same results with the same input data, this is not always the case. The primary difference between schemes and definitions is that they’re performing different operations under the hood. Let’s take a closer look at both.

Standardization schemes

Standardization schemes are simple lookup tables used to standardize variations of the same value. A scheme contains a list of variations on data values and their associated standard value. When applied to a variable, a “find and replace” method is used to replace relevant values with the standard value.

It can be beneficial to create custom standardization schemes if your data includes unique variations or data types. In previous articles on PROC DQSCHEME, I created a scheme to standardize the appearance of car models and manufacturers in my data.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

Today, I’ll explore a different scenario. In this example, I want to establish a standard for various city names in North Carolina. My original data includes quite a bit of variation in casing and spelling.

I’ll use PROC DQSCHEME to create a scheme, CITY_SCHEME, based on this data.

proc dqscheme data=nc_cities noqkb;
    create matchdef='City' var=city scheme=city_scheme locale='ENUSA';
run;

Then, I’ll apply the new scheme using PROC DQSCHEME once more.

proc dqscheme data=nc_cities out=nc_cities_scheme noqkb;
   apply scheme=city_scheme var=city;
run;

My original data looks a lot neater now! However, be mindful that custom schemes will only transform values that appear in the scheme. Let’s say I have a similar set of data for South Carolina city names.

What happens if I try to use my CITY_SCHEME on this data?

My custom scheme doesn't work here. Since schemes use a “find and replace” method, I’d have to update my scheme to include these values.

Note that in this section I’ve discussed custom standardization schemes. The QKB also comes with a set of existing schemes that standardize specific tokens in a string. These schemes are the foundation for QKB definitions.

Standardization definitions

Standardization definitions are more complex than schemes. Instead of using a lookup table, a standardization definition uses a combination of data quality operations to transform data. Input values are first parsed into separate tokens. For example, a person's whole name would be parsed into first, middle, and last name, plus any prefixes or suffixes. Then, each token is transformed individually according to the definition rules. Often, a standardization definition will include a standardization scheme for each token in a definition. At the end, every token is concatenated back into one output string.

I’ll return to my city name example. What if I simply apply the ‘City’ standardization definition to the NC city variable?

data nc_cities_stnd;
    set nc_cities;
    length def_out $15;
    def_out=dqstandardize(city, 'City', 'ENUSA');
    drop state;
run;

This works with my North Carolina data and gives a result identical to the previous. Now I’ll apply this definition to the SC city variable, which I could not successfully standardize before.

This time, I get my ideal result. The South Carolina city names were standardized without any updates needed.

I can play around with this definition and try to standardize a series of completely random cities as well.

The results aren’t perfect (I’d prefer city names to be spelled out instead of using acronyms) but it’s pretty good for an out-of-the-box definition.

Summary

I’ve demonstrated two different methods to standardize city names. In this example, I was able to produce similar results at least once with each methods. Despite this, it’s important to remember that standardization schemes and standardization definitions differ in methodology.

Standardization schemes are great for standardizing a specific data set, like I did with the NC city data, or standardizing values that don’t have out-of-the-box definitions, like I did with car data in my PROC DQSCHEME article. When you make a scheme, you build a lookup table that knows which values to replace and what standard value to replace them with. You can standardize values but you’re limited to only the values that are present in the scheme lookup table.

Standardization definitions are great for transforming input values by using DQ tools and definition rules. These rules can be applied to any input value, which makes standardization definitions flexible. However, you forfeit some control by using them. While there are a wide variety of standardization definitions available (see the English, United States definitions), they don’t cover every possible data type. These definitions aren’t perfect and can output a value that you don’t want depending on your standardization preference and the rules of the definition (which vary by type and locale). You can standardize any input value but you can’t control how the value is standardized, so you may have undesired output.

In this article, I aimed to demystify the difference between standardization schemes and standardization definitions. Though they sound similar, there’s quite a difference between these two tools! Visit the most recent QKB documentation to learn more about schemes and definitions. If you’re interested learning more about SAS Data Quality, check out my series on custom standardization schemes or my colleague’s series on coding for data quality in SAS Viya.

SAS Communities Library

Standardization Schemes versus Standardization Definitions: What's the Difference?

Standardization schemes

Standardization definitions

Summary

Manipulating Data in Base SAS® Part 7 - Standardize

Accessing standardized residuals: arima

Dealing with Pod Security Policies and Standards (in a SAS Viya Deploy...

Estimating the Standard Deviation of a Variable in a Finite Population

To standardize data or not to standardize data - that is the question

Follow Us

What is...

SAS Communities Library

Standardization Schemes versus Standardization Definitions: What's the Difference?

Standardization schemes

Standardization definitions

Summary

Our biggest data and AI event of the year.

SAS AI and Machine Learning Courses

Manipulating Data in Base SAS® Part 7 - Standardize

Accessing standardized residuals: arima

Dealing with Pod Security Policies and Standards (in a SAS Viya Deploy...

Estimating the Standard Deviation of a Variable in a Finite Population

To standardize data or not to standardize data - that is the question

Follow Us

What is...