BookmarkSubscribeRSS Feed

Standardization Schemes versus Standardization Definitions: What's the Difference?

Started ‎04-09-2024 by
Modified ‎04-09-2024 by
Views 259

Last year, I posted a series on creating, applying, and converting standardization schemes using SAS Data Quality programming and the Quality Knowledge Base. If you’re familiar with SAS Data Quality tools, you’ve probably seen or used standardization definitions before. Every QKB locale includes a set of definitions for data cleansing tasks like standardization, entity resolution, casing, parsing, extraction, and more.

 

You may be wondering: what exactly are standardization schemes and standardization definitions? What’s the difference? Is there a difference? How do you use them? Is a scheme better than a definition (or vice versa)?

 

In this article, I’ll help you differentiate between schemes and definitions. As the name suggests, both tools are used for standardizing data. Though a scheme and a similar definition can produce the same results with the same input data, this is not always the case. The primary difference between schemes and definitions is that they’re performing different operations under the hood. Let’s take a closer look at both.

 

Standardization schemes

 

Standardization schemes are simple lookup tables used to standardize variations of the same value. A scheme contains a list of variations on data values and their associated standard value. When applied to a variable, a “find and replace” method is used to replace relevant values with the standard value.

 

It can be beneficial to create custom standardization schemes if your data includes unique variations or data types. In previous articles on PROC DQSCHEME, I created a scheme to standardize the appearance of car models and manufacturers in my data.

 

grbarn_stnd_1.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

Today, I’ll explore a different scenario. In this example, I want to establish a standard for various city names in North Carolina. My original data includes quite a bit of variation in casing and spelling.

 

grbarn_stnd_2.png

 

I’ll use PROC DQSCHEME to create a scheme, CITY_SCHEME, based on this data.

 

 

proc dqscheme data=nc_cities noqkb;
    create matchdef='City' var=city scheme=city_scheme locale='ENUSA';
run;

 

grbarn_stnd_3.png

 

Then, I’ll apply the new scheme using PROC DQSCHEME once more.

 

 

proc dqscheme data=nc_cities out=nc_cities_scheme noqkb;
   apply scheme=city_scheme var=city;
run;

 

 

grbarn_stnd_4.png

 

My original data looks a lot neater now! However, be mindful that custom schemes will only transform values that appear in the scheme. Let’s say I have a similar set of data for South Carolina city names.

 

grbarn_stnd_5.png

 

What happens if I try to use my CITY_SCHEME on this data?

 

grbarn_stnd_6.png

 

My custom scheme doesn't work here. Since schemes use a “find and replace” method, I’d have to update my scheme to include these values.

 

Note that in this section I’ve discussed custom standardization schemes. The QKB also comes with a set of existing schemes that standardize specific tokens in a string. These schemes are the foundation for QKB definitions.

 

Standardization definitions

 

Standardization definitions are more complex than schemes. Instead of using a lookup table, a standardization definition uses a combination of data quality operations to transform data. Input values are first parsed into separate tokens. For example, a person's whole name would be parsed into first, middle, and last name, plus any prefixes or suffixes. Then, each token is transformed individually according to the definition rules. Often, a standardization definition will include a standardization scheme for each token in a definition. At the end, every token is concatenated back into one output string.

 

I’ll return to my city name example. What if I simply apply the ‘City’ standardization definition to the NC city variable?

 

 

data nc_cities_stnd;
    set nc_cities;
    length def_out $15;
    def_out=dqstandardize(city, 'City', 'ENUSA');
    drop state;
run;

 

grbarn_stnd_7.png

 

This works with my North Carolina data and gives a result identical to the previous. Now I’ll apply this definition to the SC city variable, which I could not successfully standardize before.

 

grbarn_stnd_8.png

 

This time, I get my ideal result. The South Carolina city names were standardized without any updates needed.

 

I can play around with this definition and try to standardize a series of completely random cities as well.

 

grbarn_stnd_9.png

 

The results aren’t perfect (I’d prefer city names to be spelled out instead of using acronyms) but it’s pretty good for an out-of-the-box definition.

 

Summary

 

I’ve demonstrated two different methods to standardize city names. In this example, I was able to produce similar results at least once with each methods. Despite this, it’s important to remember that standardization schemes and standardization definitions differ in methodology.

 

Standardization schemes are great for standardizing a specific data set, like I did with the NC city data, or standardizing values that don’t have out-of-the-box definitions, like I did with car data in my PROC DQSCHEME article. When you make a scheme, you build a lookup table that knows which values to replace and what standard value to replace them with. You can standardize values but you’re limited to only the values that are present in the scheme lookup table.

 

Standardization definitions are great for transforming input values by using DQ tools and definition rules. These rules can be applied to any input value, which makes standardization definitions flexible. However, you forfeit some control by using them. While there are a wide variety of standardization definitions available (see the English, United States definitions), they don’t cover every possible data type. These definitions aren’t perfect and can output a value that you don’t want depending on your standardization preference and the rules of the definition (which vary by type and locale). You can standardize any input value but you can’t control how the value is standardized, so you may have undesired output.

 

In this article, I aimed to demystify the difference between standardization schemes and standardization definitions. Though they sound similar, there’s quite a difference between these two tools! Visit the most recent QKB documentation to learn more about schemes and definitions. If you’re interested learning more about SAS Data Quality, check out my series on custom standardization schemes or my colleague’s series on coding for data quality in SAS Viya.

Version history
Last update:
‎04-09-2024 10:18 AM
Updated by:
Contributors

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels