We’re smarter together. Learn from this collection of community knowledge and add your expertise.

9 ways to standardize data using SAS

by SAS Super FREQ on ‎12-01-2014 02:34 PM - edited on ‎10-05-2015 04:00 PM by Community Manager (2,386 Views)

With a piping hot cup of coffee in hand, you’re ready to dive into your day and tackle a new project. First on the agenda? Design a process to convert all variations of US state names to their associated United States Postal Service standard two-character abbreviation. It’s the simplest of transformations—take one word or phrase and transform it into a known standard. Unfortunately, real world data can make this easy manipulation difficult. You grab another cup of coffee.

 

While plopping down back to your seat, you think about the crazy misspellings or rearranged text your program will encounter when it begins to evaluate US state names. Luckily different SAS applications give you many options to minimize the amount of work needed to achieve high quality results with your address data. Let’s look at the pros and cons of your choices when it comes to transforming data in a SAS environment – starting with the most robust solutions first.

 

  1. Standardization definitions and schemes in the SAS Data Management Platform
    Standardization definitions are special algorithms made available to various SAS products when they are integrated with the SAS Quality Knowledge Base (QKB). Standardization definitions are very powerful. They can contain domain-specific and locale-specific parsing, regular expression, transformation, and casing logic. Rather than relying on straight value replacement, standardization definitions can apply knowledge about the type of data under evaluation to transform the value into the correct form. In fact, variations of non-standard state data are built into each definition so you don’t have to supply this list yourself.

    Standardization schemes also transform data but they don’t have the power or flexibility of standardization definitions. They don’t parse data first and can’t use the same regular expressions or sophisticated casing logic that standardization definitions do.

    Both standardization definitions and schemes can be applied to data using standardization nodes in SAS Data Management Platform jobs. The nodes will ask for input data and return the standardized value as output based on the locale and definition you choose. Standardizations can be integrated within the rest of your data management job.

    If you need to make changes to a standardization definition, there is a QKB Customize module in the SAS Data Management Platform for that very purpose. It can be used to update and maintain the centralized QKB repository referenced by all SAS applications where QKB-based functionality is available.

     

  2. Standardization Transformations in SAS Data Integration Studio
    SAS Data Integration Studio offers two standardization transformations after SAS Data Quality has been installed, and both standardize values using the QKB:
    • Apply Lookup Standardization – uses schemes only
    • Standardize with Definition – uses standardization definitions, which provide additional functionality as we have discussed.
    Working with standardization definitions and schemes in SAS Data Integration Studio is much like using them in the SAS Data Management Platform. If your data workflow needs to incorporate other SAS functionality like analytic procedures, it may be best to do this kind of cleansing work using generated SAS code rather than in SAS Data Management Platform jobs. You can then use the SAS Data Management Platform as an enterprise Data Management server for delivering your standardizations as batch or real-time processes.

     

  3. Standardization definitions in SAS Data Quality Server through Base SAS
    Did you know that data quality transformations are available through the SAS Data Quality Server? There’s a PROC that uses QKB schemes and functions that use QKB standardization definitions, like the one available for US state values. Using standard SAS code, either of these techniques will convert varied US state names to the correct abbreviation.

    As a reminder, standardization definitions use parse logic and other advanced features for better results – while a scheme is a more basic transformation of words or phrases.

     

  4. SAS Data Quality Accelerator for Teradata
    It’s possible to generate standardized data directly within a database or distributed file system without having to move the data to a separate processing engine. The SAS Data Quality Accelerator for Teradata is the first product that makes QKB-based transformations available as functions inside the database. The main benefit of this approach is improved performance for large volumes of data since the data does not have to be moved out of the database to be processed and does not have to be retuned there when the processing is complete. Of course the downside is that only Teradata is supported right now though support for Hadoop is in the works.

     

  5. Expression node access to the QKB in the SAS Data Management Platform
    If you have the QKB configured for use with the SAS Data Management Platform, you can access QKB definitions from task-specific nodes. You can also access the definitions from the Expression node. In this node, you will find data quality functions that let you apply QKB definitions directly through expression code.

    When will this come in handy? Rather than applying the definition to every row of data, your expression code selectively applies a standardization definition to a subset of data or rows that contain a certain combination of values in other columns.

    Note that there is some coding involved with this option. You have to initialize the bluefusion() object and provide parameters to load the correct locale and definitions and then invoke the proper QKB function passing in a definition reference and data values to be standardized.

     

  6. SAS code
    The SAS programming language provides robust transformation capabilities through DATA Step code. To standardize US state values, you could write DATA step code that substitutes standard US state abbreviations for all variations of state names that you can think of.

    You could then create a SAS macro to make this transformation code available to other SAS programs. SAS users have been doing this type of work for decades. While this approach provides flexibility (i.e., you can run the process in other applications like SAS Data Integration Studio or SAS Enterprise Guide) the actual standardization values are buried in SAS programs, which is less than ideal. Also, you will not enjoy the sophisticated algorithms available in standardization definitions, unless you author those yourself in SAS.

     

  7. Expression node code in the SAS Data Management Platform
    Even without the QKB, you can write expression code in the Expression node to transform data. Your code could look for specific words or phrases and substitute the standard US state values where appropriate.

    However, just like using SAS code, the list of values you have to look for will be large and difficult to manage. It will also be very difficult to replicate the advanced algorithms that QKB definitions bring. Finally, all the expression code you write is marooned in this one SAS Data Management Platform job, so sharing it and maintaining it can be problematic.

     

  8. Reference Data Manager in the SAS Data Management Platform
    SAS Web Studio contains a component called Reference Data Manager. Reference Data Manager lets you create domains, which are lists of values that can be used as input to business rules. For example, you could have a lookup domain called “US States” that contains pair values for state name to abbreviation transformations. Once the domain is populated, it can be used in SAS Data Management Platform jobs through the Domains and Domain Items nodes.

    You can access non-standardized data in your job, compare the values to the US State reference data domain, read the standardized value from the domain, and update the value to its standard representation.

    The nice thing about Reference Data Manager is that it puts domain list management in the hands of business users with its browser-based and easy-to-use interface. The reference data repository also tracks item version information in case you need to revert to an older list of values.

    Limitations include manual creation of large lists of non-standard values that are mapped to standard values. You don’t get any of the QKB-based content and benefits that make the standardization process smarter and less manual.

     

  9. Table joins
    Last on my list are good old-fashioned table joins. You can use this approach with Base SAS, SAS Data Integration Studio, SAS Enterprise Guide or SAS Data Management Platform.

    You can develop a text file, table or a SAS DATA set that contains all the standard US state abbreviations and associated variations of full state names. Then use table join logic to append the standard value to your existing table where a matching non-standard value is found.

    Like the other solutions mentioned here, the maintenance of the external US state standard table can be difficult and error-prone. There is no managed process for updating the external table with additional non-standard values. And you’ll need to build governance policies to determine who can update the values in this table.

 

The good news? You have options.

Using US state values as an example, we explored many data standardization approaches. Know that the same tools and techniques can be applied to other data and to other processes beyond data standardization. If you aren’t familiar with the QKB, take a look at what it offers for standardization, parsing, match code generation, identification analysis and more.


With a dedicated Customize application that lets you extend standard definitions and add new ones, you have complete control over the quality of your data. Even better, the QKB integrates with several SAS applications so you can choose the execution environment that is right for you. The core difference lies in having a centralized repository of standardization rules versus keeping transformation rules buried in custom code, and scattered across the enterprise.


Are you now ready to start your standardization project?

Contributors
Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.