Detailing your Data in SAS Studio Part 4: Matching the Meaningful Data

Living in the digital age means that data is collected en masse about everything. Why? Because this information is invaluable. With the right data, a business can grow their customer base, optimize daily operations, or even prevent disaster. However, the most insightful data is often split between multiple data sets and stored in a variety of formats. One customer might appear in several datasets as William, Will, Bill, and Billy, sometimes without a unique customer ID. How do you find and combine that customer's data? The answer: match codes.

In this post, I'll show you how to generate match codes effortlessly in SAS Studio with the point-and-click Match Codes step. Then, I'll use my final codes to join two tables and combine essential information. Click Read more to continue our data quality journey with SAS Studio Flows!

The Match Codes Step

The Match Codes step is one of three steps available for enhancing data quality in SAS Studio Flows. This step enables you to create encoded representations of text strings. Choose from a plentiful selection of match definitions, ranging from addresses to names and more. Additionally, you can customize the complexity of your match code by adjusting the sensitivity level. Each node can generate up to 10 match code columns for one input data set. Visit the documentation for more information on step capabilities.

The Match Codes step (and other data quality steps) are available with the SAS Studio Engineer license.

Scenario

PURCHASE is a sample table in the SASDQREF library that stores information related to transactions. Each row represents one purchase, and each purchase contains data like the customer's name, company, and address, along with their order date, purchase amount, and more. However, quite a bit of supplemental information is missing from the purchase records: city and state values are missing, so we don't have the customer's full address, and phone numbers are missing, so we can't contact the customer about their purchase.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

Luckily, the missing information is stored in a related dataset. CONTACTS is another table in SASDQREF that contains mostly complete contact information for our customers.

We could supplement the missing purchase data by joining the PURCHASE and CONTACTS tables. Specifically, we can retrieve the non-missing contact information by matching each purchase to a customer. To test this, I’m using a Query step in a flow which is configured to do an inner join on PURCHASE.COMPANY=CONTACTS.COMPANY and PURCHASE.CONTACT=CONTACTS.CONTACT.

PURCHASE has 5,002 rows and CONTACTS has 3,276 rows. In theory, the join result would have 5,002 rows if we assume that every purchase was made by a customer that also appears in CONTACTS.

At first glance, my join results look promising. I see complete rows with full addresses and valid phone numbers. However, I can see that the purchase IDs are off. Purchase IDs begin at 1 and increase by 1 for each purchase, but my join results start with the ID 6 and then jump to 31. This means that several of my purchase records did not successfully join with their matching customer records. My result table has 925 rows in total, further proving that I'm missing a large percent of the purchase records.

Let's improve these results with match codes.

Match codes

Match codes are encoded representations of text strings. A match code is based on the source data's semantic type and the desired complexity of the final match code. This is useful for any situation where you need to identify values that are similar, but not exactly the same.

For example, a customer may appear as James Briggs in one table and Jim Briggs in another. These values are not equal. However, if I generated match codes for these values using the Name match definition and the same sensitivity (the default 85), the resulting match codes would be equal.

Therefore, we can use match codes for evaluating similar pieces of data in tasks like joins, where rows are typically matched based on equal values. This is exactly what we’ll do to join PURCHASE and CONTACTS more accurately.

Match definitions are available for several semantic types and variations of data. Some locales, such as ENUSA, also include generic 'Text' match definitions for strings that don't apply to any of the existing definitions. For more information on match codes, visit the QKB documentation.

I’ll create match codes for the COMPANY and CONTACT variables in DQREF.PURCHASE and DQREF.CONTACTS. I’ll edit my flow to connect Match Codes nodes to the source tables and select my desired QKB locale (ENUSA).

To configure this step, simply select a source column, select a match definition, and set the desired sensitivity. The default value is 85, but valid values range from 50 to 95. You can specify an output column name if desired. The default output column name will be InputColumnName_MCSensitivity.

After configuring the Match Codes nodes for each input table, I can review my results and ensure that match codes were generated as expected.

Both output tables now have the match columns COMPANY_MC85 and CONTACT_MC85. Now, we can update the Query step to do an inner join on

PURCHASE_MATCH.COMPANY_MC85=CONTACTS_MATCH.COMPANY_MC85 and PURCHASE_MATCH.CONTACT_MC85=CONTACTS_MATCH.CONTACT_MC85.

Immediately, the join results look more successful. The purchase ID starts at 1 and my output table has a total of 4,417 rows. This means that we could still be missing some matches. In the real world, I’d investigate which rows are missing, identify why they weren't joined, and improve my join method to achieve my full result.

For the sake of the demo, I'm going to investigate a bigger problem. If I look closer at the results, I see that several orders appear multiple times.

Why did this happen? I’ll edit my query to include PURCHASE.COMPANY, CONTACTS.COMPANY, PURCHASE.CONTACT, and CONTACTS.CONTACT for comparison.

Notice that for each purchase ID, the PURCHASE company and contact values stay the same, while the CONTACTS company and contact values change for each row. This indicates that there are duplicates in the CONTACTS table.

How can we handle this problem? The most efficient solution would be to remove the duplicates from CONTACTS with common entity resolution methods. Then, I can re-do the join after CONTACTS has been de-duplicated.

For this post, my first priority is getting an accurate count of the rows in this join. In other words, I want a count of the unique ID values in this dataset and I'd like to keep one row per purchase ID.

I'll use a Sort node to sort my results by ID, then remove rows with duplicate ID values. For more information on using the Sort step, visit the documentation or read my post on sorting methods in SAS Studio Flows.

After using this quick clean-up method, I can see that I successfully joined 3,186 rows. With the help of the Match Codes step, I was able to join three times as many records as I had in my initial join result. Though this result isn't perfect, it's an obvious improvement, and I could continue to improve by adjusting or adding additional match codes and updating my join methodology.

Summary

In this post, I’ve demonstrated how to use the Match Codes step to handle a tricky, imperfect join. If you want to learn more about improving your data quality, check out the other posts in this series:

Have you used match codes before? Do you find match codes useful for joins, entity resolutions in singular data tables, or something else? Share your thoughts, questions, and feedback below!

Find more articles from SAS Global Enablement and Learning here.

Detailing your Data in SAS Studio Part 4: Matching the Meaningful Data

Registration is open

SAS AI and Machine Learning Courses