Protecting PII with the Mask Data Step

1 Like

Personally Identifiable Information (PII) is the term used to describe data that can identify an individual. This includes information like names, addresses, email addresses, phone numbers, and more. It goes without saying that sensitive data should be protected from any potential bad actors to avoid problems like identity theft or fraud in the event of a data security breach. How can you protect sensitive data efficiently?

In this post, I'll show you how to use the Mask Data step in SAS Studio to hide sensitive information with ease. I'll cover three different data obfuscation methods available with this step: masking, hashing, and substitution. Keep reading to learn how to protect your data with only a few clicks!

The Mask Data Step

With the Mask Data step, you can mask sensitive data in SAS Studio Flows using three different obfuscation methods: masking, hashing, or substitution.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

You can choose which method best fits with your data privacy needs or your original data's sensitivity level. The details of each method will be discussed in further detail in their respective sections. Each node can generate up to 10 masked data columns (with any assortment/combination of methods) for one input data set. Visit the documentation for more information on step capabilities.

The Mask Data step is available with the SAS Studio Analyst license.

Scenario

In this post, I'll be using the CONTACTS sample table from the SASDQREF library. This table contains contact information for our customers, including their company, names, addresses, phone numbers, and more.

We'll use the Mask Data step to hide some of this sensitive information.

Masking

Data masking uses QKB standardization definitions to replace specific characters with asterisks. If you need to do simple masking, several general use definitions are available such as Mask All Characters, Mask All Characters Except First and Last, and Mask Digits. Definitions are also available for more specific tasks, like Mask Address Except Street Type, Mask E-mail Domain, and Mask E-mail Mailbox.

I'll use the masking method on CONTACT and ADDRESS values.

First, select a QKB locale on the Data Obfuscation tab. The default is the global locale, which contains definitions that can be applied to any data, regardless of locale. To get more specialized masking definitions (such as address masking), select a specific locale. I'll choose English (United States).

After selecting masking for the obfuscation method, select the appropriate data masking definition and mask column. I'll choose to Mask All Characters Except First and Last for the CONTACT column. I'll also choose to replace the existing column instead of creating a new column.

Under Additional Data Obfuscation, I'll add another Masking type obfuscation which uses the Mask Address Except Street Type definition on ADDRESS.

The results show that the respective mask definitions were applied appropriately. The majority of characters were masked using asterisks, while address numbers were masked with zeros. Additionally, the output includes a flag column for each transformation by default, displaying 1 if the column was transformed and 0 if not.

Hashing

Data hashing uses hashing algorithms to transform distinct values into random, nondescript character strings. The Mask Data step supports two types of hash algorithm: the Secure Hash Algorithm (SHA) and the Cyclic Redundancy Check (CRC). Regardless of the selected method, the output will be hashed and indistinguishable from its original presentation.

I'll update my flow to use the hashing method on the COMPANY column.

Under Additional Data Obfuscation, I'll select the Hashing obfuscation method, the SHA256 algorithm, and the COMPANY column.

The results show that each company name has been hashed into a 256-bit string which has no similarities to the original values.

Substitution

Data substitution replaces column data with similar but unrelated data from a provided lookup table. This hides the original data without making the masking obvious.

To use the substitution method, you will have to add a lookup table to your flow and connect it to the Mask Data step. For this post, I created a basic lookup table named DB_LOOKUP that will replace the existing DATABASE values with different database names. Note: this lookup table is very simple and handles values in the first ten rows of the CONTACTS table for the sake of the post example.

proc sql;
   create table db_lookup 
      (DATA char(20), 
      STANDARD char(20)); 
   insert into db_lookup 
      (DATA, STANDARD) 
      values("Oracle","MySQL") 
      values("MS SQL Server","PostgreSQL") 
      values("DBASE 5","Snowflake") 
      values("Filemaker Pro","Databricks") 
      values("FileMaker Pro","Databricks") 
      values("oracle","MySQL"); 
quit;

I'll update my flow to add DB_LOOKUP as another input to the Mask Data step.

Under Additional Data Obfuscation, I'll select the Substitution obfuscation method. Then, I'll select the key column (DATA) and value column (STANDARD) from the lookup table. I'll finish by selecting DATABASE as the substitute column from the original table. I'll also choose what to do in the event of a missing key: I want to return the original value, but you could also choose to return a missing value. Other output column options remain the same.

The results show that the database names were accurately substituted and flagged based on lookup table values. Note that row 5 has a missing value, so there was no change and the flag is 0.

Summary

In this post, I've demonstrated how to use the Mask Data step to mask, hash, and substitute sensitive data in a table. If you want to learn more about transforming your data in SAS Studio, check out some of my previous posts:

Have you ever needed to mask PII before? Would the Mask Data step help simplify your data obfuscation process? Share your thoughts, questions, and feedback below!

Find more articles from SAS Global Enablement and Learning here.

Protecting PII with the Mask Data Step

Registration is open

SAS AI and Machine Learning Courses