BookmarkSubscribeRSS Feed

Protecting PII with the Mask Data Step

Started 6 hours ago by
Modified 6 hours ago by
Views 50

Personally Identifiable Information (PII) is the term used to describe data that can identify an individual. This includes information like names, addresses, email addresses, phone numbers, and more. It goes without saying that sensitive data should be protected from any potential bad actors to avoid problems like identity theft or fraud in the event of a data security breach. How can you protect sensitive data efficiently?

 

In this post, I'll show you how to use the Mask Data step in SAS Studio to hide sensitive information with ease. I'll cover three different data obfuscation methods available with this step: masking, hashing, and substitution. Keep reading to learn how to protect your data with only a few clicks!

 

The Mask Data Step

 

With the Mask Data step, you can mask sensitive data in SAS Studio Flows using three different obfuscation methods: masking, hashing, or substitution.

 

01_grbarn_mask_1.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

You can choose which method best fits with your data privacy needs or your original data's sensitivity level. The details of each method will be discussed in further detail in their respective sections. Each node can generate up to 10 masked data columns (with any assortment/combination of methods) for one input data set. Visit the documentation for more information on step capabilities.

 

The Mask Data step is available with the SAS Studio Analyst license.

 

Scenario

 

In this post, I'll be using the CONTACTS sample table from the SASDQREF library. This table contains contact information for our customers, including their company, names, addresses, phone numbers, and more.

 

02_grbarn_mask_2.png

 

We'll use the Mask Data step to hide some of this sensitive information.

 

03_grbarn_mask_3.png

 

Masking

 

Data masking uses QKB standardization definitions to replace specific characters with asterisks. If you need to do simple masking, several general use definitions are available such as Mask All Characters, Mask All Characters Except First and Last, and Mask Digits. Definitions are also available for more specific tasks, like Mask Address Except Street Type, Mask E-mail Domain, and Mask E-mail Mailbox.

 

04_grbarn_mask_4.png

 

I'll use the masking method on CONTACT and ADDRESS values.

 

05_grbarn_mask_5.png

 

First, select a QKB locale on the Data Obfuscation tab. The default is the global locale, which contains definitions that can be applied to any data, regardless of locale. To get more specialized masking definitions (such as address masking), select a specific locale. I'll choose English (United States).

 

06_grbarn_mask_6-1024x538.png

 

After selecting masking for the obfuscation method, select the appropriate data masking definition and mask column. I'll choose to Mask All Characters Except First and Last for the CONTACT column. I'll also choose to replace the existing column instead of creating a new column.

 

Under Additional Data Obfuscation, I'll add another Masking type obfuscation which uses the Mask Address Except Street Type definition on ADDRESS.

 

07_grbarn_mask_7-1024x409.png

 

The results show that the respective mask definitions were applied appropriately. The majority of characters were masked using asterisks, while address numbers were masked with zeros. Additionally, the output includes a flag column for each transformation by default, displaying 1 if the column was transformed and 0 if not.

 

Hashing

 

Data hashing uses hashing algorithms to transform distinct values into random, nondescript character strings. The Mask Data step supports two types of hash algorithm: the Secure Hash Algorithm (SHA) and the Cyclic Redundancy Check (CRC). Regardless of the selected method, the output will be hashed and indistinguishable from its original presentation.

 

08_grbarn_mask_8.png

 

I'll update my flow to use the hashing method on the COMPANY column.

 

09_grbarn_mask_9-1024x429.png

 

Under Additional Data Obfuscation, I'll select the Hashing obfuscation method, the SHA256 algorithm, and the COMPANY column.

 

10_grbarn_mask_10.png

 

The results show that each company name has been hashed into a 256-bit string which has no similarities to the original values.

 

Substitution

 

Data substitution replaces column data with similar but unrelated data from a provided lookup table. This hides the original data without making the masking obvious.

 

11_grbarn_mask_11.png

 

To use the substitution method, you will have to add a lookup table to your flow and connect it to the Mask Data step. For this post, I created a basic lookup table named DB_LOOKUP that will replace the existing DATABASE values with different database names. Note: this lookup table is very simple and handles values in the first ten rows of the CONTACTS table for the sake of the post example.

 

12_grbarn_mask_12.png

 

proc sql;
   create table db_lookup 
      (DATA char(20), 
      STANDARD char(20)); 
   insert into db_lookup 
      (DATA, STANDARD) 
      values("Oracle","MySQL") 
      values("MS SQL Server","PostgreSQL") 
      values("DBASE 5","Snowflake") 
      values("Filemaker Pro","Databricks") 
      values("FileMaker Pro","Databricks") 
      values("oracle","MySQL"); 
quit;

 

13_grbarn_mask_13.png

 

I'll update my flow to add DB_LOOKUP as another input to the Mask Data step.

 

14_grbarn_mask_14-1024x625.png

 

Under Additional Data Obfuscation, I'll select the Substitution obfuscation method. Then, I'll select the key column (DATA) and value column (STANDARD) from the lookup table. I'll finish by selecting DATABASE as the substitute column from the original table. I'll also choose what to do in the event of a missing key: I want to return the original value, but you could also choose to return a missing value. Other output column options remain the same.

 

15_grbarn_mask_15.png

 

The results show that the database names were accurately substituted and flagged based on lookup table values. Note that row 5 has a missing value, so there was no change and the flag is 0.

 

Summary

 

In this post, I've demonstrated how to use the Mask Data step to mask, hash, and substitute sensitive data in a table. If you want to learn more about transforming your data in SAS Studio, check out some of my previous posts:

 

 

Have you ever needed to mask PII before? Would the Mask Data step help simplify your data obfuscation process? Share your thoughts, questions, and feedback below!

 

 

Find more articles from SAS Global Enablement and Learning here.

Contributors
Version history
Last update:
6 hours ago
Updated by:

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

SAS AI and Machine Learning Courses

The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.

Get started