BookmarkSubscribeRSS Feed
jarno
SAS Employee

Have you ever run into the need of hiding something in your data? If yes, SAS Studio has been able to provide an easy solution to your problem since version 2023.07 with the Mask Data step. With the Mask Data step you can hide and obfuscate data in three different ways:

  1.  Masking is often used by companies to hide partial information, such as all the numbers in credit card identifier except the last four digits (XXXX XXXX XXXX 1234)
  2. Hashing is a one-way encryption method and prevents anyone from analysing the obfuscated data. This Hashing is the transformation of data into a short fixed-length value (known as a digest) that represents the original string. Hashing is used for example in database indexing due to performance gains in using the hashed key than using the original value. Hashing functions are typically used in file comparison and detection of file corruption and tampering. You just create a hash for file A and file B and then compare if the hash is identical for both.
  3. Substitution enables you to analyze obfuscated data without compromising data integrity. The custom step enables you to implement substitution with a lookup table. You can specify a key column and a value column for each column that you want to substitute. If needed, the original data can always be restored by querying the lookup table.

 

 

Mask Data step is located in the Transform Data section of SAS Studio steps:

01_sasstudio_steps.png

 

 

I created an example of using the Mask Data step to explore all three variations of data obfuscation.

02_mask_01.png

 

 

 

 

 

 

 

 

 

 

Masking

 

There are a lot of masking definitions to choose from, and you can see them all explained here in SAS documentation:

02_mask_021.png

 

 

 

 

 

 

 

 

 

In my first swimlane (btw. Swimlanes are very handy to define the running order in SAS Studio) I run mask data step with a definition called Mask Partial String. This works well with number fields such as the PHONE field in my sample data and would work great also for a credit card number as it will mask all but the four last digits of a string. It is very easy to configure as seen in the picture below:

02_mask_02.png

 

 

 

 

 

 

 

 

 

 

 

 

 

Note that I chose to create a new column instead of replacing the existing column.

When running it, the result is easy to see in the data that all but last four digits have been replaced and obfuscated.

02_mask_03.png

 

 

 

 

 

 

 

Hashing

 

My second example is about hashing ie. Encrypting the data using using a one-way non-reversible cipher, such as CRC32 or SHA256.

03_mask_01.png

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

In the image we select Hashing as type and then choose the CONTACT column as target. I’m running both CRC32 and SHA256 algorithms to show the difference between the two methods.

03_mask_02.png

 

 

 

 

 

 

 

 

Hashing with a more complex algorithm like SHA produces a longer hash as result that will take more space in storage. Use the one that suits your use case best.

 

Substitution

 

The third example uses substitution as method and requires a lookup table to be used for the conversion. I use a simple lookup that contains only the given name of the CONTACT thus replacing the full name with only the given name.

 

04_substitute_01.png

Substitution uses key matching to find the matching rows between source and lookup – it does not use the QKB at all. Mask data step by default only has one input port, but when linking up another table it will add a separate port for the lookup table. If you want to make sure that you got the correct port you can use the Expand feature to actually show the port names.

04_substitute_04.png

Substitution requires three columns to be mapped to the masking step. First two are from the lookup table: 1) Key column is the one that matches your target data and 2) Value column is the value that will substitute the original value. Third column Substitute obviously is the target column that will be substituted with the lookup value.

 

04_substitute_02.png

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Note that here I’m ok with replacing the original value with the substitute. As we want to remove some information we only keep the first name of the CONTACT:

04_substitute_03.png

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I understand that these are not overly complicated operations to achieve by writing code and using SAS functions. I personally like the Mask Data step for the ease of use and the fact that it offers so many variations of data masking. By using the Mask Data step, Santa’s secrets will be safe… at least until the big night 😄

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Discussion stats
  • 0 replies
  • 399 views
  • 10 likes
  • 1 in conversation