BookmarkSubscribeRSS Feed
SAS Employee

Have you ever run into the need of hiding something in your data? If yes, SAS Studio has been able to provide an easy solution to your problem since version 2023.07 with the Mask Data step. With the Mask Data step you can hide and obfuscate data in three different ways:

  1.  Masking is often used by companies to hide partial information, such as all the numbers in credit card identifier except the last four digits (XXXX XXXX XXXX 1234)
  2. Hashing is a one-way encryption method and prevents anyone from analysing the obfuscated data. This Hashing is the transformation of data into a short fixed-length value (known as a digest) that represents the original string. Hashing is used for example in database indexing due to performance gains in using the hashed key than using the original value. Hashing functions are typically used in file comparison and detection of file corruption and tampering. You just create a hash for file A and file B and then compare if the hash is identical for both.
  3. Substitution enables you to analyze obfuscated data without compromising data integrity. The custom step enables you to implement substitution with a lookup table. You can specify a key column and a value column for each column that you want to substitute. If needed, the original data can always be restored by querying the lookup table.



Mask Data step is located in the Transform Data section of SAS Studio steps:




I created an example of using the Mask Data step to explore all three variations of data obfuscation.














There are a lot of masking definitions to choose from, and you can see them all explained here in SAS documentation:











In my first swimlane (btw. Swimlanes are very handy to define the running order in SAS Studio) I run mask data step with a definition called Mask Partial String. This works well with number fields such as the PHONE field in my sample data and would work great also for a credit card number as it will mask all but the four last digits of a string. It is very easy to configure as seen in the picture below:















Note that I chose to create a new column instead of replacing the existing column.

When running it, the result is easy to see in the data that all but last four digits have been replaced and obfuscated.











My second example is about hashing ie. Encrypting the data using using a one-way non-reversible cipher, such as CRC32 or SHA256.




















In the image we select Hashing as type and then choose the CONTACT column as target. I’m running both CRC32 and SHA256 algorithms to show the difference between the two methods.










Hashing with a more complex algorithm like SHA produces a longer hash as result that will take more space in storage. Use the one that suits your use case best.




The third example uses substitution as method and requires a lookup table to be used for the conversion. I use a simple lookup that contains only the given name of the CONTACT thus replacing the full name with only the given name.



Substitution uses key matching to find the matching rows between source and lookup – it does not use the QKB at all. Mask data step by default only has one input port, but when linking up another table it will add a separate port for the lookup table. If you want to make sure that you got the correct port you can use the Expand feature to actually show the port names.


Substitution requires three columns to be mapped to the masking step. First two are from the lookup table: 1) Key column is the one that matches your target data and 2) Value column is the value that will substitute the original value. Third column Substitute obviously is the target column that will be substituted with the lookup value.


























Note that here I’m ok with replacing the original value with the substitute. As we want to remove some information we only keep the first name of the CONTACT:



















I understand that these are not overly complicated operations to achieve by writing code and using SAS functions. I personally like the Mask Data step for the ease of use and the fact that it offers so many variations of data masking. By using the Mask Data step, Santa’s secrets will be safe… at least until the big night 😄


Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.


Register now!

Discussion stats
  • 0 replies
  • 1 in conversation