When you need to consistently anonymize data, you need to create a lookup dataset which contains the translation, and add new observations when new items arrive with new data. The best tool for this is the hash object, as long as the lookup can fit into memory.
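To make that concrete, here is a minimal sketch of the hash-object lookup, not a definitive implementation. The data set names (incoming, work.id_lookup), the variables (cust_id, masked_id, _next) and the macro init_lookup are all made up for illustration, and the code assumes masked IDs are simply 1..n with no rows ever deleted from the lookup.

```sas
/* Hypothetical incoming data: cust_id is the column to anonymize */
data incoming;
  input cust_id $ amount;
  datalines;
A100 10
B200 20
A100 30
C300 40
;

/* Hypothetical persistent lookup table; created empty on the first run only */
%macro init_lookup;
  %if not %sysfunc(exist(work.id_lookup)) %then %do;
    data work.id_lookup;
      length cust_id $8 masked_id 8;
      call missing(cust_id, masked_id);
      stop;
    run;
  %end;
%mend init_lookup;
%init_lookup

data masked(drop=cust_id _next);
  length masked_id 8;
  if _n_ = 1 then do;
    /* Load the existing translations into memory */
    declare hash lk(dataset: 'work.id_lookup');
    lk.defineKey('cust_id');
    lk.defineData('cust_id', 'masked_id');
    lk.defineDone();
    /* Assumption: IDs issued so far are 1..n, so continue from n */
    _next = lk.num_items;
  end;
  set incoming end=last;
  /* Known key: find() fills masked_id. New key: issue the next number. */
  if lk.find() ne 0 then do;
    _next + 1;
    masked_id = _next;
    lk.add();
  end;
  /* Write the extended lookup back so later runs stay consistent */
  if last then lk.output(dataset: 'work.id_lookup');
run;
```

In a real setup the lookup would live in a permanent, access-restricted library rather than WORK, since anyone who can read it can reverse the anonymization.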
For a practical example, show us some usable example data where you need to mask one or more columns.
There's an official standard for the masking and encryption of credit card data called the Payment Card Industry Data Security Standard (PCI DSS). Here is a useful link if you want to know more: https://listings.pcisecuritystandards.org/documents/PCI_DSS-QRG-v3_2_1.pdf
Please bear in mind that masking is different from encryption. Masking just hides part or all of a data value, while encryption applies a complex algorithm to convert the data value into something completely different that cannot be easily reversed. The PCI DSS masking standard is to display only the first 6 and last 4 digits of a credit card number (which is normally 16 digits long). This is why you often see something like 1234 56** **** 1234 on printed credit card payment receipts.
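As a rough illustration of the first-6/last-4 rule, the step below masks the middle digits of a made-up card number. The variable names pan and masked are hypothetical, and the sketch assumes the number is stored as a character string without spaces.

```sas
data _null_;
  length pan masked $16;
  pan = '1234567890121234';   /* made-up number, not a real card */
  masked = pan;
  /* Overwrite everything between the first 6 and last 4 characters. */
  /* repeat('*', n) returns n+1 asterisks, hence the "- 11".         */
  if length(pan) > 10 then
    substr(masked, 7, length(pan) - 10) = repeat('*', length(pan) - 11);
  put masked=;   /* writes masked=123456******1234 to the log */
run;
```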
Sounds like you are more into tokenization/encryption rather than masking (hiding characters).
I don't know where you work, but at larger organisations, chances are that there are already functions in place to do this, as part of initiatives for creating test data or protecting production data. Maybe you could look around?
For masking, some SW vendors offer this OOTB, like in SAS Federation Server, or Snowflake to mention a few.
If you need to solve this yourself, there are functions in SAS that you could use, like the different hash functions (md5, sha256 etc). These are not format preserving (meaning you need to change your table schema and potentially the programs that use this data). If you need format preservation, I suggest looking for software that does this for you (Fortanix is one).
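A small sketch of why the hash functions are not format preserving: a 10-character value turns into a 64-character hex digest. The variable names and the sample value are made up for illustration.

```sas
data _null_;
  length acct $10 digest $64;
  acct = 'AC12345678';   /* hypothetical 10-character source value */
  /* sha256() returns 32 raw bytes; $hex64. renders them as 64 hex characters */
  digest = put(sha256(trim(acct)), $hex64.);
  len_src = lengthc(acct);
  len_digest = lengthc(digest);
  put len_src= len_digest=;   /* writes len_src=10 len_digest=64 to the log */
run;
```

Storing the digest in the original $10 column would truncate it, which is the schema change mentioned above.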
@LinusH The challenge I've always faced with masking approaches using md5/sha or similar is that the masked string very often doesn't fit into the source variable's length.
I normally use the approach @Kurt_Bremser proposes: not only is it really simple to increment a sequence number, it's also suitable for numeric variables and doesn't require any changes to variable attributes (type and length).
@Ksharp Thanks for sharing these links. Really useful if I ever have to generate alphanumeric masked strings.
...and what one of the blogs mentioned that I wouldn't have thought about for generating such strings: "Make sure there are no objectionable words in the set! " 😄
Maybe @Rick_SAS knows the way to do this.
Here is a simple example from me that adds an offset to every character.
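/* Shift each character by a position-dependent offset (0 to 4).     */
/* This only obfuscates the text; the shift is trivially reversible. */
data have;
  input have $80.;
  want = have;
  do i = 1 to length(have);
    substr(want, i, 1) = byte(rank(substr(have, i, 1)) + mod(i, 5));
  end;
  drop i;   /* loop counter is not needed in the output */
cards;
Thanks for sharing these links.
Make sure there are no objectionable words in the set!
relating to production data masking, how do people do it in SAS
;
proc print data=have; run;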
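Worth noting that an offset like this is obfuscation rather than encryption: anyone who knows the rule can undo it. The round trip below is a small sketch of that, using hypothetical variable names (original, masked, restored) and a made-up sample string.

```sas
data _null_;
  length original masked restored $40;
  original = 'production data';
  masked = original;
  /* Forward shift: same rule as the example above */
  do i = 1 to length(original);
    substr(masked, i, 1) = byte(rank(substr(original, i, 1)) + mod(i, 5));
  end;
  restored = masked;
  /* Reverse shift: subtract the same position-dependent offset */
  do i = 1 to length(original);
    substr(restored, i, 1) = byte(rank(substr(masked, i, 1)) - mod(i, 5));
  end;
  same = (restored = original);   /* 1 when the round trip recovers the text */
  put masked= / restored= / same=;
run;
```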