@Ronein Just encoding via $hex64 doesn't cut it. Anyone who knows how you encoded the values can easily decode them.
But reading your original requirements again: So if "the other company" adds data (columns) to your table then they also need to know the customer. How else would they be capable to add the data to the right customer. If this is true then it's likely more about secure transmission of data and not about masking PII columns. Is that correct?
Note if the law is you cannot transmit those IDs then sending with such a simple encoding will probably not protect you from prosecution under the law.
I've developed some time ago a masking framework for masking PII columns in one environment and then transfer the masked tables to another environment (from PROD to DEV).
Important in this scenario was that all columns in the masked tables still had the same attributes (like data type and length). For this reason any approach using MD5, SHA etc. was not suitable.
In my case just using a sequence number for the masked data was no issue. And it's much easier to manage than using a random number because the moment you're generating the number randomly you need to also build a check that you haven't generated the same number already for another cleartext value. ...and because of the birthday paradigm the probability to do so is higher than one would intuitively assume.
If just using a sequence number the only thing you need to pay some attention to is to not use the algorithm on data that's sorted by the cleartext variable you want to mask (like a customer number that's often also "sequential").
To implement something fully dynamically that's also works for character variables with short length and alphanumerical strings takes a bit more but for your use case things can be kept rather simple.
It was for my use case important that the same cleartext value always resulted in the same masked values even when running the masking algorithm for multiple tables and tables from multiple months and at different times.
The only way this can work is to store and maintain a permanent masking lookup table with key-value pairs of cleartext and masked values.
Here what could work for you and meets all the above formulated requirements.
libname perm "%sysfunc(pathname(work))";
/* sample data */
data work.have;
do cust_id=4,1,99999,10,1,5,4;
obs_num+1;
output;
end;
run;
/* create permanent masking lookup table if not exists */
/* - key-value pairs of cleartext-masked values */
%if %sysfunc(exist(perm.cust_mask)) ne 1 %then
%do;
data perm.cust_masklookup(index=;
stop;
set work.have(keep=cust_id);
cust_id_masked=cust_id;
run;
%end;
/* create table with masked data */
data work.masked;
if _n_=1 then
do;
if 0 then set perm.cust_masklookup(rename=(cust_id_masked=__cust_id_masked));
/* permanent masking lookup table with data masked in previous runs */
dcl hash __h1(dataset:'perm.cust_masklookup(rename=(cust_id_masked=__cust_id_masked))');
__h1.defineKey('cust_id');
__h1.defineData('__cust_id_masked');
__h1.defineDone();
/* mask lookup table to collect with masked values created in this run */
dcl hash __h1_new(dataset:'perm.cust_masklookup(obs=0 rename=(cust_id_masked=__cust_id_masked))');
__h1_new.defineKey('cust_id');
__h1_new.defineData('cust_id','__cust_id_masked');
__h1_new.defineDone();
end;
call missing(of _all_);
set work.have end=last;
if __h1.find() ne 0 and __h1_new.find() ne 0 then
do;
__cust_id_masked=sum(__h1.num_items,__h1_new.num_items,76589);
__rc=__h1_new.add(key:cust_id, data:cust_id, data:__cust_id_masked);
end;
cust_id=__cust_id_masked;
/* write newly created key-value pairs of cleartext/masked values to work table */
if last then __h1_new.output(dataset:'work.cust_masklookup_new(rename=(__cust_id_masked=cust_id_masked))');
drop __:;
run;
/* append newly created masked values to permanent lookup table */
proc append base=perm.cust_masklookup data=work.cust_masklookup_new;
run;quit;
/* unmask values */
data work.un_masked;
if _n_=1 then
do;
if 0 then set perm.cust_masklookup(rename=(cust_id_masked=__cust_id_masked));
dcl hash h1(dataset:'perm.cust_masklookup(rename=(cust_id_masked=__cust_id_masked))');
h1.defineKey('__cust_id_masked');
h1.defineData('cust_id');
h1.defineDone();
end;
call missing(of _all_);
set work.masked;
if h1.find(key:cust_id) ne 0 then
do;
put 'Clear text key not found in masking lookup table';
put cust_id=;
abort;
end;
drop __:;
run;
/* print the various tables */
title 'PERM: cust_masklookup';
proc print data=perm.cust_masklookup_new;
run;
title 'have';
proc print data=work.have;
run;
title 'masked';
proc print data=work.masked;
run;
title 'un_masked';
proc print data=work.un_masked;
run;
title;
Please spend some time to understand above code before you start to ask questions about it.
Hi @Ronein
The SAS encryption algoritms will always generate the SAME KEY for a given value. This can be used in the following way:
1. Generate data to the external part with an encrypted key only.
2. Join your original and the returned data on the encrypted key, which you generate once more from the original ID in the join condition.
This way, you don't have and don't need a data set with both the original key and the encrypted key. I think that should satisfy all requirements.
* What you have;
data have;
ID = 123456;
Name = 'Ronein';
run;
* You make encrypted data to external part;
data xport; set have;
drop ID;
ID_encrypt = put(md5(cats(put(ID,12.))),$hex32.);
run;
* External part returns data with added information;
data return;
set xport;
Occupation = 'SAS Programmer';
run;
* You merge added infornmation onto orig data;
proc sql;
create table want as
select
a.ID,
a.Name,
b.Occupation
from have as a full outer join return as b
on put(md5(cats(put(a.ID,12.))),$hex32.) = b.ID_encrypt;
quit;
@ErikLund_Jensen wrote:
Hi @Ronein
The SAS encryption algoritms will always generate the SAME KEY for a given value. This can be used in the following way:
1. Generate data to the external part with an encrypted key only.
2. Join your original and the returned data on the encrypted key, which you generate once more from the original ID in the join condition.
This way, you don't have and don't need a data set with both the original key and the encrypted key. I think that should satisfy all requirements.
MD5 and SHA is ENCODING and not ENCRYPTION
Hi @Patrick
You are right, of course, and I bow in shame for being so imprecise in my wording.
But it doesn't make any difference here.
The external partner receives a non-reversible masked ID only,
there is no table containing both the original ID and the masked ID,
but he returned data can nevertheless be joined with the original data using the masked ID as key, because it can be recreated in the join-condition.
[EDIT:] (after sending this I saw that @Tom already wrote that "no encrypting" answer)
The question was already answered but the following also satisfies your 1,2, and 3 requirements:
Data have;
input cust_ID someValue $;
cards;
33948398 A
54897733 B
98376333 C
54897733 D
;
run;
proc sort data=have(keep=cust_ID) out=lookup nodupkey;
by cust_ID;
run;
data lookup;
set lookup;
masked_cust_ID + 1;
run;
proc print data=lookup;
run;
Since it is you who provides input data you have full control on the "masking" and no encryption function is needed.
All the best
Bart
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.