I have a set of data that need to be anonymised. I created a hash for each ID, which I then turn into a string. To be safe I set the string to length 200, but this was not very user friendly.
-EDIT-
The long string is not user-friendly because the first 5 columns containing anonymised data are either not completely visible, or are very wide when you make the content visible. It also becomes difficult to eyeball if two values are the same or not. In general, it is also not very appealing for the end-user to work with these daunting long strings. It is not so much an issue with the storage size of the dataset.
-EDIT-
I then saw that even with length 10 I could get away with it and still have a unique string for each original value (around 5 million IDs). Except for one case where two different IDs resulted in the same string. Setting the string length to 15 solved this issue.
Now I wonder if there is a way to know what I should set the minimum length to, to be safe, also when future records are added?
For example, for the below data, a string length of 4 is too short, while 5 would be enough. Is there a way to determine this minimum of 5 based on the input?
/*Sample data*/
data have (keep=ID);
length ID $11.;
call streaminit(123);
Min = 10000000000; Max = 99999999999;
do i = 1 to 1000;
u = rand("Uniform");
ID = min + floor((1+Max-Min)*u);
output;
end;
;run;
/*Create hash and string*/
data want;
format ID $11. ID_hash $hex64. ID_short_string $4. ID_long_string $5. ID_not_user_friendly $200.;
set have;
ID_hash = sha256(cat(ID));
ID_short_string = put(ID_hash,$hex64.);
ID_long_string = put(ID_hash,$hex64.);
ID_not_user_friendly = put(ID_hash,$hex64.);
run;
/*Check if equal unique values*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(ID_short_string)) as N_ID_Short
, count(distinct(ID_long_string)) as N_ID_Long
from want;
quit;
How many distinct ID's do you need to encode? If I wanted to keep the displayed length of an encoded value to a minimum while guaranteeing a clear distinction, I would work with a simple running count, a Zx. format that is just long enough, and keep a lookup table that is updated with every new encoding:
data have;
input ID $;
datalines;
ABC
DEF
GHI
;
data lookup;
length ID $8 enc 8;
stop;
run;
%macro encode(inds,outds);
data &outds.;
set &inds. end=done;
if 0 then set lookup nobs=maxnum;
retain maxkey;
if _N_ = 1
then do;
maxkey = maxnum;
declare hash l1 (dataset:"lookup");
l1.definekey("ID");
l1.definedata("ID","enc");
l1.definedone();
declare hash l2 (dataset:"lookup");
l2.definekey("enc");
l2.definedata("enc");
l2.definedone();
end;
if l1.find() ne 0
then do;
do until (l2.check(key:maxkey) ne 0);
maxkey + 1;
end;
enc = maxkey;
rc = l1.add();
rc = l2.add();
end;
if done then rc = l1.output(dataset:"lookup_new");
drop rc maxkey;
run;
proc sql noprint;
select int(log(nobs)) + 1 into :formlength trimmed
from dictionary.tables
where libname = "WORK" and memname = upcase("&outds.");
quit;
proc datasets lib=work nolist;
delete lookup;
change lookup_new=lookup;
modify &outds.;
format enc z&formlength..;
quit;
%mend;
%encode(have,encoded1);
data have2;
set have end=done;
output;
if _n_ = 1 /* add a random observation */
then do;
ID = "JKL";
output;
end;
run;
%encode(have2,encoded2);
I would not do it. Set the length of the hash string to 32, format it with $hex64., and stay with it.
As soon as your method throws a duplicate, you have to search how many bytes you need, and have to recreate all your data with the longer hash. The few bytes you save are not worth the hassle.
It might help to define exactly what "not user friendly" means and how it is impacting your work.
If it's only the display width you are concerned about, use a LENGTH of 32, but a HEX format with a shorter display length.
If I understand correctly you propose to apply the following (given the example data above).
That does look appealing from a user perspective, especially because I get only numbers in this string which they are used to from the original ID. But do I not run into the same issue of how long the display length should minimally be, to make sure each unique ID is also displayed with a unique anonimised string?
/*use a LENGTH of 32, but a HEX format with a shorter display length*/
data want;
length ID_string $32;
format ID $11. ID_hash $hex64. ID_string $hex10.;
set have;
ID_hash = sha256(cat(ID));
ID_string = put(ID_hash,$hex64.);
run;
/*Check if displayed values are unique*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(put(ID_string,$hex5.))) as N_ID_S
, count(distinct(put(ID_string,$hex10.))) as N_ID_L
from want;
quit;
How many distinct ID's do you need to encode? If I wanted to keep the displayed length of an encoded value to a minimum while guaranteeing a clear distinction, I would work with a simple running count, a Zx. format that is just long enough, and keep a lookup table that is updated with every new encoding:
data have;
input ID $;
datalines;
ABC
DEF
GHI
;
data lookup;
length ID $8 enc 8;
stop;
run;
%macro encode(inds,outds);
data &outds.;
set &inds. end=done;
if 0 then set lookup nobs=maxnum;
retain maxkey;
if _N_ = 1
then do;
maxkey = maxnum;
declare hash l1 (dataset:"lookup");
l1.definekey("ID");
l1.definedata("ID","enc");
l1.definedone();
declare hash l2 (dataset:"lookup");
l2.definekey("enc");
l2.definedata("enc");
l2.definedone();
end;
if l1.find() ne 0
then do;
do until (l2.check(key:maxkey) ne 0);
maxkey + 1;
end;
enc = maxkey;
rc = l1.add();
rc = l2.add();
end;
if done then rc = l1.output(dataset:"lookup_new");
drop rc maxkey;
run;
proc sql noprint;
select int(log(nobs)) + 1 into :formlength trimmed
from dictionary.tables
where libname = "WORK" and memname = upcase("&outds.");
quit;
proc datasets lib=work nolist;
delete lookup;
change lookup_new=lookup;
modify &outds.;
format enc z&formlength..;
quit;
%mend;
%encode(have,encoded1);
data have2;
set have end=done;
output;
if _n_ = 1 /* add a random observation */
then do;
ID = "JKL";
output;
end;
run;
%encode(have2,encoded2);
You could modify my code to use a random integer value in the range, and derive the range from nobs.
It is unnecessary.
5 is proper for your demo data. However, like you said, records will add in the future. We could not sure if 5 is proper in that time.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.