I have a set of data that need to be anonymised. I created a hash for each ID, which I then turn into a string. To be safe I set the string to length 200, but this was not very user friendly. -EDIT- The long string is not user-friendly because the first 5 columns containing anonymised data are either not completely visible, or are very wide when you make the content visible. It also becomes difficult to eyeball if two values are the same or not. In general, it is also not very appealing for the end-user to work with these daunting long strings. It is not so much an issue with the storage size of the dataset. -EDIT- I then saw that even with length 10 I could get away with it and still have a unique string for each original value (around 5 million IDs). Except for one case where two different IDs resulted in the same string. Setting the string length to 15 solved this issue. Now I wonder if there is a way to know what I should set the minimum length to, to be safe, also when future records are added? For example, for the below data, a string length of 4 is too short, while 5 would be enough. Is there a way to determine this minimum of 5 based on the input? /*Sample data*/
data have (keep=ID);
length ID $11.;
call streaminit(123);
Min = 10000000000; Max = 99999999999;
do i = 1 to 1000;
u = rand("Uniform");
ID = min + floor((1+Max-Min)*u);
output;
end;
;run;
/*Create hash and string*/
data want;
format ID $11. ID_hash $hex64. ID_short_string $4. ID_long_string $5. ID_not_user_friendly $200.;
set have;
ID_hash = sha256(cat(ID));
ID_short_string = put(ID_hash,$hex64.);
ID_long_string = put(ID_hash,$hex64.);
ID_not_user_friendly = put(ID_hash,$hex64.);
run;
/*Check if equal unique values*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(ID_short_string)) as N_ID_Short
, count(distinct(ID_long_string)) as N_ID_Long
from want;
quit;
... View more