Solved: Hash (hex64) to string: minimum string length

SarahDew · Posted 02-18-2021 08:24 AM

I have a set of data that need to be anonymised. I created a hash for each ID, which I then turn into a string. To be safe I set the string to length 200, but this was not very user friendly.

-EDIT-

The long string is not user-friendly because the first 5 columns containing anonymised data are either not completely visible, or are very wide when you make the content visible. It also becomes difficult to eyeball if two values are the same or not. In general, it is also not very appealing for the end-user to work with these daunting long strings. It is not so much an issue with the storage size of the dataset.

-EDIT-

I then saw that even with length 10 I could get away with it and still have a unique string for each original value (around 5 million IDs). Except for one case where two different IDs resulted in the same string. Setting the string length to 15 solved this issue.

Now I wonder if there is a way to know what I should set the minimum length to, to be safe, also when future records are added?

For example, for the below data, a string length of 4 is too short, while 5 would be enough. Is there a way to determine this minimum of 5 based on the input?

/*Sample data*/
data have (keep=ID);
length ID $11.;
call streaminit(123);
Min = 10000000000; Max = 99999999999;
do i = 1 to 1000;
	u = rand("Uniform");
   ID = min + floor((1+Max-Min)*u);
   output;
end;
;run;

/*Create hash and string*/
data want;
format ID $11. ID_hash $hex64. ID_short_string $4. ID_long_string $5. ID_not_user_friendly $200.;
	set have;
ID_hash = sha256(cat(ID));
ID_short_string = put(ID_hash,$hex64.);
ID_long_string = put(ID_hash,$hex64.);
ID_not_user_friendly = put(ID_hash,$hex64.);
run;

/*Check if equal unique values*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(ID_short_string)) as N_ID_Short
, count(distinct(ID_long_string)) as N_ID_Long
from want;
quit;

Kurt_Bremser · Posted 02-24-2021 09:41 AM

How many distinct ID's do you need to encode? If I wanted to keep the displayed length of an encoded value to a minimum while guaranteeing a clear distinction, I would work with a simple running count, a Zx. format that is just long enough, and keep a lookup table that is updated with every new encoding:

data have;
input ID $;
datalines;
ABC
DEF
GHI
;

data lookup;
length ID $8 enc 8;
stop;
run;

%macro encode(inds,outds);
data &outds.;
set &inds. end=done;
if 0 then set lookup nobs=maxnum;
retain maxkey;
if _N_ = 1
then do;
  maxkey = maxnum;
  declare hash l1 (dataset:"lookup");
  l1.definekey("ID");
  l1.definedata("ID","enc");
  l1.definedone();
  declare hash l2 (dataset:"lookup");
  l2.definekey("enc");
  l2.definedata("enc");
  l2.definedone();
end;
if l1.find() ne 0
then do; 
  do until (l2.check(key:maxkey) ne 0);
    maxkey + 1;
  end;
  enc = maxkey;
  rc = l1.add();
  rc = l2.add();
end;
if done then rc = l1.output(dataset:"lookup_new");
drop rc maxkey;
run;

proc sql noprint;
select int(log(nobs)) + 1 into :formlength trimmed
from dictionary.tables
where libname = "WORK" and memname = upcase("&outds.");
quit;

proc datasets lib=work nolist;
delete lookup;
change lookup_new=lookup;
modify &outds.;
format enc z&formlength..;
quit;

%mend;

%encode(have,encoded1);

data have2;
set have end=done;
output;
if _n_ = 1 /* add a random observation */
then do;
  ID = "JKL";
  output;
end;
run;

%encode(have2,encoded2);

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 02-18-2021 08:49 AM

I would not do it. Set the length of the hash string to 32, format it with $hex64., and stay with it.

As soon as your method throws a duplicate, you have to search how many bytes you need, and have to recreate all your data with the longer hash. The few bytes you save are not worth the hassle.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 02-18-2021 01:37 PM

It might help to define exactly what "not user friendly" means and how it is impacting your work.

SarahDew · Posted 02-19-2021 03:38 AM

Good point, see edit.

Kurt_Bremser · Posted 02-19-2021 03:58 AM

If it's only the display width you are concerned about, use a LENGTH of 32, but a HEX format with a shorter display length.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SarahDew · Posted 02-24-2021 08:52 AM

If I understand correctly you propose to apply the following (given the example data above).

That does look appealing from a user perspective, especially because I get only numbers in this string which they are used to from the original ID. But do I not run into the same issue of how long the display length should minimally be, to make sure each unique ID is also displayed with a unique anonimised string?

/*use a LENGTH of 32, but a HEX format with a shorter display length*/
data want;
length ID_string $32;
format ID $11. ID_hash $hex64. ID_string $hex10.;
	set have;
ID_hash = sha256(cat(ID));
ID_string = put(ID_hash,$hex64.);
run;

/*Check if displayed values are unique*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(put(ID_string,$hex5.))) as N_ID_S
, count(distinct(put(ID_string,$hex10.))) as N_ID_L
from want;
quit;

Kurt_Bremser · Posted 02-24-2021 09:41 AM

How many distinct ID's do you need to encode? If I wanted to keep the displayed length of an encoded value to a minimum while guaranteeing a clear distinction, I would work with a simple running count, a Zx. format that is just long enough, and keep a lookup table that is updated with every new encoding:

data have;
input ID $;
datalines;
ABC
DEF
GHI
;

data lookup;
length ID $8 enc 8;
stop;
run;

%macro encode(inds,outds);
data &outds.;
set &inds. end=done;
if 0 then set lookup nobs=maxnum;
retain maxkey;
if _N_ = 1
then do;
  maxkey = maxnum;
  declare hash l1 (dataset:"lookup");
  l1.definekey("ID");
  l1.definedata("ID","enc");
  l1.definedone();
  declare hash l2 (dataset:"lookup");
  l2.definekey("enc");
  l2.definedata("enc");
  l2.definedone();
end;
if l1.find() ne 0
then do; 
  do until (l2.check(key:maxkey) ne 0);
    maxkey + 1;
  end;
  enc = maxkey;
  rc = l1.add();
  rc = l2.add();
end;
if done then rc = l1.output(dataset:"lookup_new");
drop rc maxkey;
run;

proc sql noprint;
select int(log(nobs)) + 1 into :formlength trimmed
from dictionary.tables
where libname = "WORK" and memname = upcase("&outds.");
quit;

proc datasets lib=work nolist;
delete lookup;
change lookup_new=lookup;
modify &outds.;
format enc z&formlength..;
quit;

%mend;

%encode(have,encoded1);

data have2;
set have end=done;
output;
if _n_ = 1 /* add a random observation */
then do;
  ID = "JKL";
  output;
end;
run;

%encode(have2,encoded2);

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SarahDew · Posted 03-04-2021 06:01 AM

Thanks, that looks like a really nice approach. I see it also gives the same encoded value if the same value is added, which can be the case in my data. I have around 5 million unique IDs for one of the ID columns so there I will use Z7.

Just wondering if it is equally safe as a hash value if the initial counts are given to an ordered ID (alphabetically for strings or ascending numbers). If you figure out the first count it might be easier to guess who the next count belongs to. It might be best to make sure the original IDs are ordered randomly?

Kurt_Bremser · Posted 03-04-2021 06:30 AM

You could modify my code to use a random integer value in the range, and derive the range from nobs.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

whymath · Posted 02-19-2021 04:15 AM

It is unnecessary.

5 is proper for your demo data. However, like you said, records will add in the future. We could not sure if 5 is proper in that time.

Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Re: Hash (hex64) to string: minimum string length

Classroom Training Available!