BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SarahDew
Obsidian | Level 7

I have a set of data that need to be anonymised. I created a hash for each ID, which I then turn into a string. To be safe I set the string to length 200, but this was not very user friendly.

-EDIT-

The long string is not user-friendly because the first 5 columns containing anonymised data are either not completely visible, or are very wide when you make the content visible. It also becomes difficult to eyeball if two values are the same or not. In general, it is also not very appealing for the end-user to work with these daunting long strings. It is not so much an issue with the storage size of the dataset.

-EDIT-

 

I then saw that even with length 10 I could get away with it and still have a unique string for each original value (around 5 million IDs). Except for one case where two different IDs resulted in the same string. Setting the string length to 15 solved this issue.

 

Now I wonder if there is a way to know what I should set the minimum length to, to be safe, also when future records are added?

 

For example, for the below data, a string length of 4 is too short, while 5 would be enough. Is there a way to determine this minimum of 5 based on the input?

 

/*Sample data*/
data have (keep=ID);
length ID $11.;
call streaminit(123);
Min = 10000000000; Max = 99999999999;
do i = 1 to 1000;
	u = rand("Uniform");
   ID = min + floor((1+Max-Min)*u);
   output;
end;
;run;

/*Create hash and string*/
data want;
format ID $11. ID_hash $hex64. ID_short_string $4. ID_long_string $5. ID_not_user_friendly $200.;
	set have;
ID_hash = sha256(cat(ID));
ID_short_string = put(ID_hash,$hex64.);
ID_long_string = put(ID_hash,$hex64.);
ID_not_user_friendly = put(ID_hash,$hex64.);
run;

/*Check if equal unique values*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(ID_short_string)) as N_ID_Short
, count(distinct(ID_long_string)) as N_ID_Long
from want;
quit;

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

How many distinct ID's do you need to encode? If I wanted to keep the displayed length of an encoded value to a minimum while guaranteeing a clear distinction, I would work with a simple running count, a Zx. format that is just long enough, and keep a lookup table that is updated with every new encoding:

data have;
input ID $;
datalines;
ABC
DEF
GHI
;

data lookup;
length ID $8 enc 8;
stop;
run;

%macro encode(inds,outds);
data &outds.;
set &inds. end=done;
if 0 then set lookup nobs=maxnum;
retain maxkey;
if _N_ = 1
then do;
  maxkey = maxnum;
  declare hash l1 (dataset:"lookup");
  l1.definekey("ID");
  l1.definedata("ID","enc");
  l1.definedone();
  declare hash l2 (dataset:"lookup");
  l2.definekey("enc");
  l2.definedata("enc");
  l2.definedone();
end;
if l1.find() ne 0
then do; 
  do until (l2.check(key:maxkey) ne 0);
    maxkey + 1;
  end;
  enc = maxkey;
  rc = l1.add();
  rc = l2.add();
end;
if done then rc = l1.output(dataset:"lookup_new");
drop rc maxkey;
run;

proc sql noprint;
select int(log(nobs)) + 1 into :formlength trimmed
from dictionary.tables
where libname = "WORK" and memname = upcase("&outds.");
quit;

proc datasets lib=work nolist;
delete lookup;
change lookup_new=lookup;
modify &outds.;
format enc z&formlength..;
quit;

%mend;

%encode(have,encoded1);

data have2;
set have end=done;
output;
if _n_ = 1 /* add a random observation */
then do;
  ID = "JKL";
  output;
end;
run;

%encode(have2,encoded2);

View solution in original post

9 REPLIES 9
Kurt_Bremser
Super User

I would not do it. Set the length of the hash string to 32, format it with $hex64., and stay with it.

As soon as your method throws a duplicate, you have to search how many bytes you need, and have to recreate all your data with the longer hash. The few bytes you save are not worth the hassle.

ballardw
Super User

It might help to define exactly what "not user friendly" means and how it is impacting your work.

 

SarahDew
Obsidian | Level 7
Good point, see edit.
SarahDew
Obsidian | Level 7

If I understand correctly you propose to apply the following (given the example data above).

 

That does look appealing from a user perspective, especially because I get only numbers in this string which they are used to from the original ID. But do I not run into the same issue of how long the display length should minimally be, to make sure each unique ID is also displayed with a unique anonimised string?

 

/*use a LENGTH of 32, but a HEX format with a shorter display length*/
data want;
length ID_string $32;
format ID $11. ID_hash $hex64. ID_string $hex10.;
	set have;
ID_hash = sha256(cat(ID));
ID_string = put(ID_hash,$hex64.);
run;

/*Check if displayed values are unique*/
proc sql;
create table counts as
select count(distinct(ID)) as n_ID
, count(distinct(put(ID_string,$hex5.))) as N_ID_S
, count(distinct(put(ID_string,$hex10.))) as N_ID_L
from want;
quit;

 

 

 

Kurt_Bremser
Super User

How many distinct ID's do you need to encode? If I wanted to keep the displayed length of an encoded value to a minimum while guaranteeing a clear distinction, I would work with a simple running count, a Zx. format that is just long enough, and keep a lookup table that is updated with every new encoding:

data have;
input ID $;
datalines;
ABC
DEF
GHI
;

data lookup;
length ID $8 enc 8;
stop;
run;

%macro encode(inds,outds);
data &outds.;
set &inds. end=done;
if 0 then set lookup nobs=maxnum;
retain maxkey;
if _N_ = 1
then do;
  maxkey = maxnum;
  declare hash l1 (dataset:"lookup");
  l1.definekey("ID");
  l1.definedata("ID","enc");
  l1.definedone();
  declare hash l2 (dataset:"lookup");
  l2.definekey("enc");
  l2.definedata("enc");
  l2.definedone();
end;
if l1.find() ne 0
then do; 
  do until (l2.check(key:maxkey) ne 0);
    maxkey + 1;
  end;
  enc = maxkey;
  rc = l1.add();
  rc = l2.add();
end;
if done then rc = l1.output(dataset:"lookup_new");
drop rc maxkey;
run;

proc sql noprint;
select int(log(nobs)) + 1 into :formlength trimmed
from dictionary.tables
where libname = "WORK" and memname = upcase("&outds.");
quit;

proc datasets lib=work nolist;
delete lookup;
change lookup_new=lookup;
modify &outds.;
format enc z&formlength..;
quit;

%mend;

%encode(have,encoded1);

data have2;
set have end=done;
output;
if _n_ = 1 /* add a random observation */
then do;
  ID = "JKL";
  output;
end;
run;

%encode(have2,encoded2);
SarahDew
Obsidian | Level 7
Thanks, that looks like a really nice approach. I see it also gives the same encoded value if the same value is added, which can be the case in my data. I have around 5 million unique IDs for one of the ID columns so there I will use Z7.

Just wondering if it is equally safe as a hash value if the initial counts are given to an ordered ID (alphabetically for strings or ascending numbers). If you figure out the first count it might be easier to guess who the next count belongs to. It might be best to make sure the original IDs are ordered randomly?
whymath
Lapis Lazuli | Level 10

It is unnecessary.  

5 is proper for your demo data. However, like you said, records will add in the future. We could not sure if 5 is proper in that time.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1567 views
  • 0 likes
  • 4 in conversation