acordes
Rhodochrosite | Level 12

I've written my own anonymizer code. It works, but I'm sure that SAS Viya must have an option or function for this purpose.

 

data casuser.test;
input codobjet $;
infile cards;
cards;
1372G2
1382NX
190901 
1T33AX 
210102 
260404 
2CAA94 
2CBT72 
2CJT94 
2EH1K5
;
run;

proc iml;

call randseed(123);
xx= randfun({4,6}, "integer", -20, 20);

print xx;

vary=repeat("a",4)||char(t(1:4));
vary2=catx("", vary[,1], vary[,2]);
 
do i=1 to 4;
temp=j(1,1, '                                                                                       ');
do j=1 to 6;
temp=catx(" ", temp, char(xx[i,j]));
end;
mac=choosec(i, "a1", "a2", "a3", "a4");
call symputx(mac, temp);
end;
quit;

%put &a3.;


data casuser.test2;
set casuser.test(obs=10);
array roll (6) _temporary_  (&a3.);
format chiffre inv_chiffre $12.;

do i=1 to length(codobjet);
chiffre=compress(catx("", chiffre, byte(rank(substr(codobjet, i, 1))+roll(i))));
end;

do j=1 to length(codobjet);
inv_chiffre=compress(catx("", inv_chiffre, byte(rank(substr(chiffre, j, 1))-roll(j))));
end;
keep codobjet chiffre inv_chiffre;

run;

 

chiffre.png

ACCEPTED SOLUTION
SASKiwi
PROC Star

Here is how we anonymize real data keys:

ID_Anon = put(md5(cats('ID_ANON',ID)),$hex10.);

You can't decrypt this though so you need to keep a table of the real and anonymized keys. BTW the anonymized key is repeatable and unique so can be used for table joins etc. 
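
The "table of the real and anonymized keys" mentioned above could be built along the following lines. This is only a sketch: the data set names work.keys_in and work.key_map are hypothetical, and the width $hex32. (the full 16-byte MD5 digest) is a choice made here rather than the width used in the post.

```sas
/* Sketch only: work.keys_in and work.key_map are hypothetical names. */
data work.keys_in;
  input ID $;
  datalines;
1372G2
1382NX
190901
1372G2
;
run;

/* One row per distinct real key. */
proc sort data=work.keys_in out=work.key_map nodupkey;
  by ID;
run;

/* Add the anonymized key; $hex32. keeps all 16 bytes of the MD5 digest. */
data work.key_map;
  set work.key_map;
  length ID_Anon $32;
  ID_Anon = put(md5(cats('ID_ANON', ID)), $hex32.);
run;

/* Reverse lookup: recover the real key that belongs to an anonymized key. */
proc sql;
  select m.ID
    from work.key_map as m
    where m.ID_Anon = put(md5(cats('ID_ANON', '190901')), $hex32.);
quit;
```

Because the hash is deterministic, the same expression applied to any other table yields the same ID_Anon, which is what makes joins on the anonymized key possible. The mapping table itself then has to be protected like the original keys.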


7 REPLIES
sbxkoenk
SAS Super FREQ

Hello,

 

Are you trying to do anonymisation or pseudonymisation?

 

I think you want to do pseudonymisation (I haven't studied your code though) which often comes down to "string replacement".

You can use the SHA256 algorithm to replace your identifiers (or other information) with 256-bit hash values that are not human-readable. SAS has a SHA256 function!
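
A minimal sketch of that suggestion, applied to a few of the codobjet values from the question. The data set name work.sha_demo and the salt literal 'my-secret-salt' are assumptions for illustration only. A secret prefix is included because short identifiers hashed without one can be recovered by simply hashing every candidate value.

```sas
/* Sketch only: work.sha_demo and the salt value are hypothetical. */
data work.sha_demo;
  length codobjet $8 id_anon $64;
  input codobjet $;
  /* SHA256 returns a 32-byte binary string;                */
  /* $hex64. renders all 32 bytes as 64 hex characters.     */
  id_anon = put(sha256(cats('my-secret-salt', codobjet)), $hex64.);
  datalines;
1372G2
1382NX
190901
;
run;

proc print data=work.sha_demo;
run;
```

The same input always produces the same id_anon, so the result is a pseudonym rather than an anonymous value: anyone who knows the salt and the expression can recompute it.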

 

If you want something human-readable (sometimes that is easier for testing and debugging) you can replace person / subject / object names by city names or by a combination of two words in a list of 100 names / flowers / rivers / colors / seas / mountains / first-names etc. 
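
The word-pair idea could look roughly like this. It is a sketch under stated assumptions: the data set names work.ids and work.pseudo_map are hypothetical, and the two word lists are cut down to 3 entries each (9 possible pairs) to keep the example short.

```sas
/* Sketch only: hypothetical names, shortened word lists. */
data work.ids;
  input codobjet $;
  datalines;
1372G2
1382NX
190901
1T33AX
210102
;
run;

/* One row per distinct key. */
proc sort data=work.ids nodupkey;
  by codobjet;
run;

data work.pseudo_map;
  set work.ids;
  array colors[3] $10 _temporary_ ('Red' 'Green' 'Blue');
  array rivers[3] $10 _temporary_ ('Nile' 'Rhine' 'Danube');
  length pseudonym $21;
  k = _n_ - 1;                      /* 0-based row counter */
  /* Stop if there are more distinct keys than word pairs. */
  if k >= dim(colors) * dim(rivers) then do;
    put 'ERROR: more distinct keys than word pairs.';
    stop;
  end;
  /* Each k maps to exactly one (color, river) combination. */
  pseudonym = catx('-', colors[mod(k, dim(colors)) + 1],
                        rivers[mod(int(k / dim(colors)), dim(rivers)) + 1]);
  drop k;
run;
```

With two lists of 100 words each, the same scheme covers 10,000 distinct keys. Pseudonyms are handed out in sort order here, so the mapping table reveals the ordering of the real keys; shuffling the rows first would avoid that.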

 

Good luck,

Koen

ChrisNZ
Tourmaline | Level 20

Isn't trimming to 5 bytes ($hex10.) out of 16 going to create collisions?

 

SASKiwi
PROC Star

@ChrisNZ  - You are probably right. I was basing my example on a short key so should have considered the impact of longer ones.

Patrick
Opal | Level 21

@SASKiwi wrote:

@ChrisNZ  - You are probably right. I was basing my example on a short key so should have considered the impact of longer ones.


@ChrisNZ The length of the key is irrelevant; what matters is the number of rows, i.e. the number of distinct source strings. I have even seen a real collision with an "untruncated" md5() hash key, which is why I now always consider using sha256 instead of md5 as soon as row counts go into the millions.

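options ps=max;
/* Hash 10 million IDs with the truncated key and write out every collision. */
data collisions;
  length id other_id 8 id_anon $10;
  /* Hash object: truncated key -> first ID that produced it */
  dcl hash h1();
  h1.defineKey('ID_Anon');
  h1.defineData('other_id');
  h1.defineDone();

  do id=1 to 10**7;
    ID_Anon = put(md5(cats('ID_ANON',ID)),$hex10.);
    if h1.check()=0 then
      do;
        /* Key already seen: fetch the earlier ID and output the colliding pair */
        rc=h1.find();
        output;
        keep id other_id id_anon;
/*        leave;*/
      end;
    else
      do;
        /* New key: remember which ID produced it */
        other_id=id;
        rc=h1.add();
      end;
  end;

run;

proc print data=collisions;
run;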

Collisions if truncating to $hex10.

[Image: PROC PRINT output of the collisions data set]

 

Patrick
Opal | Level 21

Agree with @ChrisNZ 

It should be $hex32. for a 128-bit hash key. Any truncation increases the collision risk, which, due to the birthday paradox, is always much higher than one would intuitively assume.
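
To put rough numbers on the birthday effect, the expected number of colliding pairs among n keys drawn uniformly from a space of 2**bits values is approximately n*(n-1)/2 divided by 2**bits. The sketch below assumes 10 million distinct keys and a hash that behaves like a uniform random function; both are assumptions for illustration.

```sas
/* Sketch only: n = 10 million keys is an assumed workload. */
data _null_;
  n = 1e7;
  pairs = n * (n - 1) / 2;          /* number of distinct key pairs */
  do bits = 40, 128, 256;           /* $hex10., full MD5, full SHA256 */
    expected = pairs / 2**bits;     /* expected number of colliding pairs */
    put bits= expected= e12.;
  end;
run;
```

For 40 bits ($hex10.) this works out to roughly 45 expected colliding pairs, while for 128 and 256 bits the expected count is vanishingly small.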

Patrick
Opal | Level 21

@acordes I guess the Viya version will be very relevant. Below are two links I found that might give you some ideas.

https://www.youtube.com/watch?v=E6yVxbitC2k

If it's only about masking values in reports: https://blogs.sas.com/content/sgf/2018/03/02/is-it-sensitive-mask-it-with-data-suppression/ 
