Pseudonymization of sensitive data

Anita_n · Posted 01-04-2024 11:00 AM

I have some tables of data which I need to pseudonymise. The main table has the real name of patients in addition to their ids which is used to identify them. Since a patients can have one or more diesease, there are several records per patient ID.

There are also several sub tables containing different examinations and findings of patient like lab values etc, here too there are many records per patient.

The aim is to create a pseudonym for each patient and also clearly identify which record in the sub table belongs to which patient.

Any idea how to approach this problem, probably with an example

ballardw · Posted 01-04-2024 11:04 AM

Why use the "name" at all if you have a patient ID number? Us the ID for everything except billing or wherever appropriate to use real names.

If you have multiple users for these tables that you don't want to know that information then perhaps create views for them to use that do not have the actual name variables at all. Or isolate the names elsewhere and remove from the data.

ChrisHemedinger · Posted 01-04-2024 11:36 AM

You could replace the name with a one-way hash of the name value. Example:

data hash_class;
 length hashname $32;
 format hashname $hex64.;
 set sashelp.class;
 hashname = sha256(name);
 drop name;
run;

The hash values will be unique (per name value) but you won't be able to derive the original name from the hash. Make it even more "secure"' by adding a salt value (which would be a secret value that would make it difficult for someone to predict the hash value from an original name).

Note that a patient ID is also PII within your system (I'd guess), so anonymizing the name may not be enough for the purpose. Anyone with access to a patient database could figure out which patient is which. If the goal is to create a report that summarizes characteristics of a patient population without needing to back and identify specific patients, consider building a hash of the name and ID (which would yield a unique identifier for the patient that could not be traced back directly to a person) or maybe just hash the ID and drop the name.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!

Anita_n · Posted 01-04-2024 12:13 PM

@ChrisHemedinger @ballardw Thanks for that, am thinking of creating the pseudonym from the name and date of birth and/or id. But I will like to be able to decrypt this value again. So how do I add this salt value?

I will not be having multiple users in this case

SASKiwi · Posted 01-04-2024 03:06 PM

You can't decrypt hashed values - that's the whole point of hash techniques, that they are virtually impossible to reverse. You need to keep the original values as well as the hashed versions to be able to join them back in later if required.

A_Kh · Posted 01-04-2024 05:42 PM

There was a similar question in the community recently which I couldn't locate. I believe the accepted solution was encoding the patient id using $hex64. format , and decoding it back the same way. This might be helpful in your case as well. Below is a replicated code:

Data have;
input id $;
cards;
100011
100012
100013
;

data want; 
	set have; 
	length encoded_id decoded_id $ 200;
	encoded_id=put(id,$hex64.);
	decoded_id=input(encoded_id,$hex64.);
proc print; run;

Patrick · Posted 01-04-2024 10:04 PM

@Anita_n

It's a common scenario that one needs to mask PII columns for a Dev and TEST environment while maintaining a way for nominated people like testers to "de-mask" values so they can refer back to prod source systems for investigation.

Another requirement is often for the masked values to fit into the existing table structures - especially column type and length - which makes using SHA and similar approaches not suitable.

And the same cleartext value should always map into the same masked value.

What worked for me in the past:
1. Maintain a permanent lookup table per PII column in a protected location with two columns: cleartext_value, masked_value
2. The masked value is just a sequence number. Whenever there is a new cleartext value just create a new value pair with max sequence number plus one for the masked value.

Below sample code for masking a character variable to illustrate the approach:

options dlcreatedir;
libname secure "%sysfunc(pathname(work))\secured_folder";

%if not %sysfunc(exist(secure.name)) %then
  %do;
    data secure.name(index=(name/unique __name_masked/unique));
      stop;
      set sashelp.class(keep=name);
      __name_masked=name;
    run;
  %end;

data work.class_clear;
  set sashelp.class;
  output;
  if _n_=3 then output;
run;

data work.class_masked(drop=__:);
  if _n_=1 then
    do;
      if 0 then set sashelp.class secure.name;
      dcl hash h1(dataset:'secure.name');
      h1.defineKey('name');
      h1.defineData('name', '__name_masked');
      h1.defineDone();
      
    end;
  call missing(of _all_);

  set work.class_clear end=__last;
  if h1.find() ne 0 then 
    do;
      __name_masked=put(h1.num_items +1,f16. -l);
      __rc=h1.add();
    end;
  name=__name_masked;
  if __last then h1.output(dataset:'work.__name');
run;

proc datasets lib=work nolist nowarn;
  append base=secure.name data=work.__name nowarn;
  run;
  delete __name;
  run;
quit;

proc print data=work.class_masked;
run;

proc print data=secure.name;
run;

AhmedAl_Attar · Posted 01-04-2024 06:45 PM

Hi @Anita_n

Have a look at this paper https://support.sas.com/resources/papers/proceedings16/2500-2016-poster.pdf by my friend @AndySmith

His presentation tackled/addressed similar issue back in 2016.

Hope this helps,

Ahmed

Anita_n · Posted 01-05-2024 07:34 AM

Thankyou all for all the suggestions and examples. I will try them to find out which will work for me.

I will then give a feedback

s_lassen · Posted 01-06-2024 07:12 AM

Just did something like that. The solution I used went more or less like this:

proc sql;
  create table anonymization_id (id num,anon_id num);
  create unique index id on anonymization_id(id);
quit;

data patient1_anon(drop=id) anonymization_id;
  set patient1;
  modify anonymization_id key=id/unique nobs=n_id;
  if _iorc_ then do; /* assuming that the ID was not found */
    n_id+1;
    anon_id=n_id;
    output  anonymization_id;
    _error_=0;
    end;
  output patient1_anon;
run;

To anonymize the next patient table, just use a similar datastep (but keeping the same anonymization_id table), and you will have one table with all the anonymizations used, and anonymized versions of the patient data tables (just remember to write/copy them to a permanent library, not WORK).

If you need to anonymize the text variable NAME as well, you could just insert something like

name=cats('Dummy',anon_id);

before the second output statement, the patients' name will then be anonymized as well.

By using the same anonymization table for all the tables containing Patients' IDs, you will get the same translation from real to anonymized ID in all the tables.

ErikLund_Jensen · Posted 01-06-2024 09:37 AM

Hi @Anita_n

I will suggest the following. It works after the same principles as other suggestions with a hash value. The idea is to create pseudonym tables from "real" input tables WITHOUT keeping a table with ID's and pseudonyms.

The two set of values values don't exist together outside the running program, so there are no translation tables to maintain and protect from unautohorized access. At the same time, real identities can easily be restored from the original input tables by joining using the same ID pseudonymization in the join, but it is impossible without access to these

* Test data;
data maintable;
  ID = 12345; Name = 'Billy Nash'; Desease = 11; output;
  ID = 12345; Name = 'Billy Nash'; Desease = 17; output;
  ID = 23456; Name = 'jeff Smith'; Desease = 22; output;
  ID = 34567; Name = 'John Doe'; Desease = 33; output;
run;

* Create Pseudonym table;
* ID is converted to MD5 hash value,
* Name is converted to a defaultname + patient counter, this gives some extra coding 
    thanks to several records prt ID;
proc sort data=maintable;
  by ID Desease;
run;

data anontable (drop=ID Name pnr);
  length PseudoID $36 PseudoName $40;
  set maintable;
  by ID;
  retain pnr 0;
  if first.ID then pnr = pnr + 1;
  PseudoID = put(md5(put(ID,12.)),$hex32.);
  PseudoName = catx(' ', 'Patient', put(pnr,8.), 'Pseudoname');
run;


* Restore identities;
proc sql;
  create table restore as
    select distinct
      maintable.ID,
      maintable.Name,
      anontable.Desease
    from anontable
    left join maintable
    on put(md5(put(maintable.ID,12.)),$hex32.) = anontable.PseudoID
  ;
quit;

SASKiwi · Posted 01-06-2024 03:39 PM

@ErikLund_Jensen - This is a very cool approach! Just need to ensure the hashing technique remains the same to retrieve the original values. Also the anonymised names make it clear what has been done.

sbxkoenk · Posted 01-07-2024 07:54 AM

Home > SAS Community Nordic > SAS Nordic Users Group >
Juletip #1 - Even Santa hides his secrets – How to mask your data in SAS Studio?
Posted 12-01-2023 02:07 AM
https://communities.sas.com/t5/SAS-Community-Nordic/Juletip-1-Even-Santa-hides-his-secrets-How-to-ma...

BR, Koen

Anita_n · Posted 01-09-2024 09:14 AM

@sbxkoenk @SASKiwi @ErikLund_Jensen @s_lassen @Patrick I have tried all the samples you posted they are all very good ways to solved this problem depending on ones needs. Thankyou very much.

All the same I have a question, assuming I have an ID=12345 in the main table and the ID in the sub table is assigned 12345_c because the patient had a chemotherapy and this id belongs to the chemotherapy table. The other ID is 12345_r because of the radiotherapy treatment (This id's a have this extensions because the data comes from different centers).

My problem is if I use the hash function sha256 or md5 and use the format $hex64. I get different pseudonyms as the main ID's. The idea behinde all this is to assign both 12345_c and 12345_r to the main ID 12345

Is there any solution to this using the hash function or do I will need to define my own values for the ID's

SASKiwi · Posted 01-09-2024 02:04 PM

How about just hashing the main ID then adding the _c or _r unhashed to get the secondary IDs?

Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Re: Pseudonymization of sensitive data

Registration is open

SAS Training: Just a Click Away