HI everyone,
I have a general and a SAS specific question.
I am using SAS to create and maintain a large database that different sites will contribute information to.
Crucial in this process that only de-identified information on patients will be stored in the database, and de-identification of data happens at the site before sending the data to the registry.
There are two problems here, one easy, one much harder:
1. The easy problem: is there a built in procedure or a macro that anyone knows of, that can take a number of identifiers (patient's name, date of birth, Medicare ID, etc) and then generates a unique statistical ID that can be used in the database?
2. The difficult problem: The database should be able to identify repeat admissions/ data from the same patient at different sites. So the (de-identified) ID of a patient should be shared between the sites. However, as the process of de-identification should ideally happen at the site level before the data leaves to the registry, this poses a problem of how to be able to assign the same ID to the patient across multiple sites without the site sharing the identifiers between them. Has anyone dealt with a similar problem and how have you approached it?
Thanks heaps.
You could use MD5 or eventually even better SHA to create a digest value. You will need to ensure that your data providers all use the same encoding for their source data.
You could have your data providers run a demo program like below and then check if they all send you back the same sha256 values.
data demo;
set sashelp.class;
sha256 = hashing('sha256',catx('|',name,age));
run;
proc print data=demo(drop=name age);
run;
Your data providers would then send you data like:
Now for the decoding
On your end if you get the same digest value multiple times you know that you've got duplicates.
What your data providers could do is maintain a table with the digest values and the plain text values. You then just send the digest value to them and they can look-up the plain text values (the composite key) to identify the record.
...or they can also simply run their process again and select the records that get the digest value you provide.
data demo;
set sashelp.class;
sha256 = hashing('sha256',catx('|',name,age));
if sha256='4D9D84608FD17ED1B9E4F5AF1777D5471A0290B2F94060D4AC917C4FDC6FF45E';
run;
proc print data=demo;
run;
Which SHA to choose - like SHA1, 256, 512 etc. - will depend on your storage, performance and security requirements.
You could use MD5 or eventually even better SHA to create a digest value. You will need to ensure that your data providers all use the same encoding for their source data.
You could have your data providers run a demo program like below and then check if they all send you back the same sha256 values.
data demo;
set sashelp.class;
sha256 = hashing('sha256',catx('|',name,age));
run;
proc print data=demo(drop=name age);
run;
Your data providers would then send you data like:
Now for the decoding
On your end if you get the same digest value multiple times you know that you've got duplicates.
What your data providers could do is maintain a table with the digest values and the plain text values. You then just send the digest value to them and they can look-up the plain text values (the composite key) to identify the record.
...or they can also simply run their process again and select the records that get the digest value you provide.
data demo;
set sashelp.class;
sha256 = hashing('sha256',catx('|',name,age));
if sha256='4D9D84608FD17ED1B9E4F5AF1777D5471A0290B2F94060D4AC917C4FDC6FF45E';
run;
proc print data=demo;
run;
Which SHA to choose - like SHA1, 256, 512 etc. - will depend on your storage, performance and security requirements.
Considering the number of times that I find in not terribly large data sets supposedly identical people with different birth dates from a single medical provider site, I pray for your success.
By not very large I mean fewer than 1,000 records in a year. I have seen the same "person" with two birth dates for services received on the same date at the same location and the only difference being where a physical test specimen was taken from the patient.
If you have other identfiers as you say, you might consider reducing the items used for encryption / decryption involved.
Don't even get me started on names. 😢
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.