- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I need to anonymize some data and decided to use the sha256 function to generate anonymous ID's. I made a lookup table for myself, with the original ID and the hashed ID, and an output table with only the hashed ID for external use. The point is that only I would be able to de-anonymize the data when necessary, and the externals can give me the hash so I can find the match in the lookup table. But now, when I search for one of the hashed ID's in the lookup table, it is not found, while I can see it is there. So I cannot make the match. I replicated this behaviour in a simple example:
data have;
input ID;
datalines;
62851
62852
62853
62854
;
run;
data lookup;
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
run;
data out (drop=ID);
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
run;
data find;
set lookup;
where strip(ID_ano) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
run;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In the WHERE condition, the raw values is used. Instead of using STRIP (which is dangerous, as the value could contain a leading or trailing '20'x), use PUT:
where put(ID_ano,$hex64.) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In the WHERE condition, the raw values is used. Instead of using STRIP (which is dangerous, as the value could contain a leading or trailing '20'x), use PUT:
where put(ID_ano,$hex64.) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, did it work for you? My find is still empty, not sure why though:
data find;
set lookup;
where put(ID_ano,$hex64.) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
See this complete code:
data have;
input ID;
datalines;
62851
62852
62853
62854
;
run;
data lookup;
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
put id_ano;
run;
data out (drop=ID);
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
run;
data find;
set lookup;
where put(ID_ano,$hex64.) eq "5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252";
run;
Partial log:
96 data find; 97 set lookup; 98 where put(ID_ano,$hex64.) eq "5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252"; 99 run; NOTE: There were 1 observations read from the data set WORK.LOOKUP. WHERE PUT(ID_ano, $HEX64.)='5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252'; NOTE: The data set WORK.FIND has 1 observations and 2 variables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Me: 5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252
You: 5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I added a PUT statement here:
data lookup;
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
put id_ano;
run;
and copied the string from the log.
It has been my experience that the display of binary strings across the IOM bridge (if you work with Enterprise Guide) is not always correct, as the binary value might be converted somewhere along the way, and the HEX format is applied after that on the client side. We have had issues with incorrectly displayed UUID's (which are 16 binary bytes).
The display in the log is always right, as the resulting text is not altered by the IOM bridge.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The externals will transfer the dataset to SAS VA (Viya), so I'm not sure how the ID's will be displayed there and if the format will be kept. I will ask them to give me some test ID's to see if I can still make the match using the log.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @SarahDew,
@Kurt_Bremser's suspicion of a possible character conversion seems very plausible to me.
@SarahDew wrote:
Me: 5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252
Note that the above hex string looks suspicious in three ways:
- It has length 62, hence will never match anything in $HEX64. format.
- I think a SHA256 digest typically looks like (roughly speaking) a randomly selected string from all possible 32-character strings. Under this assumption I would be very surprised to see a string in which one character occurs 14 times -- as is the case for '3F'x in your reported string.
- 3F is the hexadecimal ASCII code of the question mark '?' -- which, in turn, might be the replacement character if certain other characters cannot be (sort of) "displayed" properly.