BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SarahDew
Obsidian | Level 7

I need to anonymize some data and decided to use the sha256 function to generate anonymous ID's. I made a lookup table for myself, with the original ID and the hashed ID, and an output table with only the hashed ID for external use. The point is that only I would be able to de-anonymize the data when necessary, and the externals can give me the hash so I can find the match in the lookup table. But now, when I search for one of the hashed ID's in the lookup table, it is not found, while I can see it is there. So I cannot make the match. I replicated this behaviour in a simple example:  

 

data have;
input ID;
datalines;
62851
62852
62853
62854
;
run;

data lookup;
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
run;

data out (drop=ID);
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
run;

data find;
set lookup;
where strip(ID_ano) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
run;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

In the WHERE condition, the raw values is used. Instead of using STRIP (which is dangerous, as the value could contain a leading or trailing '20'x), use PUT:

where put(ID_ano,$hex64.) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

In the WHERE condition, the raw values is used. Instead of using STRIP (which is dangerous, as the value could contain a leading or trailing '20'x), use PUT:

where put(ID_ano,$hex64.) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
SarahDew
Obsidian | Level 7

Thanks, did it work for you? My find is still empty, not sure why though:

 

data find;
set lookup;
where put(ID_ano,$hex64.) eq "5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252";
run;
Kurt_Bremser
Super User

See this complete code:

data have;
input ID;
datalines;
62851
62852
62853
62854
;
run;

data lookup;
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
put id_ano;
run;

data out (drop=ID);
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
run;

data find;
set lookup;
where put(ID_ano,$hex64.) eq "5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252";
run;

Partial log:

 96         data find;
 97         set lookup;
 98         where put(ID_ano,$hex64.) eq "5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252";
 99         run;
 
 NOTE: There were 1 observations read from the data set WORK.LOOKUP.
       WHERE PUT(ID_ano, $HEX64.)='5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252';
 NOTE: The data set WORK.FIND has 1 observations and 2 variables.
SarahDew
Obsidian | Level 7
It works, but I see you use a different hash string in the where clause, that I don't see in the lookup or output table. Also the actual ID_ano in find is different from the one you specified. Where did you get this value?

Me: 5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252
You: 5A3558265A67E5B4FF5E2B91A28D52330863D846D473A28C82F7D0188CC23252
Kurt_Bremser
Super User

I added a PUT statement here:

data lookup;
set have;
format ID_ano $hex64.;
ID_ano = sha256(ID);
put id_ano;
run;

and copied the string from the log.

It has been my experience that the display of binary strings across the IOM bridge (if you work with Enterprise Guide) is not always correct, as the binary value might be converted somewhere along the way, and the HEX format is applied after that on the client side. We have had issues with incorrectly displayed UUID's (which are 16 binary bytes).

The display in the log is always right, as the resulting text is not altered by the IOM bridge.

SarahDew
Obsidian | Level 7
Thanks, this answers all my questions!

The externals will transfer the dataset to SAS VA (Viya), so I'm not sure how the ID's will be displayed there and if the format will be kept. I will ask them to give me some test ID's to see if I can still make the match using the log.
FreelanceReinh
Jade | Level 19

Hi @SarahDew,

 

@Kurt_Bremser's suspicion of a possible character conversion seems very plausible to me.

 


@SarahDew wrote:
Me: 5A3558265A673F3F5E2B3F3F3F523308633F463F733F3F3F3F3F183F3F3252

Note that the above hex string looks suspicious in three ways:

  1. It has length 62, hence will never match anything in $HEX64. format.
  2. I think a SHA256 digest typically looks like (roughly speaking) a randomly selected string from all possible 32-character strings. Under this assumption I would be very surprised to see a string in which one character occurs 14 times -- as is the case for '3F'x in your reported string.
  3. 3F is the hexadecimal ASCII code of the question mark '?' -- which, in turn, might be the replacement character if certain other characters cannot be (sort of) "displayed" properly.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1271 views
  • 2 likes
  • 3 in conversation