DATA Step, Macro, Functions and more

Encrypting - decrypting values?

Reply
Frequent Contributor
Posts: 81

Encrypting - decrypting values?

Hi,

I have received a file with PII variables encrypted to random string of 24 characters. To identify a unique person on my data file I have to concatenate a number of these variables together. The outcome is an ID variable with ~100 character string.

Processing this ID is inefficient so I am wondering if there is a way to convert the string ID to a numeric ID and being able to convert it back to the string if I need to? I may have millions of ID's on my data set and new ones coming in every month.

Thoughts on the approach is greatly appreciated. Thanks in advance.

Super User
Posts: 11,343

Re: Encrypting - decrypting values?

I might ask why you need to actually concatenate the variables. Most report and analysis procedures can use multiple variables for the roll of an ID with CLASS, BY or just using all the variables. If the sole reason is to reduce the number of variables you are referencing in code you might want to consider the complexities of maintaining this process. And a 100 character string is likely to have all sorts of issues fitting into the precision of SAS numeric variables of 12 or 16 digits depending on system.

Frequent Contributor
Posts: 81

Re: Encrypting - decrypting values?

Hi ballardw,

Yes, mainly because it is much easier to handle.

One of the encrypted variables refers to a 'family' and with the concatenation of 2 other variables, I am able to identify a 'member' within the family. For analytics, I will have 2 levels for analysis, the 'Family' and the 'Member'.

The Member_ID is what i'm looking to convert to a numeric ID if possible..

Super User
Super User
Posts: 7,942

Re: Encrypting - decrypting values?

You may be better off using distinct lists and code values, much like an RDBMs.  I.e. family a = 1, family b=2 ..  Do that for each shrinkable data item, then use the code rather than the full text.  So to add:

01001...

In this example, the first two characters represent 1 which is family a, the next 3 represent individual within that family etc.  You can build up quite complex id's from just simple related numeric id's.  To get back to character data, just left join familiy on input(substr(id,1,2),best.)=family_id.

Super User
Posts: 5,424

Re: Encrypting - decrypting values?

Not sure how you will work on this concatenated ID value in practice. But I'm just data modeler, not an analyst.

You could out of the box hash your key, that will give you a 16 byte instead of 24 byte.

You could also hash concatenated keys (ID), which will also be 16 bytes, independent of how long your input is.

To trace it backwards, you need to store a look-up table with original - hashed-value pairs.

See MD5() function.

The data model way of solving this would be to have a master ID table.

The a relation table that hold the connections information between ID, e.g. families.

Data never sleeps
Frequent Contributor
Posts: 81

Re: Encrypting - decrypting values?

Thank you. I like this idea, I will give it a shot.

Maybe a dumb question, but would this solution be able to consider the addition of new ID's on a monthly basis without duplication?

I'm new to this, so thinking out loud I am imagining a process that will look like:

1. Setup up first initial hash lookup table on existing dataset

     a. output of a hashed-value pair

2. On a monthly basis I receive a data feed

3. Append the data to my master table

3. Run the hashing program

     a.only new id's will be generated into the hash table

4. Rollup my analytic dataset by numeric hash ids

does that make sense? 

Super User
Posts: 5,424

Re: Encrypting - decrypting values?

To avoid duplication, add only new Id's. Perhaps using MODIFY BY or any other method of choice.

Exactly how/when to create the hash value, you have to figure out for yourself - it could be done in the same step when updating the master table (not by append!)

And small correction: a hash result (from MD5) is a 128 bit-stream, not a plain numerical value. You store it in SAS in a 16 byte char variable, and perhaps using a HEX32. as format for better visualization.

Data never sleeps
Ask a Question
Discussion stats
  • 6 replies
  • 438 views
  • 0 likes
  • 4 in conversation