03-09-2015 09:52 AM
I have received a file with PII variables encrypted to random string of 24 characters. To identify a unique person on my data file I have to concatenate a number of these variables together. The outcome is an ID variable with ~100 character string.
Processing this ID is inefficient so I am wondering if there is a way to convert the string ID to a numeric ID and being able to convert it back to the string if I need to? I may have millions of ID's on my data set and new ones coming in every month.
Thoughts on the approach is greatly appreciated. Thanks in advance.
03-09-2015 11:05 AM
I might ask why you need to actually concatenate the variables. Most report and analysis procedures can use multiple variables for the roll of an ID with CLASS, BY or just using all the variables. If the sole reason is to reduce the number of variables you are referencing in code you might want to consider the complexities of maintaining this process. And a 100 character string is likely to have all sorts of issues fitting into the precision of SAS numeric variables of 12 or 16 digits depending on system.
03-09-2015 11:40 AM
Yes, mainly because it is much easier to handle.
One of the encrypted variables refers to a 'family' and with the concatenation of 2 other variables, I am able to identify a 'member' within the family. For analytics, I will have 2 levels for analysis, the 'Family' and the 'Member'.
The Member_ID is what i'm looking to convert to a numeric ID if possible..
03-09-2015 11:53 AM
You may be better off using distinct lists and code values, much like an RDBMs. I.e. family a = 1, family b=2 .. Do that for each shrinkable data item, then use the code rather than the full text. So to add:
In this example, the first two characters represent 1 which is family a, the next 3 represent individual within that family etc. You can build up quite complex id's from just simple related numeric id's. To get back to character data, just left join familiy on input(substr(id,1,2),best.)=family_id.
03-09-2015 12:02 PM
Not sure how you will work on this concatenated ID value in practice. But I'm just data modeler, not an analyst.
You could out of the box hash your key, that will give you a 16 byte instead of 24 byte.
You could also hash concatenated keys (ID), which will also be 16 bytes, independent of how long your input is.
To trace it backwards, you need to store a look-up table with original - hashed-value pairs.
See MD5() function.
The data model way of solving this would be to have a master ID table.
The a relation table that hold the connections information between ID, e.g. families.
03-09-2015 02:02 PM
Thank you. I like this idea, I will give it a shot.
Maybe a dumb question, but would this solution be able to consider the addition of new ID's on a monthly basis without duplication?
I'm new to this, so thinking out loud I am imagining a process that will look like:
1. Setup up first initial hash lookup table on existing dataset
a. output of a hashed-value pair
2. On a monthly basis I receive a data feed
3. Append the data to my master table
3. Run the hashing program
a.only new id's will be generated into the hash table
4. Rollup my analytic dataset by numeric hash ids
does that make sense?
03-09-2015 02:37 PM
To avoid duplication, add only new Id's. Perhaps using MODIFY BY or any other method of choice.
Exactly how/when to create the hash value, you have to figure out for yourself - it could be done in the same step when updating the master table (not by append!)
And small correction: a hash result (from MD5) is a 128 bit-stream, not a plain numerical value. You store it in SAS in a 16 byte char variable, and perhaps using a HEX32. as format for better visualization.