BookmarkSubscribeRSS Feed
DangIT
Fluorite | Level 6

Hi,

I have received a file with PII variables encrypted to random string of 24 characters. To identify a unique person on my data file I have to concatenate a number of these variables together. The outcome is an ID variable with ~100 character string.

Processing this ID is inefficient so I am wondering if there is a way to convert the string ID to a numeric ID and being able to convert it back to the string if I need to? I may have millions of ID's on my data set and new ones coming in every month.

Thoughts on the approach is greatly appreciated. Thanks in advance.

6 REPLIES 6
ballardw
Super User

I might ask why you need to actually concatenate the variables. Most report and analysis procedures can use multiple variables for the roll of an ID with CLASS, BY or just using all the variables. If the sole reason is to reduce the number of variables you are referencing in code you might want to consider the complexities of maintaining this process. And a 100 character string is likely to have all sorts of issues fitting into the precision of SAS numeric variables of 12 or 16 digits depending on system.

DangIT
Fluorite | Level 6

Hi ballardw,

Yes, mainly because it is much easier to handle.

One of the encrypted variables refers to a 'family' and with the concatenation of 2 other variables, I am able to identify a 'member' within the family. For analytics, I will have 2 levels for analysis, the 'Family' and the 'Member'.

The Member_ID is what i'm looking to convert to a numeric ID if possible..

RW9
Diamond | Level 26 RW9
Diamond | Level 26

You may be better off using distinct lists and code values, much like an RDBMs.  I.e. family a = 1, family b=2 ..  Do that for each shrinkable data item, then use the code rather than the full text.  So to add:

01001...

In this example, the first two characters represent 1 which is family a, the next 3 represent individual within that family etc.  You can build up quite complex id's from just simple related numeric id's.  To get back to character data, just left join familiy on input(substr(id,1,2),best.)=family_id.

LinusH
Tourmaline | Level 20

Not sure how you will work on this concatenated ID value in practice. But I'm just data modeler, not an analyst.

You could out of the box hash your key, that will give you a 16 byte instead of 24 byte.

You could also hash concatenated keys (ID), which will also be 16 bytes, independent of how long your input is.

To trace it backwards, you need to store a look-up table with original - hashed-value pairs.

See MD5() function.

The data model way of solving this would be to have a master ID table.

The a relation table that hold the connections information between ID, e.g. families.

Data never sleeps
DangIT
Fluorite | Level 6

Thank you. I like this idea, I will give it a shot.

Maybe a dumb question, but would this solution be able to consider the addition of new ID's on a monthly basis without duplication?

I'm new to this, so thinking out loud I am imagining a process that will look like:

1. Setup up first initial hash lookup table on existing dataset

     a. output of a hashed-value pair

2. On a monthly basis I receive a data feed

3. Append the data to my master table

3. Run the hashing program

     a.only new id's will be generated into the hash table

4. Rollup my analytic dataset by numeric hash ids

does that make sense? 

LinusH
Tourmaline | Level 20

To avoid duplication, add only new Id's. Perhaps using MODIFY BY or any other method of choice.

Exactly how/when to create the hash value, you have to figure out for yourself - it could be done in the same step when updating the master table (not by append!)

And small correction: a hash result (from MD5) is a 128 bit-stream, not a plain numerical value. You store it in SAS in a 16 byte char variable, and perhaps using a HEX32. as format for better visualization.

Data never sleeps

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1982 views
  • 0 likes
  • 4 in conversation