BookmarkSubscribeRSS Feed
Beargrad04
Fluorite | Level 6

I'm trying to take my dataset of students with their ID and create a 10-digit integer "NEWid" that will allow me to create a key that pairs their "NEWid" with the student ID and make their data/info I disseminate anonymous. 

 

I've read through forums but nothing seems to be paired with what I'm looking for. Can anyone please help me? 

19 REPLIES 19
PGStats
Opal | Level 21

What have you tried (SAS code)?

PG
Beargrad04
Fluorite | Level 6

I hadn't because I didn't know where to start tbh.

 

I currently have an excel spreadsheet that I will then import and pair with each cohort. It takes longer but it's the only way I know that works and I was trying to cut corners to expedite my process.

 

I have about 30000 people per dataset.

Reeza
Super User

https://gist.github.com/statgeek/fd94b0b6e78815430c1340e8c19f8644

/*This program demonstrates how to create a basic anonymized 
key for a unique identifier. Ensure you set the value in CALL
STREAMINIT()/RANDOM_SEED macro variable to ensure you can 
replicate the keys if needed*/

%let random_seed = 30;

*list of unique values;
proc sql; 
create table unique_list as
select distinct name
from sashelp.class;
quit;

*add random values;
data random_values;
set unique_list;
call streaminit(&random_seed.);
rand = rand('normal', 50, 10);
run;

*sort;
proc sort data=random_values;
by rand;
run;

*Assign ID to N, note this is a character format;
data ID_key_pair;
set random_values;
label = put(_n_, z5.);

fmtname = 'anon_fmt';
type='C';
start=name;
run;

*Create a format;
proc format cntlin=id_key_pair;
run;

*Create dataset with anonymized IDs;
data want;
set sashelp.class;
RandomID = put(name, $anon_fmt.);
*drop name;
run;
Beargrad04
Fluorite | Level 6

This is confusing to me because I am not an expert in SAS.

 

This seems like what I would like to do. My question is where do I set parameters for each dataset I have so that a new "NEWid" field is created, and is paired with the exact number of people in my data file? I will have about 30000 and 9 datasets.

Reeza
Super User

Did you run the code and see what was happening?


@Beargrad04 wrote:

This is confusing to me because I am not an expert in SAS.

 

This seems like what I would like to do. My question is where do I set parameters for each dataset I have so that a new "NEWid" field is created, and is paired with the exact number of people in my data file? I will have about 30000 and 9 datasets.


 

hashman
Ammonite | Level 13

@Beargrad04:

Why do you need a 10-digit integer for your anonymous ID? Do you intend to do some calculations with it? 

If not and it just an anonymous ID and nothing else, why muck around with random integers if you can simply associate the existing ID with its 16-byte MD5 digest:

  anonID = put (md5 (cats(id)), $16.) ;
 format anonID $hex32. ;

CATS is used just in case ID is numeric, so that the formula works for both numeric and character IDs. Attaching the $hex32. format to anonID is optional; but it makes anonID prettier to look at because the formatted image contains only hex digits (0-9, A-F) rather than a bunch of special characters.

 

This approach has a number of advantages:

  1. Every ID will always have one and only one corresponding anonID and vice versa. 
  2. It's practically impossible to reconstruct ID from anonID, so if you send anonID to whomever it concerns, the real IDs are perfectly hidden.
  3. When you get a new ID you don't have anonID for yet, you don't have to do any lookups to make sure an already existing anonID is not reused.
  4. You don't even have to keep ID*anonID cross-reference because you can always regenerate the anonIDs from the IDs using the above formula.  
  5. Due to the nature of MD5 (it's a one-way hash function), anonIDs will be highly random (if that matters and/or is part of your specs).

Kind regards

Paul D.

  

Reeza
Super User

FYI - MD5 hash in real life released data. 

 

https://www.theguardian.com/technology/2014/jun/27/new-york-taxi-details-anonymised-data-researchers...

 

Panduragan realised that the medallion and licence numbers both have a very specific format. Medallions only take one of three formats – either 5X55, XX555 or XXX555 – while licences are all six-digit or seven-digit numbers starting with a five. That means that there are only 2m possible license numbers, and 22m possible medallion numbers.

That let Panduragan reverse-engineer the anonymised data to find out which trips were carried out by which drivers, and in which taxis. The data had been anonymised by hashing, a cryptographic function which is supposed to be "one-way": it's very easy to find the hash of a given piece of data, and very hard – mathematically impossible, in theory – to find the piece of data which resulted in a given hash (for instance, the MD5 hash, the particular type used by NYC, of the data "Alex" is a08372b70196c21a9229cf04db6b7ceb). As the same piece of data always results in the same hash, such functions are frequently used to anonymise just this sort of data.

But once Panduragan had narrowed the possible entries down to 24m different numbers, it was the matter of only minutes to determine which numbers were associated with which pieces of anonymised data.

"Modern computers are fast: so fast that computing the 24m hashes took less than two minutes," he said. "It took a while longer to de-anonymise the entire dataset, but… [I] had it done within an hour.

"There’s a ton of resources on NYC Taxi and Limousine commission, including a mapping from licence number to driver name, and a way to look up owners of medallions. I haven’t linked them here but it’s easy to find using a quick Google search… This anonymisation is so poor that anyone could, with less than two hours work, figure which driver drove every single trip in this entire dataset. It would even be easy to calculate drivers' gross income or infer where they live."

Paduragan points out that there are a number of ways that the city could have more successfully anonymised the data. The first is if they hadn't tried to be so smart: rather than going through the effort of hashing the data, if they had simply assigned random numbers to each licence plate, it would have been much more difficult to work backwards. New York's Taxi and Limousine Commission was asked for comment, but didn't respond by publication time.

Beargrad04
Fluorite | Level 6

So on the one hand I have a method that is supposed to be beneficial; on the other hand that same method can be decrypted and is not as randomized as randomized digits I'm trying to find out how to do.

 

As it stands, it sounds as if the excel import process is the way to go so far.

 

So I import a set of random ID numbers paired with my IDs I want made anonymous, then I merge the file with said IDs and data with the anonymous ID imported by the said ID. That way my anonymous ID is paired with both files correctly and synched up with the rest of the fields I want to report.

Beargrad04
Fluorite | Level 6

By the way I appreciate all of your prompt responses and efforts to help. I know it can read a bit snippy on here, but on the contrary I'm very appreciative! Thank you all!

ballardw
Super User

@Beargrad04 wrote:

So on the one hand I have a method that is supposed to be beneficial; on the other hand that same method can be decrypted and is not as randomized as randomized digits I'm trying to find out how to do.

 

As it stands, it sounds as if the excel import process is the way to go so far.

 

So I import a set of random ID numbers paired with my IDs I want made anonymous, then I merge the file with said IDs and data with the anonymous ID imported by the said ID. That way my anonymous ID is paired with both files correctly and synched up with the rest of the fields I want to report.


Lets go through what @Reeza did with an example data set

This gets a unique list of names for associating. If your data already does not have any duplicates of the personal id values that would not be needed. BUT you really should verify that you have no duplicates before you start anything and with 30K+ plus records in an Excel sheet how did you verify no duplicates? You would use the name of your data set instead of SASHELP.Class.

proc sql; 
create table unique_list as
select distinct name
from sashelp.class;
quit;

 

 

This added a random number with the RAND function. If your data has no duplicates you could place the name of your data set than needs random numbers on the SET statement. There are different RAND function options, if you really need an integer you could use the FLOOR function on the result of the RAND function. Or something like (rand('integer',1000000000,9999999999) to directly generate 10 significant digit values.

This is the FIRST step you need if your list is 100% unique.

data random_values;
set unique_list;
call streaminit(&random_seed.);
rand = rand('normal', 50, 10);
run;

This scrambles your list of ids into a random order based on the random number result. It is needed here because Reeza's example does not actually have an existing ID number, only the name and she want to create an id that is not based ont the alphabetical order of the names in the SASHELP.CLASS data set. The code below will create an "id" value based on the random order.

proc sort data=random_values;
by rand;
run;

Important note at this time: You likely should make sure that the data set Random_values is stored in a permanent library as you might need it later.

 

This step adds the Id, a simple numeric order assigned after the sort. You would not need that if you already have an ID value. The additional pieces are preparing to create a Format, which is one of the ways SAS can look up values quickly. A format will display one value, the "START" value with the actual value of the "LABEL" . In your case you would want the RAND value as the LABEL and your existing ID variable name as the START. If your ID is actually a numeric value (bad idea put folks do that) then change the Type='C' to Type='N' so SAS understand it will be manipulating a numeric value. This would be the SECOND step that you need.

 

data ID_key_pair;
set random_values;
label = put(_n_, z5.);

fmtname = 'anon_fmt';
type='C';
start=name;
run;

This code actually creates the format. CNTLIN tells the procedure that the data set contains the instructions to create a format.

If you have a permanent library associated with this project you could add LIBRARY= lib to create the format in that library. You would use the SAS option FMTSEARCH to add that library to the search path to make it useable. The default shown below would create the format in the WORK library and be found by default. BUT the proc format code shown would need to be rerun every time you want to use the format. The THIRD step.

proc format cntlin=id_key_pair;
run;

And the following code uses the PUT function to create the random Id based on the value of the START values used above. In the example that is the value of the Name variable.

data want;
set sashelp.class;
RandomID = put(name, $anon_fmt.);
*drop name;
run;

Your code would look very similar something like: and the LAST step needed to add the Newid to your data set. Note: you really do want to create a new data set. If you make a mistake you do not want to take the chance of destroying or corrupting your original data unless it is very easy to recreate exactly as it was at the start.

data want;
set yourdatsetname;
NewId = put(youridvariable, $anon_fmt.);

run;

I can't be any more specific with data set names or variables because you have not actually shared what you have.

 

Now, a critic piece of information that may have forgotten to mention: will you have to add new student ids to this list? Because you do not want to restart this process with a full list unless you actually intend to remove the previous "newid" with something else so any previous reports or such that seem to have needed the "newid" would not reference the current look up created between the added list and the old list of students.

hashman
Ammonite | Level 13

@ballardw:

A nice recap and explanation; thanks 1e6 for that.

 

But let's step back a bit and look at what this program does and at what cost.

What it does is merely the following:

  1. Creates contiguous integers from 1 to the number of unique IDs on the file
  2. Assigns them randomly to the corresponding IDs

What is costs:

  1. First pass through the input file to get the list of the unique IDs and create file Unique_List
  2. Read Unique_List, generate a random variate for each record and create file Random_Values
  3. Sort Random_Values by the random variate
  4. Read Random_Values and create a CNTLIN= file ID_Key_Pair
  5. Read ID_Key_Pair to create a format pairing each unique value if ID with _N_ from the sorted file

Besides, the sole purpose of formatting the output RandomID as Z5. is to circumvent the fact that no informat or format can pair a numeric variable to a numeric variable. Basically, using Z5. is tantamount to hard coding that will get busted as soon as the number of unique ID values exceeds 99,999. 

 

None of which would matter if the exact same thing couldn't be easily done via two passes through the input file and writing only one file in the process - that is, the output file WANT itself. But it can be; for example (note that HAVE below is created to represent a file with duplicate IDs):

data have ;                                   
  set sashelp.class (rename=name=id) ;        
  do _n_ = 1 to ceil (ranuni (1) * 3) ;       
    output ;                                  
  end ;                                       
run ;                                         
                                              
proc sql noprint ;                            
  select count (unique id) into :n from have ;
quit ;                                        
                                              
data want (drop = _:) ;                       
  array rr [&n] _temporary_ (1:&n) ;          
  if _n_ = 1 then do ;                        
    call streaminit (30) ;                    
    dcl hash h () ;                           
    h.definekey ("id") ;                      
    h.definedata ("randomID") ;               
    h.definedone () ;                         
  end ;                                       
  set have ;                                  
  if h.find() = 0 then return ;               
  _count + 1 ;                                
  _index = rand ("integer", &n - _count + 1) ;
  RandomID = rr [_index] ;                    
  rr [_index] = rr [&n - _count + 1] ;        
  h.add() ;                                   
run ;                                         

Kind regards

Paul D.

 

    

 

 

   

sasuser_sk
Quartz | Level 8

Hi Hashman:

I am using your code and find it very useful in assigning unique id to my customer_no. I do not have any duplicates (n=322,391) in my data (sorted and removed them) but by using your code want data is doubled (644,457) generating duplicates. Any suggestion on why this might be happening? I am quite a beginner and learning to understand your code. Thanks!

hashman
Ammonite | Level 13

@Reeza

"Panduragan realised that the medallion and licence numbers both have a very specific format. Medallions only take one of three formats – either 5X55, XX555 or XXX555 – while licences are all six-digit or seven-digit numbers starting with a five. That means that there are only 2m possible license numbers, and 22m possible medallion numbers ... There’s a ton of resources on NYC Taxi and Limousine commission, including a mapping from licence number to driver name, and a way to look up owners of medallions."

 

That's the key. If you don't know the nature of the original ID and/or its pool of possible distinct values is practically unlimited, good luck working it backwards. Plus, instead of using MD5, they could've used SHA256, which takes about 30 times longer to compute. Besides, the author admits that even if the anonymous IDs were picked randomly, he could still reconstruct the original IDs because he has so much information about their structure.

 

Another key is that he also knew that MD5 was used but once. Suppose that instead of using the single MD5(ID) they merely used MD5(MD5(MD5(ID))) and he didn't know the number of nestings (which, by the way, can be made varying depending on ID) ... you get the rest of the picture.     

PGStats
Opal | Level 21

I wouldn't call point 4. an advantage, unless you combine the ID with some secret password before calling MD5.

PG

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 4823 views
  • 20 likes
  • 8 in conversation