BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chrnie
Fluorite | Level 6

Hi,

I am working on a dataset containing information on childbirths where the same mother can be included multiple times with different children. To anonymize the data I want to replace the mothers' social security numbers with a random ID variable, but I need to keep track of the duplicate mothers and assign them the same ID. 

 

This seemed like a fairly straight forward task but I am failing. Can someone please help me out?

 

1 ACCEPTED SOLUTION

Accepted Solutions
TomKari
Onyx | Level 15

Here's a method that I've developed over the years.

 

  • It's a little lengthy, but it has some advantages:
    It guarantees that your fake key values will be unique;
    It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;
    It provides a lookup table so you can switch back and forth;
    Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* The record count in the log is the number of unique keys */
proc sql noprint;
	create table RecordCount as
		select distinct SSN from have;
quit;

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

 

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
	do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
		_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
		FakeKey = _KeyBigPart + _KeyLittlePart;
		RN = rand('uniform');
		output;
	end;
run;

proc sort;
	by RN; /* Sort by a random number to have the fake keys in no particular order */
run;

data FinalFakeKeys(drop=RN);
	set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;

/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
	set FinalFakeKeys(obs=3);
run;

/* Get all of the unique key values from your data */
proc sql noprint;
	create table UniqueKeys as
		select distinct SSN from have;
quit;

/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
	set UniqueKeys;
	set FinalFakeKeys;
run;

/* And finally substitute in the fake keys to your original data */
proc sql noprint;
	create table have_with_fake_keys as
		select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
	from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;

Tom

 

View solution in original post

5 REPLIES 5
Kurt_Bremser
Super User

You can create a format for the SSns and use that; if needed, you can create a reverse format for recreating the original SSN:

data have;
input mother_s_ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;

proc sort
  data=have (keep=mother_s_ssn)
  out=cntlin (rename=(mother_s_ssn=start))
  nodupkey
;
by mother_s_ssn;
run;

data cntlin;
set cntlin;
by start;
retain label 0;
label + 1;
fmtname = "m_ssn";
type = 'C';
if last.start;
run;

proc format library=work cntlin=cntlin;
run;

data anon;
set have;
mother_s_ssn = put(mother_s_ssn,$m_ssn.);
run;
andreas_lds
Jade | Level 19

@Kurt_Bremser: Thanks for the example dataset 😉

 

You can use a one-way-hash-function:

 

data have_ssn;
input ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;

data want_ssn;
   set work.have_ssn(rename=(ssn = ori_ssn));

   length ssn $ 32;
   format ssn $hex64.;

   ssn = sha256(ori_ssn);

   drop ori_ssn;
run;
TomKari
Onyx | Level 15

Here's a method that I've developed over the years.

 

  • It's a little lengthy, but it has some advantages:
    It guarantees that your fake key values will be unique;
    It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;
    It provides a lookup table so you can switch back and forth;
    Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* The record count in the log is the number of unique keys */
proc sql noprint;
	create table RecordCount as
		select distinct SSN from have;
quit;

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

 

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
	do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
		_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
		FakeKey = _KeyBigPart + _KeyLittlePart;
		RN = rand('uniform');
		output;
	end;
run;

proc sort;
	by RN; /* Sort by a random number to have the fake keys in no particular order */
run;

data FinalFakeKeys(drop=RN);
	set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;

/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
	set FinalFakeKeys(obs=3);
run;

/* Get all of the unique key values from your data */
proc sql noprint;
	create table UniqueKeys as
		select distinct SSN from have;
quit;

/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
	set UniqueKeys;
	set FinalFakeKeys;
run;

/* And finally substitute in the fake keys to your original data */
proc sql noprint;
	create table have_with_fake_keys as
		select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
	from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;

Tom

 

chrnie
Fluorite | Level 6

Thanks for helping me out even though I did not provide an example data set - still learning 🙂 Invaluable help, I had not managed to do this on my own!

TomKari
Onyx | Level 15

Glad I could help. Let's just say I've been doing this a LOOONNNNGGGGGGG time!

 

Tom

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 2324 views
  • 0 likes
  • 4 in conversation