Solved: Re: How do I create an ID variable accounting for duplicates?

chrnie · Posted 08-28-2017 07:53 AM

Hi,

I am working on a dataset containing information on childbirths where the same mother can be included multiple times with different children. To anonymize the data I want to replace the mothers' social security numbers with a random ID variable, but I need to keep track of the duplicate mothers and assign them the same ID.

This seemed like a fairly straight forward task but I am failing. Can someone please help me out?

TomKari · Posted 08-28-2017 10:31 AM

Here's a method that I've developed over the years.

It's a little lengthy, but it has some advantages:
It guarantees that your fake key values will be unique;
It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;
It provides a lookup table so you can switch back and forth;
Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* The record count in the log is the number of unique keys */
proc sql noprint;
	create table RecordCount as
		select distinct SSN from have;
quit;

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
	do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
		_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
		FakeKey = _KeyBigPart + _KeyLittlePart;
		RN = rand('uniform');
		output;
	end;
run;

proc sort;
	by RN; /* Sort by a random number to have the fake keys in no particular order */
run;

data FinalFakeKeys(drop=RN);
	set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;

/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
	set FinalFakeKeys(obs=3);
run;

/* Get all of the unique key values from your data */
proc sql noprint;
	create table UniqueKeys as
		select distinct SSN from have;
quit;

/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
	set UniqueKeys;
	set FinalFakeKeys;
run;

/* And finally substitute in the fake keys to your original data */
proc sql noprint;
	create table have_with_fake_keys as
		select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
	from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;

Tom

View solution in original post

Kurt_Bremser · Posted 08-28-2017 08:33 AM

You can create a format for the SSns and use that; if needed, you can create a reverse format for recreating the original SSN:

data have;
input mother_s_ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;

proc sort
  data=have (keep=mother_s_ssn)
  out=cntlin (rename=(mother_s_ssn=start))
  nodupkey
;
by mother_s_ssn;
run;

data cntlin;
set cntlin;
by start;
retain label 0;
label + 1;
fmtname = "m_ssn";
type = 'C';
if last.start;
run;

proc format library=work cntlin=cntlin;
run;

data anon;
set have;
mother_s_ssn = put(mother_s_ssn,$m_ssn.);
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

andreas_lds · Posted 08-28-2017 09:57 AM

@Kurt_Bremser: Thanks for the example dataset 😉

You can use a one-way-hash-function:

data have_ssn;
input ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;

data want_ssn;
   set work.have_ssn(rename=(ssn = ori_ssn));

   length ssn $ 32;
   format ssn $hex64.;

   ssn = sha256(ori_ssn);

   drop ori_ssn;
run;

TomKari · Posted 08-28-2017 10:31 AM

Here's a method that I've developed over the years.

It's a little lengthy, but it has some advantages:
It guarantees that your fake key values will be unique;
It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;
It provides a lookup table so you can switch back and forth;
Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* The record count in the log is the number of unique keys */
proc sql noprint;
	create table RecordCount as
		select distinct SSN from have;
quit;

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
	do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
		_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
		FakeKey = _KeyBigPart + _KeyLittlePart;
		RN = rand('uniform');
		output;
	end;
run;

proc sort;
	by RN; /* Sort by a random number to have the fake keys in no particular order */
run;

data FinalFakeKeys(drop=RN);
	set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;

/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
	set FinalFakeKeys(obs=3);
run;

/* Get all of the unique key values from your data */
proc sql noprint;
	create table UniqueKeys as
		select distinct SSN from have;
quit;

/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
	set UniqueKeys;
	set FinalFakeKeys;
run;

/* And finally substitute in the fake keys to your original data */
proc sql noprint;
	create table have_with_fake_keys as
		select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
	from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;

Tom

chrnie · Posted 08-29-2017 04:57 AM

Thanks for helping me out even though I did not provide an example data set - still learning 🙂 Invaluable help, I had not managed to do this on my own!

TomKari · Posted 08-29-2017 12:05 PM

Glad I could help. Let's just say I've been doing this a LOOONNNNGGGGGGG time!

Tom

How do I create an ID variable accounting for duplicates?

Re: How do I create an ID variable accounting for duplicates?

Re: How do I create an ID variable accounting for duplicates?

Re: How do I create an ID variable accounting for duplicates?

Re: How do I create an ID variable accounting for duplicates?

Re: How do I create an ID variable accounting for duplicates?

Re: How do I create an ID variable accounting for duplicates?

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away