DATA Step, Macro, Functions and more

How do I create an ID variable accounting for duplicates?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

How do I create an ID variable accounting for duplicates?

Hi,

I am working on a dataset containing information on childbirths where the same mother can be included multiple times with different children. To anonymize the data I want to replace the mothers' social security numbers with a random ID variable, but I need to keep track of the duplicate mothers and assign them the same ID. 

 

This seemed like a fairly straight forward task but I am failing. Can someone please help me out?

 


Accepted Solutions
Solution
‎08-29-2017 04:54 AM
PROC Star
Posts: 1,288

Re: How do I create an ID variable accounting for duplicates?

Posted in reply to andreas_lds

Here's a method that I've developed over the years.

 

  • It's a little lengthy, but it has some advantages:
    It guarantees that your fake key values will be unique;
    It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;
    It provides a lookup table so you can switch back and forth;
    Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* The record count in the log is the number of unique keys */
proc sql noprint;
	create table RecordCount as
		select distinct SSN from have;
quit;

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

 

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
	do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
		_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
		FakeKey = _KeyBigPart + _KeyLittlePart;
		RN = rand('uniform');
		output;
	end;
run;

proc sort;
	by RN; /* Sort by a random number to have the fake keys in no particular order */
run;

data FinalFakeKeys(drop=RN);
	set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;

/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
	set FinalFakeKeys(obs=3);
run;

/* Get all of the unique key values from your data */
proc sql noprint;
	create table UniqueKeys as
		select distinct SSN from have;
quit;

/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
	set UniqueKeys;
	set FinalFakeKeys;
run;

/* And finally substitute in the fake keys to your original data */
proc sql noprint;
	create table have_with_fake_keys as
		select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
	from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;

Tom

 

View solution in original post


All Replies
Super User
Posts: 9,868

Re: How do I create an ID variable accounting for duplicates?

You can create a format for the SSns and use that; if needed, you can create a reverse format for recreating the original SSN:

data have;
input mother_s_ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;

proc sort
  data=have (keep=mother_s_ssn)
  out=cntlin (rename=(mother_s_ssn=start))
  nodupkey
;
by mother_s_ssn;
run;

data cntlin;
set cntlin;
by start;
retain label 0;
label + 1;
fmtname = "m_ssn";
type = 'C';
if last.start;
run;

proc format library=work cntlin=cntlin;
run;

data anon;
set have;
mother_s_ssn = put(mother_s_ssn,$m_ssn.);
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Valued Guide
Posts: 515

Re: How do I create an ID variable accounting for duplicates?

@KurtBremser: Thanks for the example dataset ;-)

 

You can use a one-way-hash-function:

 

data have_ssn;
input ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;

data want_ssn;
   set work.have_ssn(rename=(ssn = ori_ssn));

   length ssn $ 32;
   format ssn $hex64.;

   ssn = sha256(ori_ssn);

   drop ori_ssn;
run;
Solution
‎08-29-2017 04:54 AM
PROC Star
Posts: 1,288

Re: How do I create an ID variable accounting for duplicates?

Posted in reply to andreas_lds

Here's a method that I've developed over the years.

 

  • It's a little lengthy, but it has some advantages:
    It guarantees that your fake key values will be unique;
    It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;
    It provides a lookup table so you can switch back and forth;
    Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* The record count in the log is the number of unique keys */
proc sql noprint;
	create table RecordCount as
		select distinct SSN from have;
quit;

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

 

 

/* Generate some synthetic records to represent your actual data */
data have;
	input SSN Var1 Var2; /* etc. */
	cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;

/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
	do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
		_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
		FakeKey = _KeyBigPart + _KeyLittlePart;
		RN = rand('uniform');
		output;
	end;
run;

proc sort;
	by RN; /* Sort by a random number to have the fake keys in no particular order */
run;

data FinalFakeKeys(drop=RN);
	set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;

/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
	set FinalFakeKeys(obs=3);
run;

/* Get all of the unique key values from your data */
proc sql noprint;
	create table UniqueKeys as
		select distinct SSN from have;
quit;

/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
	set UniqueKeys;
	set FinalFakeKeys;
run;

/* And finally substitute in the fake keys to your original data */
proc sql noprint;
	create table have_with_fake_keys as
		select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
	from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;

Tom

 

Occasional Contributor
Posts: 7

Re: How do I create an ID variable accounting for duplicates?

Thanks for helping me out even though I did not provide an example data set - still learning Smiley Happy Invaluable help, I had not managed to do this on my own!

PROC Star
Posts: 1,288

Re: How do I create an ID variable accounting for duplicates?

Glad I could help. Let's just say I've been doing this a LOOONNNNGGGGGGG time!

 

Tom

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 293 views
  • 0 likes
  • 4 in conversation