turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- How do I create an ID variable accounting for dupl...

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

08-28-2017 07:53 AM

Hi,

I am working on a dataset containing information on childbirths where the same mother can be included multiple times with different children. To anonymize the data I want to replace the mothers' social security numbers with a random ID variable, but I need to keep track of the duplicate mothers and assign them the same ID.

This seemed like a fairly straight forward task but I am failing. Can someone please help me out?

Accepted Solutions

Solution

08-29-2017
04:54 AM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to andreas_lds

08-28-2017 10:31 AM

Here's a method that I've developed over the years.

- It's a little lengthy, but it has some advantages:

It guarantees that your fake key values will be unique;

It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;

It provides a lookup table so you can switch back and forth;

Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* The record count in the log is the number of unique keys */
proc sql noprint;
create table RecordCount as
select distinct SSN from have;
quit;
```

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
FakeKey = _KeyBigPart + _KeyLittlePart;
RN = rand('uniform');
output;
end;
run;
proc sort;
by RN; /* Sort by a random number to have the fake keys in no particular order */
run;
data FinalFakeKeys(drop=RN);
set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;
/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
set FinalFakeKeys(obs=3);
run;
/* Get all of the unique key values from your data */
proc sql noprint;
create table UniqueKeys as
select distinct SSN from have;
quit;
/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
set UniqueKeys;
set FinalFakeKeys;
run;
/* And finally substitute in the fake keys to your original data */
proc sql noprint;
create table have_with_fake_keys as
select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;
```

Tom

All Replies

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to chrnie

08-28-2017 08:33 AM

You can create a format for the SSns and use that; if needed, you can create a reverse format for recreating the original SSN:

```
data have;
input mother_s_ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;
proc sort
data=have (keep=mother_s_ssn)
out=cntlin (rename=(mother_s_ssn=start))
nodupkey
;
by mother_s_ssn;
run;
data cntlin;
set cntlin;
by start;
retain label 0;
label + 1;
fmtname = "m_ssn";
type = 'C';
if last.start;
run;
proc format library=work cntlin=cntlin;
run;
data anon;
set have;
mother_s_ssn = put(mother_s_ssn,$m_ssn.);
run;
```

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers

How to convert datasets to data steps

How to post code

Maxims of Maximally Efficient SAS Programmers

How to convert datasets to data steps

How to post code

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to chrnie

08-28-2017 09:57 AM

@KurtBremser: Thanks for the example dataset ;-)

You can use a one-way-hash-function:

```
data have_ssn;
input ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;
data want_ssn;
set work.have_ssn(rename=(ssn = ori_ssn));
length ssn $ 32;
format ssn $hex64.;
ssn = sha256(ori_ssn);
drop ori_ssn;
run;
```

Solution

08-29-2017
04:54 AM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to andreas_lds

08-28-2017 10:31 AM

Here's a method that I've developed over the years.

- It's a little lengthy, but it has some advantages:

It guarantees that your fake key values will be unique;

It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;

It provides a lookup table so you can switch back and forth;

Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* The record count in the log is the number of unique keys */
proc sql noprint;
create table RecordCount as
select distinct SSN from have;
quit;
```

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
FakeKey = _KeyBigPart + _KeyLittlePart;
RN = rand('uniform');
output;
end;
run;
proc sort;
by RN; /* Sort by a random number to have the fake keys in no particular order */
run;
data FinalFakeKeys(drop=RN);
set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;
/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
set FinalFakeKeys(obs=3);
run;
/* Get all of the unique key values from your data */
proc sql noprint;
create table UniqueKeys as
select distinct SSN from have;
quit;
/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
set UniqueKeys;
set FinalFakeKeys;
run;
/* And finally substitute in the fake keys to your original data */
proc sql noprint;
create table have_with_fake_keys as
select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;
```

Tom

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to chrnie

08-29-2017 04:57 AM

Thanks for helping me out even though I did not provide an example data set - still learning Invaluable help, I had not managed to do this on my own!

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to chrnie

08-29-2017 12:05 PM

Glad I could help. Let's just say I've been doing this a LOOONNNNGGGGGGG time!

Tom