Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- How do I create an ID variable accounting for duplicates?

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 08-28-2017 07:53 AM
(2523 views)

Hi,

I am working on a dataset containing information on childbirths where the same mother can be included multiple times with different children. To anonymize the data I want to replace the mothers' social security numbers with a random ID variable, but I need to keep track of the duplicate mothers and assign them the same ID.

This seemed like a fairly straight forward task but I am failing. Can someone please help me out?

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Here's a method that I've developed over the years.

- It's a little lengthy, but it has some advantages:

It guarantees that your fake key values will be unique;

It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;

It provides a lookup table so you can switch back and forth;

Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* The record count in the log is the number of unique keys */
proc sql noprint;
create table RecordCount as
select distinct SSN from have;
quit;
```

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
FakeKey = _KeyBigPart + _KeyLittlePart;
RN = rand('uniform');
output;
end;
run;
proc sort;
by RN; /* Sort by a random number to have the fake keys in no particular order */
run;
data FinalFakeKeys(drop=RN);
set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;
/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
set FinalFakeKeys(obs=3);
run;
/* Get all of the unique key values from your data */
proc sql noprint;
create table UniqueKeys as
select distinct SSN from have;
quit;
/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
set UniqueKeys;
set FinalFakeKeys;
run;
/* And finally substitute in the fake keys to your original data */
proc sql noprint;
create table have_with_fake_keys as
select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;
```

Tom

5 REPLIES 5

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can create a format for the SSns and use that; if needed, you can create a reverse format for recreating the original SSN:

```
data have;
input mother_s_ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;
proc sort
data=have (keep=mother_s_ssn)
out=cntlin (rename=(mother_s_ssn=start))
nodupkey
;
by mother_s_ssn;
run;
data cntlin;
set cntlin;
by start;
retain label 0;
label + 1;
fmtname = "m_ssn";
type = 'C';
if last.start;
run;
proc format library=work cntlin=cntlin;
run;
data anon;
set have;
mother_s_ssn = put(mother_s_ssn,$m_ssn.);
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Kurt_Bremser: Thanks for the example dataset 😉

You can use a one-way-hash-function:

```
data have_ssn;
input ssn $ child;
cards;
4567 1
32676 1
4567 2
45678 1
;
run;
data want_ssn;
set work.have_ssn(rename=(ssn = ori_ssn));
length ssn $ 32;
format ssn $hex64.;
ssn = sha256(ori_ssn);
drop ori_ssn;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Here's a method that I've developed over the years.

- It's a little lengthy, but it has some advantages:

It guarantees that your fake key values will be unique;

It makes them fairly businesslike, i.e. a nine digit number with no leading zeroes;

It provides a lookup table so you can switch back and forth;

Given all of the above, it's reasonably efficient.

Step one: Since you have duplicate keys, you need to figure out how many unique key values you have. The record count in the log for datasets RecordCount is the number (in this case, 3).

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* The record count in the log is the number of unique keys */
proc sql noprint;
create table RecordCount as
select distinct SSN from have;
quit;
```

Now that you've got that, you can use that number in the following example. I hope the comments are sufficient; if not, feel free to ask questions.

```
/* Generate some synthetic records to represent your actual data */
data have;
input SSN Var1 Var2; /* etc. */
cards;
123987456 1 2
458329782 3 4
937482961 5 6
458329782 7 8
run;
/* Assume you need 439,871 keys, each one a 9 digit number */
data FakeKeys(drop=_:);
do _KeyBigPart = 100000000 to 999999999 by 1000; /* Work with the numbers so you get 900,000 nine digit numbers */
_KeyLittlePart = int(rand('uniform') * 1000); /* Generate three random digits for the last three digits of the fake key */
FakeKey = _KeyBigPart + _KeyLittlePart;
RN = rand('uniform');
output;
end;
run;
proc sort;
by RN; /* Sort by a random number to have the fake keys in no particular order */
run;
data FinalFakeKeys(drop=RN);
set Keys(obs=439871); /* Keep exactly the number of records that you need */
run;
/* Of course, in this example, you only need three keys, so trim it down a bit more */
data FinalFakeKeys;
set FinalFakeKeys(obs=3);
run;
/* Get all of the unique key values from your data */
proc sql noprint;
create table UniqueKeys as
select distinct SSN from have;
quit;
/* Generate a lookup table between your real keys and your fake keys. Keep this! It's the only way you'll be able to go back to the actual values. */
data LookupTable;
set UniqueKeys;
set FinalFakeKeys;
run;
/* And finally substitute in the fake keys to your original data */
proc sql noprint;
create table have_with_fake_keys as
select f.FakeKey as SSN, h.Var1, h.Var2 /* etc. */
from have h inner join LookupTable f on(h.SSN = f.SSN);
quit;
```

Tom

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Glad I could help. Let's just say I've been doing this a LOOONNNNGGGGGGG time!

Tom

**Available on demand!**

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.