BookmarkSubscribeRSS Feed
Smitha9
Fluorite | Level 6

Hi 

I have a request in SAS.

1. I have to create a unique random patient_number from digits 10000 to 99999.

2. This unique random number for each patient should not change if he/she visits and have records several times.

3. The new patients keeps adding to the data and the new patients need to have different unique random number different from the previous patients.

 

Please let me know if anyone could help me with the code.

 

thank you

 

12 REPLIES 12
Reeza
Super User

Here's one way. If you search on here you'll find many others.

EDIT: I'm not sure this deals with number 3 at the moment but it could be modified so that new ID's are given new numbers but that can also make them identifiable so you'll have to really ensure that it can't be identified easily. An MD5 approach may be better then. 

 

This is one quick way that would generate a unique identifier for each ID. 

You should test it though, I haven't. 

ID = MD5(reverse(MD5(ID)));

 

/*This program demonstrates how to create a basic anonymized 
key for a unique identifier. Ensure you set the value in CALL
STREAMINIT()/RANDOM_SEED macro variable to ensure you can 
replicate the keys if needed*/

%let random_seed = 30;

*list of unique values;
proc sql; 
create table unique_list as
select distinct name
from sashelp.class;
quit;

*add random values;
data random_values;
set unique_list;
call streaminit(&random_seed.);
rand = rand('normal', 50, 10);
run;

*sort;
proc sort data=random_values;
by rand;
run;

*Assign ID to N, note this is a character format;
data ID_key_pair;
set random_values;
label = put(_n_, z5.);

fmtname = 'anon_fmt';
type='C';
start=name;
run;

*Create a format;
proc format cntlin=id_key_pair;
run;

*Create dataset with anonymized IDs;
data want;
set sashelp.class;
RandomID = put(name, $anon_fmt.);
*drop name;
run;

Source: https://gist.github.com/statgeek/fd94b0b6e78815430c1340e8c19f8644

 

 

Smitha9
Fluorite | Level 6
visit1 visit2
memIDrandomID memIDrandomID
esd1er3re10021 rtn3te7pp29034
red2re4er12345 err3hr3ur43567
der5fe5kr32342 ery3ur8jl99823
err3hr3ur43567 red2re4er12345
ree6hr2uu32456 her3ur6er87345
ree9tr6rr31876 udr8tr9re98320
esd1er3re10021 esd1er3re10021

 

In the above example, there are 2 visits. Few of the visit1 patient visited again in visit2.

They got a same unique ID when visited the second time.

Can I get this is SAS code?

 

SASKiwi
PROC Star

@Smitha9  - The MD5 function will give you exactly the same hashed code for the same input. Try this:

data _null_;
  key1 = put(md5('esd1er3re'), $hex10.);
  key2 = put(md5('esd1er3re'), $hex10.);
  put _all_;
run;

Key1 and key2 are identical. 

Smitha9
Fluorite | Level 6

name   randomid

A            12345

B             23456

C            34567

A            34665

I am getting a different randomid for the same name "A"

 

I want the name "A" should have same randomid 12345 always.

Can you help me using the above data to show me in md5 or anyother function you think I can get the result I want?

 

Kurt_Bremser
Super User

Are your sure you have identical contents (no leading blanks etc)?

Run this for proof:

data _null_;
  key1 = put(md5('A'), $hex10.);
  key2 = put(md5('A'), $hex10.);
  put _all_;
run;
Smitha9
Fluorite | Level 6

data have;

name $;

cards;

abc

bcd

efg

abc

;

run;

 

can you let me know each data steps for the above data. I want name 'abc' have same random id everytime i run the code.

 

RichardDeVen
Barite | Level 11

A HASH object can maintain a lookup from an explicit patient id to a sequential alternate id.

 

Example:

data have;
  call streaminit(123);
  do _n_ = 1 to 250000;
    patid = rand('integer',1,90000);
    output;
  end;
run;

data want_alt_seq;
  set have;
  if _n_ = 1 then do;
    declare hash ALTIDS();
    ALTIDS.defineKey('patid');
    ALTIDS.defineData('alt_patid');
    ALTIDS.defineDone();
  end;

  if ALTIDS.find() ne 0 then do;
    alt_patid = 100000 + ALTIDS.NUM_ITEMS;
    ALTIDS.add();
  end;
run;

If you feel you must have a random instead of sequential alternate id, a POOL of values array can be randomized with swaps and selected from.

data want_alt_swap;
  set have;

  if _n_ = 1 then do;
    declare hash ALTIDS();
    ALTIDS.defineKey('patid');
    ALTIDS.defineData('alt_patid');
    ALTIDS.defineDone();

    array POOL(90000) _temporary_ (10000:99999);
    * randomize pool;
    do _n_ = 1 to dim(POOL);
      _ix = rand('INTEGER', dim(POOL));
      _h = POOL(_ix);
      POOL(_ix) = POOL(_n_);
      POOL(_n_) = _h;
    end;
  end;

  if ALTIDS.find() ne 0 then do;
    if ALTIDS.NUM_ITEMS = 90000 then do;
      put 'ERROR: Need more brains';
      stop;
    end;
    alt_patid = POOL(ALTIDS.NUM_ITEMS+1);
    ALTIDS.add();
  end;

  drop _:;
run;
Smitha9
Fluorite | Level 6

data have;

input name $;

cards;

abc

bcd

efg

abc

;

run;

 

 

data want;
if _n_=1 then do;
if 0 then set have;
declare hash h();
h.definekey('name');
h.definedata('id');
h.definedone();
end;
set have;
rc=h.find();
if rc ne 0 then do;n+1;id=n;h.add();end;
drop rc n;
run;

 

I am getting the random ID from 1. I want the random ID's from 10000 to 99999

 

Smitha9
Fluorite | Level 6

Here  are the requests for the SAS codes:

1. Create Unique 10000 to 99999 random numbers.

2. Give 2000 random numbers per ID

3. As the ID's are added eventually, the new ID's should get the remaining random numbers.

4. the code needs to be run again and again, until remaining random numbers are used.

5.If there any same ID's are coming into the new records, they should have the same random numbers given before.

6. I want the reproducible of the all the random ID's whenever the code is run. And the same ID's should get same random numbers if there are any duplicates in the data.

 

Please give me step by step of the code.

 

thank you

Smitha9
Fluorite | Level 6

I tried the code and it worked.

thank you for that.

There is one more added challenge in this.

The records are added frequently. So, I need to have the same randomID created when I rerun the code.

After I sign off the SAS server, I cant get the same random ID's similar to my precious one's.

Can you insert a seed or something inside the same code? 

 

thank you 

RichardDeVen
Barite | Level 11

A requirement of maintaining the anonymizing ID across steps will require storing the active hash in a secure permanent table between uses. The OUTPUT method saves a hash. The hash data can be reloaded by using the dataset: argument tag in the hash constructor. Adding rows to a master data set is often best done with Proc APPEND

 

Example:

Mask in this discussion means anonymized.

* Separate permanent libraries for different data set roles;

libname SECURE 'C:\Perm\LimitAccessFolder\HealthData';    * unmasked data library;
libname WIDER  'C:\Perm\WiderAccessFolder\HealthData';    * masked data library, for analytics and reporting;

* Macro for simulating unmasked data;

%macro simulate_data_feed(data,seed=0,N=25000,M=90000,Z=1);
  data &DATA;
    call streaminit(&seed);
    do _n_ = 1 to &N;
      patid = rand('integer',&Z,&M);
      output;
    end;
  run;
%mend;

* Macro for applying anonymized identifiers. Loaded priors, adds new as need and saves when done.

%macro seq_mask(data,var=patid,maskvar=patnum,maskdata=SECURE.maskhash,out=masked_data);
  %local error memname libname;
  %let error = 0;

  %let maskdata = %upcase(&maskdata);

  %let libname = %scan(&maskdata,2,.);
  %let memname = %scan(&maskdata,2,.);

  %if %length(&memname) = 0 %then %do;
    %put ERROR: &maskdata disallowed, WORK/USER library presumed insecure;
    %ABORT CANCEL;
  %end;

  %if &libname = WORK or &libname = USER %then %do;
    %put ERROR: &maskdata disallowed, WORK/USER library presumed insecure;
    %ABORT CANCEL;
  %end;

  data &out;
    set &data end=LASTROW;

    declare hash ALTIDS;

    if _n_ = 1 then do;
      if exist("&maskdata", 'DATA')
        then ALTIDS = _new_ hash (dataset:"&maskdata");
        else ALTIDS = _new_ Hash ();

      ALTIDS.defineKey("&var");
      ALTIDS.defineData("&var");
      ALTIDS.defineData("&maskvar");
      ALTIDS.defineDone();
    end;

    if ALTIDS.find() ne 0 then do;
      if ALTIDS.NUM_ITEMS = 90000 then do;
        put 'ERROR: No more slots for masking as 10000:99999';
        call symput('error', '1');
        stop;
      end;

      _new_count + 1;
      &maskvar = 10000 + ALTIDS.NUM_ITEMS; * random alt id is a maintained sequence number;
      ALTIDS.add();
    end;

    label &maskvar = "&var masked.";

    if LASTROW then do;
      put 'NOTE: ' _new_count 'new mask ids';
      ALTIDS.output(dataset:"&maskdata");
    end;

    drop _:;
  run;

  %if &error %then %ABORT CANCEL;
%mend;

* Use macros to mask the ids in some source data and append to a master;

options mprint;

* reset master new state;
* proc delete data=WIDER.health_data;
* proc delete data=SECURE.maskhash;
* run;

%simulate_data_feed(SECURE.day1, seed=20200101)
%simulate_data_feed(SECURE.day2, seed=20200102, N=100000)
%simulate_data_feed(SECURE.day3, seed=20200103)
%simulate_data_feed(SECURE.day4, seed=20200104)
%simulate_data_feed(SECURE.day5, seed=20200105)
%simulate_data_feed(SECURE.day6, seed=20200106)
%simulate_data_feed(SECURE.day7, seed=20200107, N=50000, M=250000, Z=100000)

%seq_mask(SECURE.day1, out=WORK.masked1)
proc append base=WIDER.health_data force data=WORK.masked1; run;

%seq_mask(SECURE.day2, out=WORK.masked_today)
proc append base=WIDER.health_data force data=WORK.masked_today; run;

%seq_mask(SECURE.day3, out=WORK.stage1)
proc append base=WIDER.health_data data=WORK.stage1; run;

%seq_mask(SECURE.day4, out=WORK.feed_special)
proc append base=WIDER.health_data data=WORK.feed_special; run;

%seq_mask(SECURE.day5, out=WORK.week1_day5)
proc append base=WIDER.health_data data=WORK.week1_day5; run;

%seq_mask(SECURE.day6, out=WORK.week1_day6)
proc append base=WIDER.health_data data=WORK.week1_day6; run;

%seq_mask(SECURE.day7, out=WORK.week1_day7)
proc append base=WIDER.health_data data=WORK.week1_day7; run;

 

NOTE: Be careful, if the partial data (masked data sets that are to be appended to a master masked data set) does not have a unique key, you will be able to accidentally append the same data more than once; not good for reporting and analysis.  If the data has a natural key (such as visit_guid, or claim_number+claim_line) be sure to index uniquely the master data set.  A unique index will prevent accidental re-appends.

 

A production level system would also log macro use, maintain raw data, have throughput metrics, have a framework for undoing/rollback, etc...

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 2886 views
  • 3 likes
  • 5 in conversation