How do i scramble the subject id across the datasets without making an...

Prudhvi_007 · Posted 12-01-2021 08:55 AM

I have 4 datasets - A,B,C,D. Each dataset has around 500 observations. The subjects need to be scrambled within the available subject ID's. For example Actual subjects are 101,102,103,104 , they need to be scrambled as 103,101,104,102 . I am instructed to merge 4 datasets and apply scrambling for those which have subjects in dataset A. For example if dataset A has 101 as subject id and any of the datasets B,C or D has 101 as Subject id it should be scrambled with same number. Unmatched subjects should remain as it is .

Eg 1. A-101 B-101 C-101 D-101 -Actual subjects.

A-103 B-103 C-103 D-103 - Scrambled subjects should be same across the datasets for a particular subject.

Eg 2. A-104 B-105 C-109 D-104 - Actual subjects

A-107 B-108 C-106 D-107 - Please observe scrambling for Dataset A and D.

Please help me with this requirement .

LinusH · Posted 12-01-2021 09:14 AM

Sorry, I don't really understand the requirement.

I would never tweak identifiers and map to other exisrting ones. Why?

For the sake of anomonymization I can see a reason to transform id's, but that is usuallly to a new value domain.

Nevertheless, I guess if you create a mapping table with original id, and the corresponding scrambled id, you can use that to remap data fram any data set.

Data never sleeps

Prudhvi_007 · Posted 12-03-2021 01:37 AM

Hi . Could please provide me a sample code for mapping table and remapping . It would be helpful.

s_lassen · Posted 12-01-2021 10:03 AM

Sort your datasets by ID.

Then create a dataset with the IDs from A in random order:

proc sql;                                  
  create table newids as select            
    id as new_id from a order by ranuni(0);
quit;

And put that together with the original IDs from A:

data scramble;   
  set a(keep=id);
  set newids;    
run;

Then for each of your original datasets, do the following:

data X;                    
  merge X(in=ok) scramble; 
  by id;                   
  if ok;                   
  id=coalesce(new_id,id);  
  drop new_id;             
run;

The COALESCE function is not necessary for the A dataset, but it does not hurt much (a very slight performance penalty).

How do i scramble the subject id across the datasets without making any change to actual subject id

Re: How do i scramble the subject id across the datasets without making any change to actual subject

Re: How do i scramble the subject id across the datasets without making any change to actual subject

Re: How do i scramble the subject id across the datasets without making any change to actual subject

How do i scramble the subject id across the datasets without making any change to actual subject id

Re: How do i scramble the subject id across the datasets without making any change to actual subject

Re: How do i scramble the subject id across the datasets without making any change to actual subject

Re: How do i scramble the subject id across the datasets without making any change to actual subject

Click image to register for webinar

Classroom Training Available!