BookmarkSubscribeRSS Feed
Paul_NYS
Obsidian | Level 7

Hi

I have data sets that I need to send to a third party for some development (using something other than SAS). The files contain some variables that have confidential values. Is there an easy way to 'dummy' the values for these variables?

I was thinking of using either 'contains' or 'index' functions and looking for numbers or letters within the values and replacing them with other letters or numbers.

Is there a better way of doing this?

Paul

14 REPLIES 14
Kurt_Bremser
Super User

Create lookup tables by sorting for a certain variable with nodupkey.

Then create random values for these, and do a merge back into the original dataset(s) to replace the original values with consistent data ("X" will always be replaced by, say, "A")

Depending on the data, you could also use user generated value formats for the replacement, or a hash table.

Paul_NYS
Obsidian | Level 7

The problem is there are thousands of unique values for each of the variables. Unfortunately, its not like a gender variable where there is a manageable number.

Paul

Reeza
Super User

Straightforward replacement isn't a recommended solution because it can be reverse engineered if desired.

I think you'd have to provide more details on your situation to get more suggestions, or @data's suggestion is a good one.

data_null__
Jade | Level 19

Do they need real data at all?  Why not just make it all up?

Paul_NYS
Obsidian | Level 7

It can be made up, but it has to look real. For example, our dockets are in the format of NN-20912-13. This can't be dummied to say something like 4830989453XXXXX. Dummied, it would be to look something like JH-84092-02.

And I can't make up hundreds of thousands of records for these values.. It has to be some kind of systematic dummy version of the original.

The variables themselves are things like docket numbers (see above), first/last names, sequential database IDs.

Paul

data_null__
Jade | Level 19

Just think how good you will be when you can write a program that makes data that looks real and how much you will learn.

jwillis
Quartz | Level 8

This woks for a small dataset.

  data want;

       set sashelp.cars(keep=make model) ;

    length dummy_variable_value $ 12;

    true_variable_value = model;

    dummy_variable_value = 'AA-00000-00';

  retain cntj cnti prefix;

     if _n_ = 1 then do;

     prefix = 'AA';

     cnti = 00000;

  cntj = 00;

  end;

    num = _N_;

  if cnti > 99999 then do;

           if prefix = 'AA' then   prefix = 'AB';  /* add logic here to have as many prefixes as necessary */

           else if prefix = 'AB' then   prefix = 'AC';  /* add logic here to have as many prefixes as necessary */

              cnti=00000;

  end;

  if cntj > 99 then do;

       cntj=00;

   end;

               dummy_variable_value = strip(prefix||'-'||strip(left(put(cnti,z5.)))||'-'||strip(left(put(cntj,z2.))));

    cnti = cnti +  1;

  cntj = cntj +  1;

      /* add logic to compare the dummy with the actual so that they are never the same */

   /* add logic to make sure that no dummy variable value equals any actual value. */

      /* drop any unneeded variables */

    run;

  proc print data=want;run;

Paul_NYS
Obsidian | Level 7

Hi Jwillis

I think something like that might work for our docket numbers because there are a limited number of prefixes.

Per Reeza above, someone could decode this, but it is going to a specific vendor who is signing a confidentiality agreement (and not available to the public), so I don't think this is a risk.

Let me see if I can come up with something like the above and test it.

Paul

TomKari
Onyx | Level 15

There isn't a "solution" for creating dummy data, as it has to be based on the characteristics of the real data, and these can vary almost infinitely. Rather, I tend to think there are techniques that are useful in different situations, and I have a "toolbox" of code snippets that can be hauled out.

A big advantage of this approach is that the variables are completely synthetic; no risk of disclosure.

My background is in social surveys; here are a few examples that I've used over the years.

Tom

/*                                                            */
/* Create Age                                                 */
/*                                                            */

Age = floor(rand('uniform') * 115); /* Integer, 0 to 114 */

/*                                                            */
/* Create Gender                                              */
/*                                                            */

Gender = ceil(rand('uniform') * 2); /* Integer, 0 to 1*/

/*                                                            */
/* Create Marital Status                                      */
/* RANTBL is a lovely function for generating categorical     */
/* variables with known distributions                         */
/*                                                            */

MarStat = rantbl(0, 0.577139251, 0.280453179, 0.060497845, 0.05685567); /* 1 Married 2 Single 3 Divorced 4 Widowed 5 Separated */
                                                                        /* From Census of Canada, 2011 */
/*                                                            */
/* Create Income                                              */
/* Normal distribution gives it a bit of a natural look       */
/* when being graphed                                         */
/*                                                            */

Income = round(rand('normal', 50000, 15000), .01);
if Income < 0
then Income = 0;

/*                                                            */
/* Create an 8 digit key field, no leading zeroes             */
/*                                                            */

GeneratedKey = floor(rand('uniform') * 90000000) + 10000000;

Patrick
Opal | Level 21

As an addition to what posted: In the few times where I had to work with masked data the most annoying thing used to be masked business keys which were required for joining tables. Masking such keys should use an approach where the same source value gets translated into the same masked value in all tables. Below an idea of how this could be achieved without using a reversible algorithm.

data have;

  input docket :$11.;

  datalines;

NN-20912-13

NN-30912-13

BB-20912-13

NN-30912-13

;

run;

data masked_docket_lookup;

  set have(keep=docket);

  _masked_docket=docket;

  stop;

run;

%let rc=byte(65+floor(ranuni(0)*26));

%let rd=floor(ranuni(0)*10);

data want(drop=_:);

  if _n_=1 then

    do;

      if 0 then set have masked_docket_lookup;

      dcl hash Hdocket(dataset:'masked_docket_lookup');

      _rc=Hdocket.defineKey('docket');

      _rc=Hdocket.defineData(all:'y');

      _rc=Hdocket.defineDone();

    end;

  set have end=last;

  if Hdocket.find() ne 0 then

    do;

      _masked_docket=cats(&rc,&rc,'-',&rd,&rd,&rd,&rd,&rd,'-',&rd,&rd);

      _rc=Hdocket.add();

    end;

  docket=_masked_docket;

  if last then Hdocket.output(dataset:'masked_docket_lookup_tmp');

run;

proc datasets lib=work nolist;

  delete masked_docket_lookup;

  run;

  change masked_docket_lookup_tmp=masked_docket_lookup;

  run;

quit;

data have2;

  input docket :$11.;

  datalines;

NN-20912-13

CC-99999-13

;

run;

data want2(drop=_:);

  if _n_=1 then

    do;

      if 0 then set have masked_docket_lookup;

      dcl hash Hdocket(dataset:'masked_docket_lookup');

      _rc=Hdocket.defineKey('docket');

      _rc=Hdocket.defineData(all:'y');

      _rc=Hdocket.defineDone();

    end;

  set have2 end=last;

  if Hdocket.find() ne 0 then

    do;

      _masked_docket=cats(&rc,&rc,'-',&rd,&rd,&rd,&rd,&rd,'-',&rd,&rd);

      _rc=Hdocket.add();

    end;

  docket=_masked_docket;

  if last then Hdocket.output(dataset:'masked_docket_lookup_tmp');

run;

proc datasets lib=work nolist;

  delete masked_docket_lookup;

  run;

  change masked_docket_lookup_tmp=masked_docket_lookup;

  run;

quit;

Paul_NYS
Obsidian | Level 7

Hi Patrick

I tried part of your code for alpha characters above, which is definitely beyond my SAS capabilities--in spite of the fact I have been programming with it for 3 years, and it worked great for the docket numbers! I pasted the modified version of it below.

My question is can this be modified to dummy a first, last, middle name field as in the below format?

Smith, John

Doe, Jane, V.

Jackson, Randy, G.

Hanson, Therese

data masked_docket_lookup;

  set af3(keep=docket);

  _masked_docket=docket;

  stop;

run;

%let rc=byte(65+floor(ranuni(0)*26));

%let rd=floor(ranuni(0)*10);

data want(drop=_:);

  if _n_=1 then

    do;

      if 0 then set af3 masked_docket_lookup;

      dcl hash Hdocket(dataset:'masked_docket_lookup');

      _rc=Hdocket.defineKey('docket');

      _rc=Hdocket.defineData(all:'y');

      _rc=Hdocket.defineDone();

    end;

  set af3 end=last;

  if Hdocket.find() ne 0 then

    do;

      _masked_docket=cats(&rc,&rc,'-',&rd,&rd,&rd,&rd,&rd,'-',&rd,&rd);

      _rc=Hdocket.add();

    end;

  docket=_masked_docket;

  if last then Hdocket.output(dataset:'masked_docket_lookup_tmp');

run;

Patrick
Opal | Level 21

Below could work:

data af3;

  infile datalines truncover;

  input name $40.;

  datalines;

Smith, John

Doe, Jane, V.

Jackson, Randy, G.

Hanson, Therese

Doe, Jane, V.

;

run;

data masked_name_lookup;

/*  set af3(keep=name);*/

  /* define _masked_name with minumum length required to save memory when loading into hash table */

  length _masked_name $22.;

  _masked_name=name;

  stop;

run;

%let rc=byte(65+floor(ranuni(0)*26));

%let rd=floor(ranuni(0)*10);

data want(drop=_:);

  if _n_=1 then

    do;

      if 0 then set af3 masked_name_lookup;

      dcl hash Hname(dataset:'masked_name_lookup');

      _rc=Hname.defineKey('name');

      _rc=Hname.defineData(all:'y');

      _rc=Hname.defineDone();

    end;

  set af3 end=last;

  if Hname.find() ne 0 then

    do;

      _masked_name=cats(&rc,&rc,&rc,&rc,&rc,&rc,&rc,&rc,&rc,&rc);

      _stop=countc(name,',')+1;

      do _i=2 to _stop;

        _masked_name=trim(_masked_name)||', '||&rc||&rc||&rc||&rc;

      end;

      _rc=Hname.add();

    end;

  name=_masked_name;

  if last then Hname.output(dataset:'masked_name_lookup_tmp');

run;

Patrick
Opal | Level 21

If you're not too fussy about how the masked values for the name string look like then below should work as well. This code uses md5() to create a 128Bit hash value, then applies a hex32 format to express these 128Bit in a 32 character string.

A md5() is to a certain degree reversible so to fully mask the data a 22 character sub-string with random starting point is selected as the masked value. That should make it factually impossible to revert the masked string back to the original value.

data af3;

  infile datalines truncover;

  input name $40.;

  datalines;

Smith, John

Doe, Jane, V.

Jackson, Randy, G.

Hanson, Therese

Doe, Jane, V.

;

run;

data masked_name_lookup;

/*  set af3(keep=name);*/

  /* define _masked_name with minumum length required to save memory when loading into hash table */

  length _masked_name $22.;

  _masked_name=name;

  stop;

run;

data want(drop=_:);

  if _n_=1 then

    do;

      if 0 then set af3 masked_name_lookup;

      dcl hash Hname(dataset:'masked_name_lookup');

      _rc=Hname.defineKey('name');

      _rc=Hname.defineData(all:'y');

      _rc=Hname.defineDone();

    end;

  set af3 end=last;

  if Hname.find() ne 0 then

    do;

      _masked_name=substrn(put(md5(name),hex32.),ceil(ranuni(0)*10),22);

      _rc=Hname.add();

    end;

  name=_masked_name;

  if last then Hname.output(dataset:'masked_name_lookup_tmp');

run;

If you're using SAS9.4 then instead of md5() you could use sha256() which creates a 256Bit hash value and though makes it even more impossible to revert the masked value back to it's original.

Below how the line of code would need to look like:

      _masked_name=substrn(put(sha256(name),hex64.),ceil(ranuni(0)*42),22);

Ksharp
Super User

If you and your client have the same character set like ASCII .. then could try this one .


data have;
  input docket :$30.;
  datalines;
NN-20912-13
NN-30912-13
BB-20912-13
NN-30912-13
;
run;
data encode;
 set have;
 length encode $ 30;
 do i=1 to length(docket);
  r=rank(substr(docket,i,1))+1;
  substr(encode,i,1)=byte( ifn(r=256,1,r) );
 end;
 drop i r;
run;
data decode;
 set encode;
 length decode $ 30;
 do i=1 to length(encode);
  r=rank(substr(encode,i,1))-1;
  substr(decode,i,1)=byte( ifn(r=0,255,r) );
 end;
 drop i r;
run;

Xia Keshan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1706 views
  • 1 like
  • 8 in conversation