DATA Step, Macro, Functions and more

Data set with 'dummied' variable values

Reply
Regular Contributor
Posts: 216

Data set with 'dummied' variable values

Hi

I have data sets that I need to send to a third party for some development (using something other than SAS). The files contain some variables that have confidential values. Is there an easy way to 'dummy' the values for these variables?

I was thinking of using either 'contains' or 'index' functions and looking for numbers or letters within the values and replacing them with other letters or numbers.

Is there a better way of doing this?

Paul

Super User
Posts: 6,936

Re: Data set with 'dummied' variable values

Create lookup tables by sorting for a certain variable with nodupkey.

Then create random values for these, and do a merge back into the original dataset(s) to replace the original values with consistent data ("X" will always be replaced by, say, "A")

Depending on the data, you could also use user generated value formats for the replacement, or a hash table.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 216

Re: Data set with 'dummied' variable values

The problem is there are thousands of unique values for each of the variables. Unfortunately, its not like a gender variable where there is a manageable number.

Paul

Super User
Posts: 17,818

Re: Data set with 'dummied' variable values

Straightforward replacement isn't a recommended solution because it can be reverse engineered if desired.

I think you'd have to provide more details on your situation to get more suggestions, or @data's suggestion is a good one.

Respected Advisor
Posts: 3,777

Re: Data set with 'dummied' variable values

Do they need real data at all?  Why not just make it all up?

Regular Contributor
Posts: 216

Re: Data set with 'dummied' variable values

It can be made up, but it has to look real. For example, our dockets are in the format of NN-20912-13. This can't be dummied to say something like 4830989453XXXXX. Dummied, it would be to look something like JH-84092-02.

And I can't make up hundreds of thousands of records for these values.. It has to be some kind of systematic dummy version of the original.

The variables themselves are things like docket numbers (see above), first/last names, sequential database IDs.

Paul

Respected Advisor
Posts: 3,777

Re: Data set with 'dummied' variable values

Just think how good you will be when you can write a program that makes data that looks real and how much you will learn.

Regular Contributor
Posts: 217

Re: Data set with 'dummied' variable values

This woks for a small dataset.

  data want;

       set sashelp.cars(keep=make model) ;

    length dummy_variable_value $ 12;

    true_variable_value = model;

    dummy_variable_value = 'AA-00000-00';

  retain cntj cnti prefix;

     if _n_ = 1 then do;

     prefix = 'AA';

     cnti = 00000;

  cntj = 00;

  end;

    num = _N_;

  if cnti > 99999 then do;

           if prefix = 'AA' then   prefix = 'AB';  /* add logic here to have as many prefixes as necessary */

           else if prefix = 'AB' then   prefix = 'AC';  /* add logic here to have as many prefixes as necessary */

              cnti=00000;

  end;

  if cntj > 99 then do;

       cntj=00;

   end;

               dummy_variable_value = strip(prefix||'-'||strip(left(put(cnti,z5.)))||'-'||strip(left(put(cntj,z2.))));

    cnti = cnti +  1;

  cntj = cntj +  1;

      /* add logic to compare the dummy with the actual so that they are never the same */

   /* add logic to make sure that no dummy variable value equals any actual value. */

      /* drop any unneeded variables */

    run;

  proc print data=want;run;

Regular Contributor
Posts: 216

Re: Data set with 'dummied' variable values

Hi Jwillis

I think something like that might work for our docket numbers because there are a limited number of prefixes.

Per Reeza above, someone could decode this, but it is going to a specific vendor who is signing a confidentiality agreement (and not available to the public), so I don't think this is a risk.

Let me see if I can come up with something like the above and test it.

Paul

PROC Star
Posts: 1,091

Re: Data set with 'dummied' variable values

There isn't a "solution" for creating dummy data, as it has to be based on the characteristics of the real data, and these can vary almost infinitely. Rather, I tend to think there are techniques that are useful in different situations, and I have a "toolbox" of code snippets that can be hauled out.

A big advantage of this approach is that the variables are completely synthetic; no risk of disclosure.

My background is in social surveys; here are a few examples that I've used over the years.

Tom

/*                                                            */
/* Create Age                                                 */
/*                                                            */

Age = floor(rand('uniform') * 115); /* Integer, 0 to 114 */

/*                                                            */
/* Create Gender                                              */
/*                                                            */

Gender = ceil(rand('uniform') * 2); /* Integer, 0 to 1*/

/*                                                            */
/* Create Marital Status                                      */
/* RANTBL is a lovely function for generating categorical     */
/* variables with known distributions                         */
/*                                                            */

MarStat = rantbl(0, 0.577139251, 0.280453179, 0.060497845, 0.05685567); /* 1 Married 2 Single 3 Divorced 4 Widowed 5 Separated */
                                                                        /* From Census of Canada, 2011 */
/*                                                            */
/* Create Income                                              */
/* Normal distribution gives it a bit of a natural look       */
/* when being graphed                                         */
/*                                                            */

Income = round(rand('normal', 50000, 15000), .01);
if Income < 0
then Income = 0;

/*                                                            */
/* Create an 8 digit key field, no leading zeroes             */
/*                                                            */

GeneratedKey = floor(rand('uniform') * 90000000) + 10000000;

Respected Advisor
Posts: 3,888

Re: Data set with 'dummied' variable values

As an addition to what posted: In the few times where I had to work with masked data the most annoying thing used to be masked business keys which were required for joining tables. Masking such keys should use an approach where the same source value gets translated into the same masked value in all tables. Below an idea of how this could be achieved without using a reversible algorithm.

data have;

  input docket :$11.;

  datalines;

NN-20912-13

NN-30912-13

BB-20912-13

NN-30912-13

;

run;

data masked_docket_lookup;

  set have(keep=docket);

  _masked_docket=docket;

  stop;

run;

%let rc=byte(65+floor(ranuni(0)*26));

%let rd=floor(ranuni(0)*10);

data want(drop=_Smiley Happy;

  if _n_=1 then

    do;

      if 0 then set have masked_docket_lookup;

      dcl hash Hdocket(dataset:'masked_docket_lookup');

      _rc=Hdocket.defineKey('docket');

      _rc=Hdocket.defineData(all:'y');

      _rc=Hdocket.defineDone();

    end;

  set have end=last;

  if Hdocket.find() ne 0 then

    do;

      _masked_docket=cats(&rc,&rc,'-',&rd,&rd,&rd,&rd,&rd,'-',&rd,&rd);

      _rc=Hdocket.add();

    end;

  docket=_masked_docket;

  if last then Hdocket.output(dataset:'masked_docket_lookup_tmp');

run;

proc datasets lib=work nolist;

  delete masked_docket_lookup;

  run;

  change masked_docket_lookup_tmp=masked_docket_lookup;

  run;

quit;

data have2;

  input docket :$11.;

  datalines;

NN-20912-13

CC-99999-13

;

run;

data want2(drop=_Smiley Happy;

  if _n_=1 then

    do;

      if 0 then set have masked_docket_lookup;

      dcl hash Hdocket(dataset:'masked_docket_lookup');

      _rc=Hdocket.defineKey('docket');

      _rc=Hdocket.defineData(all:'y');

      _rc=Hdocket.defineDone();

    end;

  set have2 end=last;

  if Hdocket.find() ne 0 then

    do;

      _masked_docket=cats(&rc,&rc,'-',&rd,&rd,&rd,&rd,&rd,'-',&rd,&rd);

      _rc=Hdocket.add();

    end;

  docket=_masked_docket;

  if last then Hdocket.output(dataset:'masked_docket_lookup_tmp');

run;

proc datasets lib=work nolist;

  delete masked_docket_lookup;

  run;

  change masked_docket_lookup_tmp=masked_docket_lookup;

  run;

quit;

Regular Contributor
Posts: 216

Re: Data set with 'dummied' variable values

Hi Patrick

I tried part of your code for alpha characters above, which is definitely beyond my SAS capabilities--in spite of the fact I have been programming with it for 3 years, and it worked great for the docket numbers! I pasted the modified version of it below.

My question is can this be modified to dummy a first, last, middle name field as in the below format?

Smith, John

Doe, Jane, V.

Jackson, Randy, G.

Hanson, Therese

data masked_docket_lookup;

  set af3(keep=docket);

  _masked_docket=docket;

  stop;

run;

%let rc=byte(65+floor(ranuni(0)*26));

%let rd=floor(ranuni(0)*10);

data want(drop=_Smiley Happy;

  if _n_=1 then

    do;

      if 0 then set af3 masked_docket_lookup;

      dcl hash Hdocket(dataset:'masked_docket_lookup');

      _rc=Hdocket.defineKey('docket');

      _rc=Hdocket.defineData(all:'y');

      _rc=Hdocket.defineDone();

    end;

  set af3 end=last;

  if Hdocket.find() ne 0 then

    do;

      _masked_docket=cats(&rc,&rc,'-',&rd,&rd,&rd,&rd,&rd,'-',&rd,&rd);

      _rc=Hdocket.add();

    end;

  docket=_masked_docket;

  if last then Hdocket.output(dataset:'masked_docket_lookup_tmp');

run;

Respected Advisor
Posts: 3,888

Re: Data set with 'dummied' variable values

Below could work:

data af3;

  infile datalines truncover;

  input name $40.;

  datalines;

Smith, John

Doe, Jane, V.

Jackson, Randy, G.

Hanson, Therese

Doe, Jane, V.

;

run;

data masked_name_lookup;

/*  set af3(keep=name);*/

  /* define _masked_name with minumum length required to save memory when loading into hash table */

  length _masked_name $22.;

  _masked_name=name;

  stop;

run;

%let rc=byte(65+floor(ranuni(0)*26));

%let rd=floor(ranuni(0)*10);

data want(drop=_Smiley Happy;

  if _n_=1 then

    do;

      if 0 then set af3 masked_name_lookup;

      dcl hash Hname(dataset:'masked_name_lookup');

      _rc=Hname.defineKey('name');

      _rc=Hname.defineData(all:'y');

      _rc=Hname.defineDone();

    end;

  set af3 end=last;

  if Hname.find() ne 0 then

    do;

      _masked_name=cats(&rc,&rc,&rc,&rc,&rc,&rc,&rc,&rc,&rc,&rc);

      _stop=countc(name,',')+1;

      do _i=2 to _stop;

        _masked_name=trim(_masked_name)||', '||&rc||&rc||&rc||&rc;

      end;

      _rc=Hname.add();

    end;

  name=_masked_name;

  if last then Hname.output(dataset:'masked_name_lookup_tmp');

run;

Respected Advisor
Posts: 3,888

Re: Data set with 'dummied' variable values

If you're not too fussy about how the masked values for the name string look like then below should work as well. This code uses md5() to create a 128Bit hash value, then applies a hex32 format to express these 128Bit in a 32 character string.

A md5() is to a certain degree reversible so to fully mask the data a 22 character sub-string with random starting point is selected as the masked value. That should make it factually impossible to revert the masked string back to the original value.

data af3;

  infile datalines truncover;

  input name $40.;

  datalines;

Smith, John

Doe, Jane, V.

Jackson, Randy, G.

Hanson, Therese

Doe, Jane, V.

;

run;

data masked_name_lookup;

/*  set af3(keep=name);*/

  /* define _masked_name with minumum length required to save memory when loading into hash table */

  length _masked_name $22.;

  _masked_name=name;

  stop;

run;

data want(drop=_Smiley Happy;

  if _n_=1 then

    do;

      if 0 then set af3 masked_name_lookup;

      dcl hash Hname(dataset:'masked_name_lookup');

      _rc=Hname.defineKey('name');

      _rc=Hname.defineData(all:'y');

      _rc=Hname.defineDone();

    end;

  set af3 end=last;

  if Hname.find() ne 0 then

    do;

      _masked_name=substrn(put(md5(name),hex32.),ceil(ranuni(0)*10),22);

      _rc=Hname.add();

    end;

  name=_masked_name;

  if last then Hname.output(dataset:'masked_name_lookup_tmp');

run;

If you're using SAS9.4 then instead of md5() you could use sha256() which creates a 256Bit hash value and though makes it even more impossible to revert the masked value back to it's original.

Below how the line of code would need to look like:

      _masked_name=substrn(put(sha256(name),hex64.),ceil(ranuni(0)*42),22);

Super User
Posts: 9,676

Re: Data set with 'dummied' variable values

If you and your client have the same character set like ASCII .. then could try this one .


data have;
  input docket :$30.;
  datalines;
NN-20912-13
NN-30912-13
BB-20912-13
NN-30912-13
;
run;
data encode;
 set have;
 length encode $ 30;
 do i=1 to length(docket);
  r=rank(substr(docket,i,1))+1;
  substr(encode,i,1)=byte( ifn(r=256,1,r) );
 end;
 drop i r;
run;
data decode;
 set encode;
 length decode $ 30;
 do i=1 to length(encode);
  r=rank(substr(encode,i,1))-1;
  substr(decode,i,1)=byte( ifn(r=0,255,r) );
 end;
 drop i r;
run;

Xia Keshan

Ask a Question
Discussion stats
  • 14 replies
  • 495 views
  • 1 like
  • 8 in conversation