SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

How do I generate random dataset from existing dataset

Reply
New Contributor
Posts: 3

How do I generate random dataset from existing dataset

On SAS ver 9.4

 

This is a very broad question, and I hope I just need to be pointed in the right directionSmiley Happy

 

I have a number of datasets that I need to generate examples from. The columns in the examples should contain random values that could have appeared in the original variable. So basically I want to copy the structure of the datasets, while populating the columns with random values.

 

My primary challenge is the scale. I have 20+ data sets with as many as 100 variables that can contain dates, character or numerical values. Hence, I would love to have a piece of code that is easy to apply to different datasets.

 

Regards

Jeppe

Super User
Super User
Posts: 7,950

Re: How do I generate random dataset from existing dataset

Posting example test data in the form of a datastep, and what the output should look like would be a good start.  Does it need to be actual values like in the data - i.e. what about character, code lists etc. do these have to match.  Why not just x the data, such like:

if character then output X to the length of variable

if numeric then output 0 to the length of the variable including . if float

So:
Var1   Var2   Var3

XXXX  0.00   000

 

That would be far easier than trying to create lists of valid values on all those columns, then randomising it, then populating.

New Contributor
Posts: 3

Re: How do I generate random dataset from existing dataset

Thank you for the quick reply.

 

I have tried to describe the input and output data in words, but here's an example:

 

Input data

var1	var2	var3
M		1		31aug2016
F		2		01sep2016
M		3		02sep2016
F		4		03sep2016


Output data

var1	var2	var3
F		3		01sep2016
F		2		31aug2016
M		3		01sep2016
F		1		03sep2016

 

It is important to me that the generated example datasets contain values that could have appeared in the original data. So I am afraid your suggestion does not solve my problem.

Super User
Super User
Posts: 7,950

Re: How do I generate random dataset from existing dataset

Ah, ok.  Then I am sorry, this is a vast topic.  I assume your in the pharma industry, and data anonymisation is a big topic at the moment.  Basically its not simple.  Sure, with code list items like sex, you can take a distinct list, then divide the total items by a random number and assign that.  However from there onwards it gets far more complicated.  Dates for instance, should be within the study, within visit schedule.  Values from labs should be both with ranges, outside ranges, abnormal values (normally assigned by lab tech) etc.  So your question is how to build a data anonymisation system - way outside the scope of this Q&A forum.

Super User
Posts: 11,343

Re: How do I generate random dataset from existing dataset

How many records do you want "simulating" each data set?

Or would a random sample of your existing data work? This might make more sense.

 

This would select 50 random records from your data set:

 

proc surveyselect data=have out=selected sampsize=50;

run;

 

or this would select a percentage of the records, 30 percent in the example:

 

proc surveyselect data=have out=selected samprate=30;

run;

 

Note that we don't have to name any variables at all. If you do not want the Personally identifiable variables then use a dataset option Drop on either the input data: data=have(drop=id) or the output data set: out=selected (drop=Id)

 

PROC Star
Posts: 1,167

Re: How do I generate random dataset from existing dataset

This is indeed a very large topic. To get you started, here's an example that will cover your sample data:

 

data want(drop=_Smiley Happy;

call streaminit(1704211113);

do _i = 1 to 1000;

var1 = choosec((floor(rand('uniform') * 2) + 1), "F", "M");

var2 = floor(rand('uniform') * 4) + 1;

var3 = floor(rand('uniform') * ('31dec2016'd - '01aug2016'd + 1)) + '01aug2016'd;

format var3 date9.;

output;

end;

run;

 

However, your challenge will be to extend this to a large number of tables and variables. I could see the possibility of some sort of metadata-driven approach, but that's something that I would have to charge to investigate and develop.

 

Tom

Super User
Posts: 11,343

Re: How do I generate random dataset from existing dataset

I will say that I have done something similar to this with rules: If this variable is in this range then another variable is either blank or exceed/be less than a value. For about 30 variables this was roughly a week of work. And if any of the rules changes then it can add a significant addition of time.

Note that you may have issues with "random" generation when there should be internal constraints. For instance if this is health data there are procedures/diagnoses that are gender linked and often age associated. So a "simple random" approach could generate a Male aged 3 years that is pregnant/diagnosed with uterine cancer or similar.

Other fields have difference referential rules that should be enforced: You can't die before birth (in a Vital Records legal sense) or take money from an account before it is deposited (loan yes, withdraw no).

Frequent Contributor
Posts: 93

Re: How do I generate random dataset from existing dataset

My thoughts:

 

  1. Select variable list, table names, and data types, and length from dictionary.tables and dictionary.columns, joined, and store in a table
  2. In the table in #1, code those columns that should be incremented (e.g. primary/foreign keys, etc.), randomly generated (e.g. subjectID, etc.), and sampled (e.g. lookup values, DRG/CPT codes, etc.)
  3. For your tables in #1, specify how many records you would like within each table.  You could also select exactly the number of records that exist in the current tables.
  4. Create a macro that, for each table, creates a blank "dummy" table for each of the listed specifications.  Alternatively, create a copy of the table using a data step, set it as the original, and delete all of the records.
  5. Based on the selections in #2, fill each column in the table using a macro and the call execute function, depending on its type (e.g. increment row id, randomly populate "random generate" columns, select randomly from the list for "sampled" columns)
  6. If you wish, for those columns that are "sampled", you could calculate the odds of each distinct item occurring in the sampled column and then randomly roll the result to print each option.  For example, if it were a 1.3% chance that I would have a DRG of 731.05, then randomly roll the list, based on the odds, to print the lookup.

Hope this helps.  Does it?

 

Patrick 

Super User
Posts: 10,028

Re: How do I generate random dataset from existing dataset

[ Edited ]

 

That might be very easy for IML code.

Make a macro to go through all your datases.

 

 

 

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

do i=1 to nrow(x);
 do j=1 to ncol(x);
  x1[i,j]=sample(x[,j],1);
 end;
end;

do i=1 to nrow(y);
 do j=1 to ncol(y);
  y1[i,j]=sample(y[,j],1);
 end;
end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;
 
Super User
Posts: 10,028

Re: How do I generate random dataset from existing dataset


I believe the following would be more efficient.


data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have nobs nobs;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;

New Contributor
Posts: 3

Re: How do I generate random dataset from existing dataset

Thank you all for your input. This is the first time I ask a question here, and I am blown away by the helpfullness and kindness of you all.

 

I think I'll work with the proc iml solution for now.

 

Follow-up questions incoming soon I'm sureSmiley Happy

 

 

Ask a Question
Discussion stats
  • 10 replies
  • 437 views
  • 1 like
  • 6 in conversation