Solved: How do I generate random dataset from existing dataset

jsch · Posted 04-21-2017 08:08 AM

On SAS ver 9.4

This is a very broad question, and I hope I just need to be pointed in the right direction:)

I have a number of datasets that I need to generate examples from. The columns in the examples should contain random values that could have appeared in the original variable. So basically I want to copy the structure of the datasets, while populating the columns with random values.

My primary challenge is the scale. I have 20+ data sets with as many as 100 variables that can contain dates, character or numerical values. Hence, I would love to have a piece of code that is easy to apply to different datasets.

Regards

Jeppe

Ksharp · Posted 04-23-2017 06:28 AM

That might be very easy for IML code. Make a macro to go through all your datasets.

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have nobs nobs;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;

View solution in original post

RW9 · Posted 04-21-2017 08:39 AM

Posting example test data in the form of a datastep, and what the output should look like would be a good start. Does it need to be actual values like in the data - i.e. what about character, code lists etc. do these have to match. Why not just x the data, such like:

if character then output X to the length of variable

if numeric then output 0 to the length of the variable including . if float

So:
Var1 Var2 Var3

XXXX 0.00 000

That would be far easier than trying to create lists of valid values on all those columns, then randomising it, then populating.

jsch · Posted 04-21-2017 09:09 AM

Thank you for the quick reply.

I have tried to describe the input and output data in words, but here's an example:

Input data

var1	var2	var3
M		1		31aug2016
F		2		01sep2016
M		3		02sep2016
F		4		03sep2016


Output data

var1	var2	var3
F		3		01sep2016
F		2		31aug2016
M		3		01sep2016
F		1		03sep2016

It is important to me that the generated example datasets contain values that could have appeared in the original data. So I am afraid your suggestion does not solve my problem.

RW9 · Posted 04-21-2017 09:22 AM

Ah, ok. Then I am sorry, this is a vast topic. I assume your in the pharma industry, and data anonymisation is a big topic at the moment. Basically its not simple. Sure, with code list items like sex, you can take a distinct list, then divide the total items by a random number and assign that. However from there onwards it gets far more complicated. Dates for instance, should be within the study, within visit schedule. Values from labs should be both with ranges, outside ranges, abnormal values (normally assigned by lab tech) etc. So your question is how to build a data anonymisation system - way outside the scope of this Q&A forum.

ballardw · Posted 04-21-2017 10:59 AM

How many records do you want "simulating" each data set?

Or would a random sample of your existing data work? This might make more sense.

This would select 50 random records from your data set:

proc surveyselect data=have out=selected sampsize=50;

run;

or this would select a percentage of the records, 30 percent in the example:

proc surveyselect data=have out=selected samprate=30;

run;

Note that we don't have to name any variables at all. If you do not want the Personally identifiable variables then use a dataset option Drop on either the input data: data=have(drop=id) or the output data set: out=selected (drop=Id)

TomKari · Posted 04-21-2017 11:36 AM

This is indeed a very large topic. To get you started, here's an example that will cover your sample data:

data want(drop=_:);

call streaminit(1704211113);

do _i = 1 to 1000;

var1 = choosec((floor(rand('uniform') * 2) + 1), "F", "M");

var2 = floor(rand('uniform') * 4) + 1;

var3 = floor(rand('uniform') * ('31dec2016'd - '01aug2016'd + 1)) + '01aug2016'd;

format var3 date9.;

output;

end;

run;

However, your challenge will be to extend this to a large number of tables and variables. I could see the possibility of some sort of metadata-driven approach, but that's something that I would have to charge to investigate and develop.

Tom

ballardw · Posted 04-21-2017 12:07 PM

I will say that I have done something similar to this with rules: If this variable is in this range then another variable is either blank or exceed/be less than a value. For about 30 variables this was roughly a week of work. And if any of the rules changes then it can add a significant addition of time.

Note that you may have issues with "random" generation when there should be internal constraints. For instance if this is health data there are procedures/diagnoses that are gender linked and often age associated. So a "simple random" approach could generate a Male aged 3 years that is pregnant/diagnosed with uterine cancer or similar.

Other fields have difference referential rules that should be enforced: You can't die before birth (in a Vital Records legal sense) or take money from an account before it is deposited (loan yes, withdraw no).

thomp7050 · Posted 04-21-2017 02:05 PM

My thoughts:

Select variable list, table names, and data types, and length from dictionary.tables and dictionary.columns, joined, and store in a table
In the table in #1, code those columns that should be incremented (e.g. primary/foreign keys, etc.), randomly generated (e.g. subjectID, etc.), and sampled (e.g. lookup values, DRG/CPT codes, etc.)
For your tables in #1, specify how many records you would like within each table. You could also select exactly the number of records that exist in the current tables.
Create a macro that, for each table, creates a blank "dummy" table for each of the listed specifications. Alternatively, create a copy of the table using a data step, set it as the original, and delete all of the records.
Based on the selections in #2, fill each column in the table using a macro and the call execute function, depending on its type (e.g. increment row id, randomly populate "random generate" columns, select randomly from the list for "sampled" columns)
If you wish, for those columns that are "sampled", you could calculate the odds of each distinct item occurring in the sampled column and then randomly roll the result to print each option. For example, if it were a 1.3% chance that I would have a DRG of 731.05, then randomly roll the list, based on the odds, to print the lookup.

Hope this helps. Does it?

Patrick

Ksharp · Posted 04-22-2017 02:50 AM

That might be very easy for IML code.

Make a macro to go through all your datases.

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

do i=1 to nrow(x);
 do j=1 to ncol(x);
  x1[i,j]=sample(x[,j],1);
 end;
end;

do i=1 to nrow(y);
 do j=1 to ncol(y);
  y1[i,j]=sample(y[,j],1);
 end;
end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;

Ksharp · Posted 04-23-2017 06:28 AM

That might be very easy for IML code. Make a macro to go through all your datasets.

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have nobs nobs;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;

jsch · Posted 04-27-2017 04:14 AM

Thank you all for your input. This is the first time I ask a question here, and I am blown away by the helpfullness and kindness of you all.

I think I'll work with the proc iml solution for now.

Follow-up questions incoming soon I'm sure:)

kanzash · Posted 07-30-2018 04:03 PM

How would you modify this to generate a specified number of records? This solution generates the same number of records as were existing in the dataset which we sampled from.

Ksharp · Posted 07-31-2018 09:00 AM

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

%let nobs=10;
proc iml;
use have ;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
close;

x1=j(&nobs,ncol(x),.);
y1=j(&nobs,ncol(x),nleng(y));

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],&nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],&nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

kanzash · Posted 07-31-2018 09:39 AM

Thank you for your prompt reply.

A few questions:

For the following line

y1=j(&nobs,ncol(x),nleng(y));

I am assuming it is meant to be ncol(y) instead of ncol(x)

Furthermore, for the following lines of code:

do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],&nobs));
 end;

I don't see any problem with it but SAS Studio is throwing an error

ERROR: (execution) Invalid argument to function.

operation : [ at line 104 column 18

operands : y1, , j, _TEM1003

Ksharp · Posted 07-31-2018 09:53 AM

Sorry about that. This one should work .

EDITED.

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

%let nobs=10;
proc iml;
use have ;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
close;

x1=j(&nobs,ncol(x),.);
y1=j(&nobs,ncol(y),blankstr(nleng(y)));

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],&nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],&nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

ED

How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Re: How do I generate random dataset from existing dataset

Registration is open