On SAS ver 9.4
This is a very broad question, and I hope I just need to be pointed in the right direction:)
I have a number of datasets that I need to generate examples from. The columns in the examples should contain random values that could have appeared in the original variable. So basically I want to copy the structure of the datasets, while populating the columns with random values.
My primary challenge is the scale. I have 20+ data sets with as many as 100 variables that can contain dates, character or numerical values. Hence, I would love to have a piece of code that is easy to apply to different datasets.
Regards
Jeppe
That might be very easy for IML code. Make a macro to go through all your datasets.
data have;
infile cards expandtabs truncover;
input var1 $ var2 var3 : date9. var4 $;
format var3 date9.;
cards;
M 1 31aug2016 x
F 2 01sep2016 f
M 3 02sep2016 g
F 4 03sep2016 h
;
run;
proc iml;
use have nobs nobs;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
read all var _num_ into x1;
read all var _char_ into y1;
close;
do j=1 to ncol(x);
x1[,j]=t(sample(x[,j],nobs));
end;
do j=1 to ncol(y);
y1[,j]=t(sample(y[,j],nobs));
end;
create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;
data temp;
merge char num;
run;
data want;
set have;
stop;
run;
proc append base=want data=temp force;
run;
Posting example test data in the form of a datastep, and what the output should look like would be a good start. Does it need to be actual values like in the data - i.e. what about character, code lists etc. do these have to match. Why not just x the data, such like:
if character then output X to the length of variable
if numeric then output 0 to the length of the variable including . if float
So:
Var1 Var2 Var3
XXXX 0.00 000
That would be far easier than trying to create lists of valid values on all those columns, then randomising it, then populating.
Thank you for the quick reply.
I have tried to describe the input and output data in words, but here's an example:
Input data
var1 var2 var3
M 1 31aug2016
F 2 01sep2016
M 3 02sep2016
F 4 03sep2016
Output data
var1 var2 var3
F 3 01sep2016
F 2 31aug2016
M 3 01sep2016
F 1 03sep2016
It is important to me that the generated example datasets contain values that could have appeared in the original data. So I am afraid your suggestion does not solve my problem.
Ah, ok. Then I am sorry, this is a vast topic. I assume your in the pharma industry, and data anonymisation is a big topic at the moment. Basically its not simple. Sure, with code list items like sex, you can take a distinct list, then divide the total items by a random number and assign that. However from there onwards it gets far more complicated. Dates for instance, should be within the study, within visit schedule. Values from labs should be both with ranges, outside ranges, abnormal values (normally assigned by lab tech) etc. So your question is how to build a data anonymisation system - way outside the scope of this Q&A forum.
How many records do you want "simulating" each data set?
Or would a random sample of your existing data work? This might make more sense.
This would select 50 random records from your data set:
proc surveyselect data=have out=selected sampsize=50;
run;
or this would select a percentage of the records, 30 percent in the example:
proc surveyselect data=have out=selected samprate=30;
run;
Note that we don't have to name any variables at all. If you do not want the Personally identifiable variables then use a dataset option Drop on either the input data: data=have(drop=id) or the output data set: out=selected (drop=Id)
This is indeed a very large topic. To get you started, here's an example that will cover your sample data:
data want(drop=_:);
call streaminit(1704211113);
do _i = 1 to 1000;
var1 = choosec((floor(rand('uniform') * 2) + 1), "F", "M");
var2 = floor(rand('uniform') * 4) + 1;
var3 = floor(rand('uniform') * ('31dec2016'd - '01aug2016'd + 1)) + '01aug2016'd;
format var3 date9.;
output;
end;
run;
However, your challenge will be to extend this to a large number of tables and variables. I could see the possibility of some sort of metadata-driven approach, but that's something that I would have to charge to investigate and develop.
Tom
I will say that I have done something similar to this with rules: If this variable is in this range then another variable is either blank or exceed/be less than a value. For about 30 variables this was roughly a week of work. And if any of the rules changes then it can add a significant addition of time.
Note that you may have issues with "random" generation when there should be internal constraints. For instance if this is health data there are procedures/diagnoses that are gender linked and often age associated. So a "simple random" approach could generate a Male aged 3 years that is pregnant/diagnosed with uterine cancer or similar.
Other fields have difference referential rules that should be enforced: You can't die before birth (in a Vital Records legal sense) or take money from an account before it is deposited (loan yes, withdraw no).
My thoughts:
Hope this helps. Does it?
Patrick
That might be very easy for IML code.
Make a macro to go through all your datases.
data have;
infile cards expandtabs truncover;
input var1 $ var2 var3 : date9. var4 $;
format var3 date9.;
cards;
M 1 31aug2016 x
F 2 01sep2016 f
M 3 02sep2016 g
F 4 03sep2016 h
;
run;
proc iml;
use have;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
read all var _num_ into x1;
read all var _char_ into y1;
close;
do i=1 to nrow(x);
do j=1 to ncol(x);
x1[i,j]=sample(x[,j],1);
end;
end;
do i=1 to nrow(y);
do j=1 to ncol(y);
y1[i,j]=sample(y[,j],1);
end;
end;
create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;
data temp;
merge char num;
run;
data want;
set have;
stop;
run;
proc append base=want data=temp force;
run;
That might be very easy for IML code. Make a macro to go through all your datasets.
data have;
infile cards expandtabs truncover;
input var1 $ var2 var3 : date9. var4 $;
format var3 date9.;
cards;
M 1 31aug2016 x
F 2 01sep2016 f
M 3 02sep2016 g
F 4 03sep2016 h
;
run;
proc iml;
use have nobs nobs;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
read all var _num_ into x1;
read all var _char_ into y1;
close;
do j=1 to ncol(x);
x1[,j]=t(sample(x[,j],nobs));
end;
do j=1 to ncol(y);
y1[,j]=t(sample(y[,j],nobs));
end;
create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;
data temp;
merge char num;
run;
data want;
set have;
stop;
run;
proc append base=want data=temp force;
run;
Thank you all for your input. This is the first time I ask a question here, and I am blown away by the helpfullness and kindness of you all.
I think I'll work with the proc iml solution for now.
Follow-up questions incoming soon I'm sure:)
How would you modify this to generate a specified number of records? This solution generates the same number of records as were existing in the dataset which we sampled from.
data have;
infile cards expandtabs truncover;
input var1 $ var2 var3 : date9. var4 $;
format var3 date9.;
cards;
M 1 31aug2016 x
F 2 01sep2016 f
M 3 02sep2016 g
F 4 03sep2016 h
;
run;
%let nobs=10;
proc iml;
use have ;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
close;
x1=j(&nobs,ncol(x),.);
y1=j(&nobs,ncol(x),nleng(y));
do j=1 to ncol(x);
x1[,j]=t(sample(x[,j],&nobs));
end;
do j=1 to ncol(y);
y1[,j]=t(sample(y[,j],&nobs));
end;
create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;
Thank you for your prompt reply.
A few questions:
For the following line
y1=j(&nobs,ncol(x),nleng(y));
I am assuming it is meant to be ncol(y) instead of ncol(x)
Furthermore, for the following lines of code:
do j=1 to ncol(y);
y1[,j]=t(sample(y[,j],&nobs));
end;
I don't see any problem with it but SAS Studio is throwing an error
ERROR: (execution) Invalid argument to function.
Sorry about that. This one should work .
EDITED.
data have;
infile cards expandtabs truncover;
input var1 $ var2 var3 : date9. var4 $;
format var3 date9.;
cards;
M 1 31aug2016 x
F 2 01sep2016 f
M 3 02sep2016 g
F 4 03sep2016 h
;
run;
%let nobs=10;
proc iml;
use have ;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
close;
x1=j(&nobs,ncol(x),.);
y1=j(&nobs,ncol(y),blankstr(nleng(y)));
do j=1 to ncol(x);
x1[,j]=t(sample(x[,j],&nobs));
end;
do j=1 to ncol(y);
y1[,j]=t(sample(y[,j],&nobs));
end;
create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;
ED
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.