04-21-2017 08:08 AM
On SAS ver 9.4
This is a very broad question, and I hope I just need to be pointed in the right direction
I have a number of datasets that I need to generate examples from. The columns in the examples should contain random values that could have appeared in the original variable. So basically I want to copy the structure of the datasets, while populating the columns with random values.
My primary challenge is the scale. I have 20+ data sets with as many as 100 variables that can contain dates, character or numerical values. Hence, I would love to have a piece of code that is easy to apply to different datasets.
04-21-2017 08:39 AM
Posting example test data in the form of a datastep, and what the output should look like would be a good start. Does it need to be actual values like in the data - i.e. what about character, code lists etc. do these have to match. Why not just x the data, such like:
if character then output X to the length of variable
if numeric then output 0 to the length of the variable including . if float
Var1 Var2 Var3
XXXX 0.00 000
That would be far easier than trying to create lists of valid values on all those columns, then randomising it, then populating.
04-21-2017 09:09 AM
Thank you for the quick reply.
I have tried to describe the input and output data in words, but here's an example:
Input data var1 var2 var3 M 1 31aug2016 F 2 01sep2016 M 3 02sep2016 F 4 03sep2016 Output data var1 var2 var3 F 3 01sep2016 F 2 31aug2016 M 3 01sep2016 F 1 03sep2016
It is important to me that the generated example datasets contain values that could have appeared in the original data. So I am afraid your suggestion does not solve my problem.
04-21-2017 09:22 AM
Ah, ok. Then I am sorry, this is a vast topic. I assume your in the pharma industry, and data anonymisation is a big topic at the moment. Basically its not simple. Sure, with code list items like sex, you can take a distinct list, then divide the total items by a random number and assign that. However from there onwards it gets far more complicated. Dates for instance, should be within the study, within visit schedule. Values from labs should be both with ranges, outside ranges, abnormal values (normally assigned by lab tech) etc. So your question is how to build a data anonymisation system - way outside the scope of this Q&A forum.
04-21-2017 10:59 AM
How many records do you want "simulating" each data set?
Or would a random sample of your existing data work? This might make more sense.
This would select 50 random records from your data set:
proc surveyselect data=have out=selected sampsize=50;
or this would select a percentage of the records, 30 percent in the example:
proc surveyselect data=have out=selected samprate=30;
Note that we don't have to name any variables at all. If you do not want the Personally identifiable variables then use a dataset option Drop on either the input data: data=have(drop=id) or the output data set: out=selected (drop=Id)
04-21-2017 11:36 AM
This is indeed a very large topic. To get you started, here's an example that will cover your sample data:
do _i = 1 to 1000;
var1 = choosec((floor(rand('uniform') * 2) + 1), "F", "M");
var2 = floor(rand('uniform') * 4) + 1;
var3 = floor(rand('uniform') * ('31dec2016'd - '01aug2016'd + 1)) + '01aug2016'd;
format var3 date9.;
However, your challenge will be to extend this to a large number of tables and variables. I could see the possibility of some sort of metadata-driven approach, but that's something that I would have to charge to investigate and develop.
04-21-2017 12:07 PM
I will say that I have done something similar to this with rules: If this variable is in this range then another variable is either blank or exceed/be less than a value. For about 30 variables this was roughly a week of work. And if any of the rules changes then it can add a significant addition of time.
Note that you may have issues with "random" generation when there should be internal constraints. For instance if this is health data there are procedures/diagnoses that are gender linked and often age associated. So a "simple random" approach could generate a Male aged 3 years that is pregnant/diagnosed with uterine cancer or similar.
Other fields have difference referential rules that should be enforced: You can't die before birth (in a Vital Records legal sense) or take money from an account before it is deposited (loan yes, withdraw no).
04-21-2017 02:05 PM
Hope this helps. Does it?
04-22-2017 02:50 AM - edited 04-22-2017 02:53 AM
That might be very easy for IML code.
Make a macro to go through all your datases.
data have; infile cards expandtabs truncover; input var1 $ var2 var3 : date9. var4 $; format var3 date9.; cards; M 1 31aug2016 x F 2 01sep2016 f M 3 02sep2016 g F 4 03sep2016 h ; run; proc iml; use have; read all var _num_ into x[c=vname_num]; read all var _char_ into y[c=vname_char]; read all var _num_ into x1; read all var _char_ into y1; close; do i=1 to nrow(x); do j=1 to ncol(x); x1[i,j]=sample(x[,j],1); end; end; do i=1 to nrow(y); do j=1 to ncol(y); y1[i,j]=sample(y[,j],1); end; end; create num from x1[c=vname_num]; append from x1; close; create char from y1[c=vname_char]; append from y1; close; quit; data temp; merge char num; run; data want; set have; stop; run; proc append base=want data=temp force; run;
04-23-2017 06:28 AM
I believe the following would be more efficient. data have; infile cards expandtabs truncover; input var1 $ var2 var3 : date9. var4 $; format var3 date9.; cards; M 1 31aug2016 x F 2 01sep2016 f M 3 02sep2016 g F 4 03sep2016 h ; run; proc iml; use have nobs nobs; read all var _num_ into x[c=vname_num]; read all var _char_ into y[c=vname_char]; read all var _num_ into x1; read all var _char_ into y1; close; do j=1 to ncol(x); x1[,j]=t(sample(x[,j],nobs)); end; do j=1 to ncol(y); y1[,j]=t(sample(y[,j],nobs)); end; create num from x1[c=vname_num]; append from x1; close; create char from y1[c=vname_char]; append from y1; close; quit; data temp; merge char num; run; data want; set have; stop; run; proc append base=want data=temp force; run;
04-27-2017 04:14 AM
Thank you all for your input. This is the first time I ask a question here, and I am blown away by the helpfullness and kindness of you all.
I think I'll work with the proc iml solution for now.
Follow-up questions incoming soon I'm sure