BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jsch
Calcite | Level 5

On SAS ver 9.4

 

This is a very broad question, and I hope I just need to be pointed in the right direction:)

 

I have a number of datasets that I need to generate examples from. The columns in the examples should contain random values that could have appeared in the original variable. So basically I want to copy the structure of the datasets, while populating the columns with random values.

 

My primary challenge is the scale. I have 20+ data sets with as many as 100 variables that can contain dates, character or numerical values. Hence, I would love to have a piece of code that is easy to apply to different datasets.

 

Regards

Jeppe

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

That might be very easy for IML code. Make a macro to go through all your datasets.

 

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have nobs nobs;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;

 

 

View solution in original post

15 REPLIES 15
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Posting example test data in the form of a datastep, and what the output should look like would be a good start.  Does it need to be actual values like in the data - i.e. what about character, code lists etc. do these have to match.  Why not just x the data, such like:

if character then output X to the length of variable

if numeric then output 0 to the length of the variable including . if float

So:
Var1   Var2   Var3

XXXX  0.00   000

 

That would be far easier than trying to create lists of valid values on all those columns, then randomising it, then populating.

jsch
Calcite | Level 5

Thank you for the quick reply.

 

I have tried to describe the input and output data in words, but here's an example:

 

Input data

var1	var2	var3
M		1		31aug2016
F		2		01sep2016
M		3		02sep2016
F		4		03sep2016


Output data

var1	var2	var3
F		3		01sep2016
F		2		31aug2016
M		3		01sep2016
F		1		03sep2016

 

It is important to me that the generated example datasets contain values that could have appeared in the original data. So I am afraid your suggestion does not solve my problem.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Ah, ok.  Then I am sorry, this is a vast topic.  I assume your in the pharma industry, and data anonymisation is a big topic at the moment.  Basically its not simple.  Sure, with code list items like sex, you can take a distinct list, then divide the total items by a random number and assign that.  However from there onwards it gets far more complicated.  Dates for instance, should be within the study, within visit schedule.  Values from labs should be both with ranges, outside ranges, abnormal values (normally assigned by lab tech) etc.  So your question is how to build a data anonymisation system - way outside the scope of this Q&A forum.

ballardw
Super User

How many records do you want "simulating" each data set?

Or would a random sample of your existing data work? This might make more sense.

 

This would select 50 random records from your data set:

 

proc surveyselect data=have out=selected sampsize=50;

run;

 

or this would select a percentage of the records, 30 percent in the example:

 

proc surveyselect data=have out=selected samprate=30;

run;

 

Note that we don't have to name any variables at all. If you do not want the Personally identifiable variables then use a dataset option Drop on either the input data: data=have(drop=id) or the output data set: out=selected (drop=Id)

 

TomKari
Onyx | Level 15

This is indeed a very large topic. To get you started, here's an example that will cover your sample data:

 

data want(drop=_:);

call streaminit(1704211113);

do _i = 1 to 1000;

var1 = choosec((floor(rand('uniform') * 2) + 1), "F", "M");

var2 = floor(rand('uniform') * 4) + 1;

var3 = floor(rand('uniform') * ('31dec2016'd - '01aug2016'd + 1)) + '01aug2016'd;

format var3 date9.;

output;

end;

run;

 

However, your challenge will be to extend this to a large number of tables and variables. I could see the possibility of some sort of metadata-driven approach, but that's something that I would have to charge to investigate and develop.

 

Tom

ballardw
Super User

I will say that I have done something similar to this with rules: If this variable is in this range then another variable is either blank or exceed/be less than a value. For about 30 variables this was roughly a week of work. And if any of the rules changes then it can add a significant addition of time.

Note that you may have issues with "random" generation when there should be internal constraints. For instance if this is health data there are procedures/diagnoses that are gender linked and often age associated. So a "simple random" approach could generate a Male aged 3 years that is pregnant/diagnosed with uterine cancer or similar.

Other fields have difference referential rules that should be enforced: You can't die before birth (in a Vital Records legal sense) or take money from an account before it is deposited (loan yes, withdraw no).

thomp7050
Pyrite | Level 9

My thoughts:

 

  1. Select variable list, table names, and data types, and length from dictionary.tables and dictionary.columns, joined, and store in a table
  2. In the table in #1, code those columns that should be incremented (e.g. primary/foreign keys, etc.), randomly generated (e.g. subjectID, etc.), and sampled (e.g. lookup values, DRG/CPT codes, etc.)
  3. For your tables in #1, specify how many records you would like within each table.  You could also select exactly the number of records that exist in the current tables.
  4. Create a macro that, for each table, creates a blank "dummy" table for each of the listed specifications.  Alternatively, create a copy of the table using a data step, set it as the original, and delete all of the records.
  5. Based on the selections in #2, fill each column in the table using a macro and the call execute function, depending on its type (e.g. increment row id, randomly populate "random generate" columns, select randomly from the list for "sampled" columns)
  6. If you wish, for those columns that are "sampled", you could calculate the odds of each distinct item occurring in the sampled column and then randomly roll the result to print each option.  For example, if it were a 1.3% chance that I would have a DRG of 731.05, then randomly roll the list, based on the odds, to print the lookup.

Hope this helps.  Does it?

 

Patrick 

Ksharp
Super User

 

That might be very easy for IML code.

Make a macro to go through all your datases.

 

 

 

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

do i=1 to nrow(x);
 do j=1 to ncol(x);
  x1[i,j]=sample(x[,j],1);
 end;
end;

do i=1 to nrow(y);
 do j=1 to ncol(y);
  y1[i,j]=sample(y[,j],1);
 end;
end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;
 
Ksharp
Super User

That might be very easy for IML code. Make a macro to go through all your datasets.

 

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

proc iml;
use have nobs nobs;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];

read all var _num_ into x1;
read all var _char_ into y1;
close;

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

data temp;
 merge char num;
run;

data want;
 set have;
 stop;
run;
proc append base=want data=temp force;
run;

 

 

jsch
Calcite | Level 5

Thank you all for your input. This is the first time I ask a question here, and I am blown away by the helpfullness and kindness of you all.

 

I think I'll work with the proc iml solution for now.

 

Follow-up questions incoming soon I'm sure:)

 

 

kanzash
Calcite | Level 5

 How would you modify this to generate a specified number of records? This solution generates the same number of records as were existing in the dataset which we sampled from. 

Ksharp
Super User
data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

%let nobs=10;
proc iml;
use have ;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
close;

x1=j(&nobs,ncol(x),.);
y1=j(&nobs,ncol(x),nleng(y));

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],&nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],&nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;
kanzash
Calcite | Level 5

Thank you for your prompt reply.

 

A few questions: 

 

For the following line 

 

y1=j(&nobs,ncol(x),nleng(y));

I am assuming it is meant to be ncol(y) instead of ncol(x)

 

Furthermore, for  the following lines of code:

 

do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],&nobs));
 end;

I don't see any problem with it but SAS Studio is throwing an error  

 

ERROR: (execution) Invalid argument to function.

operation : [ at line 104 column 18
operands : y1, , j, _TEM1003
 
 
Ksharp
Super User

Sorry about that. This one should work .

EDITED.

 

data have;
infile cards expandtabs truncover;
 input var1	$ var2	var3 : date9. var4 $;
 format var3 date9.;
cards;
M		1		31aug2016 x 
F		2		01sep2016 f
M		3		02sep2016 g
F		4		03sep2016 h
;
run;

%let nobs=10;
proc iml;
use have ;
read all var _num_ into x[c=vname_num];
read all var _char_ into y[c=vname_char];
close;

x1=j(&nobs,ncol(x),.);
y1=j(&nobs,ncol(y),blankstr(nleng(y)));

 do j=1 to ncol(x);
  x1[,j]=t(sample(x[,j],&nobs));
 end;

 do j=1 to ncol(y);
  y1[,j]=t(sample(y[,j],&nobs));
 end;


create num from x1[c=vname_num];
append from x1;
close;
create char from y1[c=vname_char];
append from y1;
close;
quit;

 ED 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 10579 views
  • 3 likes
  • 7 in conversation