Solved: How to generate a dummy data set for testing without access to the ori...

Ogee · Posted 04-13-2020 01:28 PM

Hi all,

Can anyone share some simple, efficient code for creating a dummy data set that mimics the structure of an original data set that I may not have access to? The goal is that running PROC MEANS on the dummy data set should give results close to those from running PROC MEANS on the original data set. Any tips are welcome. Thanks in advance.

mkeintz · Posted 04-13-2020 01:32 PM

If you don't have access to the original data set, how do you expect to know its structure?

Ogee · Posted 04-13-2020 01:38 PM

Sorry, I should have been clearer. Let's say you know the variables and metadata of the original data set, or the original data set contains confidential information that prevents you from sharing the results. How can one generate a dummy data set that mimics the structure of the original, given the limited information one has about it? Thanks in advance.

ballardw · Posted 04-13-2020 02:42 PM

Continuous or discrete values?

Hint: with discrete values you may be more interested in the percentage of occurrence of each value than in the "mean".

Do you expect to show a similar association with any other variables?

How many dummy observations do you intend to create? How "close" is close enough?

Do you already know the mean, standard deviation (and maybe skewness and kurtosis) of the variables of interest?
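
For the discrete case mentioned in the hint, a minimal sketch might look like the following. It assumes you know (or can guess) the percentage of observations in each category; the category labels, the 50/30/20 split, the seed, and the data set name dummy_discrete are all made up for illustration.

```sas
/* Sketch only: the labels, proportions and seed are hypothetical. */
data dummy_discrete;
   call streaminit(2020);            /* fixed seed so the run is repeatable */
   length group $ 8;
   do i = 1 to 500;
      /* RAND('TABLE', p1, p2, p3) returns 1, 2 or 3 with those probabilities */
      k = rand('table', 0.50, 0.30, 0.20);
      group = choosec(k, 'low', 'medium', 'high');
      output;
   end;
   keep group;
run;

/* Check that the observed percentages are close to the targets */
proc freq data=dummy_discrete;
   tables group;
run;
```

With only 500 draws the observed percentages will wander a little around the targets; more observations bring them closer.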

PGStats · Posted 04-13-2020 02:43 PM

You have to decide in advance which features of the original data you want to mimic. You may want to match means, variances, distributions, correlations, ... the list is endless. Is using the original data with anonymised IDs a possibility?

PG

Ogee · Posted 04-13-2020 02:53 PM

Unfortunately, using anonymized IDs is not possible. The variables are continuous, and I want to match at least the means and variances and show associations with other variables in the dummy data set (an odds ratio, for example) similar to what would be expected from the original data set. "Close" doesn't have to be exactly the same, just realistic enough to the ordinary eye. Any more thoughts? Thanks much!
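
One way to get both matching moments and an association is sketched below, under stated assumptions: two continuous variables that are roughly bivariate normal with a known correlation, and a binary outcome whose log-odds are linear in one of them. Every number here (means, standard deviations, rho, b0, b1, the seed) and every name (dummy_assoc, x, y, event) is hypothetical and would be replaced by whatever summaries of the original data you are allowed to know.

```sas
/* Sketch only: all target values below are invented for illustration. */
data dummy_assoc;
   call streaminit(2020);                 /* fixed seed so the run is repeatable */
   mean_x = 62;   sd_x = 5;               /* target mean and std dev of x */
   mean_y = 100;  sd_y = 23;              /* target mean and std dev of y */
   rho    = 0.8;                          /* target correlation of x and y */
   b1     = 0.5;                          /* log odds ratio per unit of x, OR = exp(0.5) */
   b0     = -b1 * mean_x;                 /* puts P(event) at 0.5 when x = mean_x */
   do i = 1 to 500;
      z1 = rand('normal');                /* independent standard normals */
      z2 = rand('normal');
      x  = mean_x + sd_x * z1;
      /* mixing z1 and z2 this way gives corr(x, y) = rho */
      y  = mean_y + sd_y * (rho*z1 + sqrt(1 - rho**2)*z2);
      p  = 1 / (1 + exp(-(b0 + b1*x)));   /* inverse logit */
      event = rand('bernoulli', p);
      output;
   end;
   keep x y event;
run;

/* Checks: correlation near rho, odds ratio for x near exp(b1) */
proc corr data=dummy_assoc;
   var x y;
run;

proc logistic data=dummy_assoc;
   model event(event='1') = x;
run;
```

The estimates will only be close to the targets, not equal to them, which fits the "realistic enough" requirement; for more than two correlated variables the same idea needs a full covariance matrix rather than a single rho.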

PGStats · Posted 04-13-2020 03:20 PM

Data simulation is well covered in @Rick_SAS's book:

https://support.sas.com/en/books/authors/rick-wicklin.html

PG

ballardw · Posted 04-13-2020 03:20 PM

If you have reason to believe that your continuous variables are close to normally distributed, you could do something similar to:

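/* Get the mean and standard deviation of each variable.
   AUTONAME names the results height_Mean, height_StdDev, and so on. */
proc summary data=sashelp.class;
   var height weight;
   output out=classsummary mean= stddev= / autoname;
run;

/* Draw 20 dummy observations from normal distributions with those parameters. */
data trial;
   set classsummary;
   call streaminit(12345);   /* optional: any fixed seed makes the draws repeatable */
   do i=1 to 20;
      modheight = rand('normal',height_mean,height_stddev);
      modweight = rand('normal',weight_mean,weight_stddev);
      output;
   end;
   keep modheight modweight;
run;

/* Check: the results should be close to, but not identical to, the originals. */
proc means data=trial mean stddev;
run;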

If you know your data follow some other distribution, a similar approach may be possible: replace 'normal' with the appropriate distribution from the RAND documentation and supply its required parameters. You would have to get those parameters from somewhere, for example from PROC MEANS, PROC SUMMARY, or PROC UNIVARIATE run on your existing data.
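
As an illustration of swapping in another distribution, here is a minimal sketch for a right-skewed, strictly positive variable modeled as gamma. It assumes only a target mean and standard deviation are known and converts them to gamma parameters by the method of moments; the target values, the seed, and the names dummy_gamma and cost are hypothetical.

```sas
/* Sketch only: target_mean and target_sd are invented for illustration. */
data dummy_gamma;
   call streaminit(2020);                    /* fixed seed so the run is repeatable */
   target_mean = 40;
   target_sd   = 20;
   shape = (target_mean / target_sd)**2;     /* mean squared / variance = 4 */
   scale = target_sd**2 / target_mean;       /* variance / mean = 10 */
   do i = 1 to 500;
      /* RAND('GAMMA', shape) has unit scale, so rescale the draw */
      cost = scale * rand('gamma', shape);
      output;
   end;
   keep cost;
run;

/* Check: mean near 40, std dev near 20, positive skewness */
proc means data=dummy_gamma mean stddev skewness;
run;
```

Whether gamma is a reasonable stand-in depends on the shape of the real data, which is exactly the kind of thing PROC UNIVARIATE on the original would tell you.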

Ogee · Posted 04-15-2020 01:44 PM

Thanks all for your thoughtful responses to my questions. Much appreciated!
