Hi all,
Could anyone share some simple, efficient code for creating a dummy data set that mimics the structure of an original data set that I may not have access to? The goal is that if I run a PROC MEANS on the dummy data set, I should get nearly the same results as I would from running PROC MEANS on the original data set. Any tip is welcome. Thanks in advance.
If you have reason to believe that your continuous variables are close to normally distributed, you could do something similar to:
proc summary data=sashelp.class;
   var height weight;
   /* autoname creates height_Mean, height_StdDev, weight_Mean, weight_StdDev */
   output out=classsummary mean= stddev= / autoname;
run;
data trial;
   set classsummary;
   do i=1 to 20;   /* number of dummy observations to generate */
      modheight = rand('normal',height_mean,height_stddev);
      modweight = rand('normal',weight_mean,weight_stddev);
      output;
   end;
   keep modheight modweight;
run;
proc means data=trial mean stddev; run;
If you know your data follow some other distribution, a similar approach may be possible: replace 'normal' with the appropriate distribution from the RAND documentation, along with its required parameters. You would have to get the parameters from PROC MEANS/SUMMARY or PROC UNIVARIATE somewhere, i.e. from your existing data.
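As a rough sketch of that substitution, assuming a count variable that is approximately Poisson: the Poisson distribution takes its mean as the only parameter, so a mean from PROC MEANS is enough. The variable names and the mean of 3.2 below are hypothetical, not taken from any real data.

```sas
data trial_counts;
   call streaminit(2024);        /* fixed seed so the run is repeatable */
   visits_mean = 3.2;            /* hypothetical mean, stands in for a PROC MEANS result */
   do i = 1 to 20;
      modvisits = rand('poisson', visits_mean);
      output;
   end;
   keep modvisits;
run;
proc means data=trial_counts mean stddev; run;
```

With only 20 draws the sample mean will wander around the target; more observations bring it closer.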
If you don't have access to the original data set, how do you expect to know its structure?
Continuous or discrete values?
Hint: with discrete values you may be more interested in the percentage of occurrence of each value than in the "mean".
Do you expect to show a similar association with any other variables?
How many dummy observations do you intend to create? How "close" is close enough?
Do you already know the mean, standard deviation (and maybe skewness and kurtosis) of the variables of interest?
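For the discrete case raised in the questions above, one possible sketch uses the 'table' distribution of RAND, which draws a category index according to the probabilities you supply. The data set name, variable names, and the 47%/53% split are hypothetical placeholders for whatever proportions you know about the original data.

```sas
data trial_cat;
   call streaminit(2024);        /* fixed seed so the run is repeatable */
   length modsex $1;
   do i = 1 to 20;
      /* returns 1 with probability 0.47 and 2 with probability 0.53 */
      k = rand('table', 0.47, 0.53);   /* hypothetical proportions */
      modsex = ifc(k = 1, 'F', 'M');
      output;
   end;
   keep modsex;
run;
proc freq data=trial_cat; tables modsex; run;
```

Here PROC FREQ rather than PROC MEANS is the natural check, since the target is a set of percentages.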
You have to decide in advance which features of the original data you want to mimic. You may want to match means, variances, distributions, correlations, ... the list is endless. Is using the original data with anonymised IDs a possibility?
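If correlations are among the features to match, one option (assuming SAS/STAT is licensed and the variables are roughly multivariate normal) is PROC SIMNORMAL, which simulates from the means, standard deviations, and correlations stored in a TYPE=CORR data set. The sketch below uses sashelp.class as a stand-in for the original data; the output data set names are made up for illustration.

```sas
/* Store means, standard deviations and correlations in a TYPE=CORR data set */
proc corr data=sashelp.class noprint outp=classcorr;
   var height weight;
run;
/* Draw 20 bivariate-normal observations with those moments */
proc simnormal data=classcorr outsim=trial_mvn numreal=20 seed=2024;
   var height weight;
run;
/* Check how close the simulated correlation comes to the original */
proc corr data=trial_mvn; var height weight; run;
```

Unlike independent RAND calls per variable, this keeps the association between height and weight in the dummy data.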
Data simulation is well covered in @Rick_SAS's book:
https://support.sas.com/en/books/authors/rick-wicklin.html
Thanks all for your thoughtful response to my questions. Much appreciated!