Ogee
Fluorite | Level 6

Hi all,

Can anyone share some simple, efficient code for creating a dummy data set that mimics the structure of an original data set that one may not have access to? The goal is that running PROC MEANS on the dummy data set should give nearly the same results as running PROC MEANS on the original data set. Any tips are welcome. Thanks in advance.

 

 

Accepted Solutions
ballardw
Super User

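If you have reason to believe that your continuous variables are close to normally distributed, you could do something similar to:

/* Get the mean and standard deviation of each variable */
proc summary data=sashelp.class;
  var height weight;
  output out=classsummary mean= stddev= / autoname;
run;

/* Draw 20 dummy observations from normal distributions with those parameters */
data trial;
  set classsummary;
  call streaminit(12345);  /* optional: arbitrary seed so the draws are repeatable */
  do i = 1 to 20;
    modheight = rand('normal', height_mean, height_stddev);
    modweight = rand('normal', weight_mean, weight_stddev);
    output;
  end;
  keep modheight modweight;
run;

/* Compare against the summary of the original data */
proc means data=trial mean stddev;
run;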

If you know your data follow some other distribution, something similar may be possible by replacing 'normal' with the appropriate distribution from the RAND documentation, along with its required parameters. You would have to get those parameters from somewhere, e.g. from PROC MEANS/SUMMARY or PROC UNIVARIATE run on your existing data.
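
As a rough illustration of swapping in another distribution, here is a minimal sketch that assumes the variable is positive and right-skewed, so that a gamma distribution might be a reasonable stand-in. The choice of gamma, the method-of-moments conversion from mean and standard deviation to shape and scale, the seed, and the data set names wsummary and trial_gamma are all my assumptions for the example, not something taken from the real data.

```sas
/* Sketch only: assumes a gamma distribution is a plausible stand-in */
proc summary data=sashelp.class;
  var weight;
  output out=wsummary mean= stddev= / autoname;
run;

data trial_gamma;
  set wsummary;
  call streaminit(12345);                      /* arbitrary seed, for repeatable draws */
  shape = (weight_mean / weight_stddev)**2;    /* method of moments: mean**2 / variance */
  scale = weight_stddev**2 / weight_mean;      /* method of moments: variance / mean */
  do i = 1 to 20;
    modweight = scale * rand('gamma', shape);  /* unit-scale gamma draw, then rescaled */
    output;
  end;
  keep modweight;
run;

proc means data=trial_gamma mean stddev;
run;
```

With only 20 draws the mean and standard deviation will wander around the targets; increasing the loop count brings them closer.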


8 REPLIES
mkeintz
PROC Star

If you don't have access to the original data set, how do you expect to know its structure?

 

 

Ogee
Fluorite | Level 6
Sorry, I should have been clearer. Let's say you know the variables and metadata of the original data set, or the original data set contains confidential information that prevents you from sharing it. How can one generate a dummy data set that mimics the structure of the original, given only the limited information one knows about it? Thanks in advance.
ballardw
Super User

Continuous or discrete values?

Hint: with discrete values you may be more interested in percentage of occurrence than "mean"

Do you expect to show a similar association with any other variables?

How many dummy observations do you intend to create? How "close" is close enough?

Do you already know the mean, standard deviation (and maybe skewness and kurtosis) of the variables of interest?
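
For the discrete case in the hint above, one possible sketch is to draw category levels with their known percentages of occurrence using the 'table' distribution of RAND. The levels A, B, C, their shares (50%, 30%, 20%), the seed, and the data set name dummy_cat are placeholders I made up for illustration; you would substitute the percentages from your own data.

```sas
/* Sketch only: levels and shares below are made-up placeholders */
data dummy_cat;
  call streaminit(12345);               /* arbitrary seed, for repeatable draws */
  length group $1;
  do i = 1 to 200;
    k = rand('table', 0.5, 0.3, 0.2);   /* returns 1, 2, or 3 with these probabilities */
    group = substr('ABC', k, 1);        /* map the index to a level label */
    output;
  end;
  keep group;
run;

/* Check that the percentages land near the targets */
proc freq data=dummy_cat;
  tables group;
run;
```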

PGStats
Opal | Level 21

You have to decide in advance which features of the original data you want to mimic. You may want to match means, variances, distributions, correlations, ... the list is endless. Is using the original data with anonymised IDs a possibility?

PG
Ogee
Fluorite | Level 6
Unfortunately, using the original data with anonymized IDs is not possible. The variables are continuous, and I want to match at least the means and variances and show a similar association with other variables in the dummy data set (an odds ratio, for example), as would be expected with the original data set. "Close" doesn't have to mean exactly the same, just realistic enough to an ordinary eye. Any more thoughts? Thanks much!
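
One way to approach the "similar association" part for two continuous variables is to generate a correlated pair of normal variables from target means, standard deviations, and a correlation. The sketch below assumes approximate bivariate normality, and every number in it (as well as the names dummy_corr, x, and y) is a hypothetical placeholder rather than a value from any real data set. It covers correlation only; reproducing an odds ratio would need a categorical outcome and is not shown here.

```sas
/* Sketch only: all target values below are hypothetical placeholders */
data dummy_corr;
  call streaminit(12345);    /* arbitrary seed, for repeatable draws */
  m1 = 62;  s1 = 5;          /* hypothetical mean and std dev of x */
  m2 = 100; s2 = 23;         /* hypothetical mean and std dev of y */
  r  = 0.88;                 /* hypothetical target correlation */
  do i = 1 to 200;
    z1 = rand('normal');     /* independent standard normal draws */
    z2 = rand('normal');
    x = m1 + s1 * z1;
    y = m2 + s2 * (r * z1 + sqrt(1 - r**2) * z2);  /* induces corr(x, y) = r */
    output;
  end;
  keep x y;
run;

/* Check means, standard deviations, and the correlation */
proc corr data=dummy_corr;
  var x y;
run;
```

If SAS/STAT is available, PROC SIMNORMAL may be a more convenient route for more than two variables, since it can simulate from a correlation or covariance data set directly.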
Ogee
Fluorite | Level 6

Thanks all for your thoughtful responses to my questions. Much appreciated!
