I found the following syntax to randomly generate age for 200 observations with specified values for mean and S.D. I would also like to restrict the age values to be between 18 and 100 and would like to specify the median value. Could anyone suggest how I can add syntax or modify this code to do so? Thank you.
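%let points = 1;      /* simulated values per input row */
%let mu = 71.1;       /* target mean */
%let sigma = 11.8;    /* target standard deviation */
%let norm = rand('normal', &mu, &sigma);  /* expression text pasted into the data step */
data two;
   set test;
   if _n_ = 1 then call streaminit(123);  /* seed the random stream once */
   do x = 1 to &points;
      age = &norm;
      output;
   end;
run;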
It sounds like you do not want a Normal distribution, so what distribution should be used?
Once you put minimum and maximum values on a distribution, it can't be normal (but it may be approximately normal, depending on your definition of approximately).
We are still waiting for you to tell us what distribution you want.
For a normal distribution the median equals the mean. If you want a median that differs from the mean, then you may not want a normal distribution.
Truncating the upper and lower tails at different numbers of standard deviations from the mean shifts the mean and median of your "normal" distribution. With your stated mean and standard deviation, the proposed range cuts off more of the upper tail (at about 2.4 sigma) than the lower tail (at about -4.5 sigma), so the observed mean and median will tend to be LOWER than the specified mu.
Test the value of each generated age. If it is not in the range, call the rand function again. The much-maligned GOTO and LABEL statements in the data step are one way to do this.
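To put rough numbers on that, here is a small sketch that converts the bounds 18 and 100 into sigma units and evaluates the standard formula for the mean of a normal distribution truncated to an interval. The variable names (lo, hi, a, b, trunc_mean) are only illustrative, and the values of mu and sigma are the ones posted in the question.

```sas
/* Sketch only: how far are the bounds from the mean, and where does
   the mean of the truncated normal end up? Names are illustrative. */
data _null_;
   mu = 71.1;  sigma = 11.8;   /* values from the question */
   lo = 18;    hi = 100;       /* requested age range */
   a = (lo - mu) / sigma;      /* lower bound in sigma units */
   b = (hi - mu) / sigma;      /* upper bound in sigma units */
   /* mean of a normal(mu, sigma) restricted to [lo, hi] */
   trunc_mean = mu + sigma * (pdf('normal', a) - pdf('normal', b))
                           / (cdf('normal', b) - cdf('normal', a));
   put a= 6.2 b= 6.2 trunc_mean= 8.3;
run;
```

The log should show a truncated mean somewhat below mu, because the upper bound removes more probability mass than the lower one.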
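As an alternative to GOTO and LABEL, the same redraw-until-in-range idea can be sketched with a DO UNTIL loop. This is only a minimal illustration: it assumes 200 stand-alone observations rather than reading from the TEST data set, and the data set name sim_age and the variable id are made up for the example.

```sas
/* Sketch only: rejection sampling, i.e. keep redrawing until the
   value falls inside the allowed range. */
data sim_age;
   call streaminit(123);
   mu = 71.1;  sigma = 11.8;          /* values from the question */
   do id = 1 to 200;
      /* DO UNTIL runs the body at least once, so every id gets a fresh draw */
      do until (18 <= age <= 100);
         age = rand('normal', mu, sigma);
      end;
      output;
   end;
   keep id age;
run;
```

Keep in mind that the result is a truncated normal, so its mean and standard deviation will not be exactly mu and sigma. Running PROC MEANS on sim_age is an easy way to see how far off they are.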
Thank you for your reply. I believe I should clarify my goal. I want to create a fake data set based on existing data. So for example, if the existing data set has mean age and S.D. of 78 (11) then I want to create fake data with age that has the same mean and S.D. but also want the min and max age to be the same in the fake data as the real data set. I want to duplicate this for other variables of interest from the original data set.
And median? Or are you dropping that part? Or do you expect the result to match other summary statistics like Skewness, Kurtosis, or some other moment?
Did your original data pass any test for coming from a normal distribution? If not, why did you start with normal data simulation?
Thank you for the question. I would like the variables in the simulated data set to match the summary statistics (and moments? not sure what the difference is?) for each variable from the original data set. So, I would like the simulated data set to have the same mean, S.D., median, skewness, etc. for the continuous variables as the reference data set. I would also like to do the same for the categorical variables. If the reference data set had 60% male and 40% female, I would like the simulated data set to have the same proportions for gender, and likewise for the other categorical variables.
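For the categorical part, one possible sketch uses the 'table' distribution of the RAND function, which returns category 1, 2, ... with the probabilities you list. The data set name sim_sex and the variables id and sex are made up for the example, and each value is drawn independently, so no relationship with other variables is reproduced.

```sas
/* Sketch only: draw a two-level categorical variable with
   target proportions 60% / 40%. */
data sim_sex;
   call streaminit(123);
   length sex $6;
   do id = 1 to 200;
      /* rand('table', 0.6, 0.4) returns 1 with probability 0.6, else 2 */
      if rand('table', 0.6, 0.4) = 1 then sex = 'Male';
      else sex = 'Female';
      output;
   end;
run;
```

With 200 draws the observed split will be close to 60/40 but usually not exact. If the proportions must match exactly, you would need to assign the counts directly rather than draw them at random.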
I am new to data simulation so I am not sure what my first step should be when considering which procedure or method to use. I am not familiar with Monte Carlo, but I know it is used often for data simulation. Proc Surveyselect was recommended, but I'm not sure I understand the differences in these methods to know where to start.
You still have not told us what distribution to use. We need to know this. Saying you want a specific mean, standard deviation, etc. is not enough.
The best solution comes from @PGStats who recommended sampling from the original data, using the distribution of your original data; but you stated you don't have the original data available. So we need to know more than you have told us so far.
The simplest way is to sample from the empirical distribution that you are trying to match. You can use proc surveyselect with the options method=urs and sampsize=200. Statistically, the new sample will have approximately the same moments as the original sample, up to sampling variability.
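A minimal sketch of that resampling step follows. It assumes the original data sit in a data set called original (a placeholder name), and resampled is likewise a made-up output name. The outhits option writes one row per selection, since sampling with replacement can pick the same row more than once.

```sas
/* Sketch only: draw 200 rows with replacement from the original data. */
proc surveyselect data=original out=resampled
                  method=urs      /* unrestricted random sampling (with replacement) */
                  sampsize=200    /* number of rows to draw */
                  outhits         /* one output row per hit */
                  seed=123;       /* fixed seed so the draw is reproducible */
run;
```

Because this copies real rows, it only works when the original data are available, and the output contains actual records rather than newly simulated values.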
I don't have access to the original data set. I need to create (simulate) a data set that will have the same moments (same results for distributions of characteristics, exposure, outcomes, etc) as the results reported from the original data set. I hope that helps clarify? Thanks.
Okay, then we return to the question of which distribution you want to generate, given the median, min, and max.
@PGStats. Thank you for your help with proc surveyselect. I am wondering, in what situations is proc surveyselect most often used? One of the other community members responded that this procedure does not in fact 'simulate' data, and I now understand what was meant by that. I also better understand that my goal is to make sure the new simulated data set cannot be recognized as the original data set (I hope that makes sense?). However, I have read in many discussions on this forum that it is difficult to simulate data for a large number of variables.
Thanks,
MichelleR0