MichelleR0
Fluorite | Level 6

I found the following syntax to randomly generate age for 200 observations with specified values for the mean and S.D. I would also like to restrict the age values to between 18 and 100 and to specify the median value. Could anyone suggest how I can add to or modify this code to do so? Thank you.

 

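%let points = 1;      /* number of ages generated per row of TEST */
%let mu = 71.1;       /* target mean */
%let sigma = 11.8;    /* target standard deviation */
%let norm = rand('normal', &mu, &sigma);   /* text substituted into the data step */

data two;
   set test;
   if _n_ = 1 then call streaminit(123);   /* seed the generator once */
   do x = 1 to &points;
      age = &norm;
      output;
   end;
run;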

28 REPLIES
PaigeMiller
Diamond | Level 26

It sounds like you do not want a Normal distribution, so what distribution should be used?

--
Paige Miller
MichelleR0
Fluorite | Level 6
I’m not sure what the distribution for the data will be - I still need to obtain that information. But if I am told that the minimum age was 45 and the maximum age was 95, how can I restrict the min and max age while also applying the specified mean and S.D.? Could the data not be normally distributed under these circumstances? Thank you.
PaigeMiller
Diamond | Level 26

Once you put minimum and maximum values on a distribution, it can't be normal (but it may be approximately normal, depending on your definition of approximately).

 

We are still waiting for you to tell us what distribution you want.

--
Paige Miller
ballardw
Super User

For a normal distribution the median should be the mean. If you want a different value for median than the mean then you may not want a normal distribution.

 

Chopping off the upper and lower tails at different numbers of standard deviations from the mean will move the mean/median values of your "normal" distribution. With your stated mean and standard deviation for rand 'normal', your proposed range of 18 to 100 cuts off more of the upper tail (at about +2.4 sigma) than the lower tail (at about -4.5 sigma), so it would tend to make the resulting observed mean/median LOWER than the specified mu.
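To put a number on that shift, here is a minimal sketch that computes the two cut points in sigma units and the theoretical mean of a normal distribution truncated to [18, 100], using the standard truncated-normal mean formula. The variable names (lo, hi, a, b, trunc_mean) are illustrative and not from the original posts.

```sas
data _null_;
   mu    = 71.1;      /* mean from the original post */
   sigma = 11.8;      /* standard deviation from the original post */
   lo    = 18;        /* lower bound requested */
   hi    = 100;       /* upper bound requested */

   a = (lo - mu) / sigma;   /* lower cut point in sigma units */
   b = (hi - mu) / sigma;   /* upper cut point in sigma units */

   /* mean of a normal truncated to [lo, hi] */
   trunc_mean = mu + sigma * (pdf('normal', a) - pdf('normal', b))
                           / (cdf('normal', b) - cdf('normal', a));

   put a= b= trunc_mean=;   /* values are written to the SAS log */
run;
```

With these inputs the truncated mean should come out slightly below mu, because the upper cut removes more probability mass than the lower one.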

 

Test the value of each generated age. If not in the range, then do the call to the rand function again. Look at the much maligned GOTO and LABEL in the data step for one way.
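The redraw idea can be sketched as follows. This version uses a DO UNTIL loop rather than the GOTO and LABEL approach mentioned above, and it hard-codes the 18 to 100 range and 200 observations from the original question. The data set name fake_age and the variable id are assumptions made for illustration.

```sas
data fake_age;
   call streaminit(123);            /* fixed seed for reproducibility */
   do id = 1 to 200;
      /* draw again until the value falls inside the allowed range */
      do until (18 <= age <= 100);
         age = rand('normal', 71.1, 11.8);
      end;
      output;
   end;
run;

/* check how far the observed statistics drift from the targets */
proc means data=fake_age n mean std median min max;
   var age;
run;
```

Because rejected draws are discarded, the result follows a truncated normal distribution, so the observed mean and standard deviation will not match mu and sigma exactly.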

 

 

MichelleR0
Fluorite | Level 6

Thank you for your reply.  I believe I should clarify my goal.  I want to create a fake data set based on existing data.  So for example, if the existing data set has mean age and S.D. of 78 (11) then I want to create fake data with age that has the same mean and S.D. but also want the min and max age to be the same in the fake data as the real data set.  I want to duplicate this for other variables of interest from the original data set.

ballardw
Super User

@MichelleR0 wrote:

Thank you for your reply.  I believe I should clarify my goal.  I want to create a fake data set based on existing data.  So for example, if the existing data set has mean age and S.D. of 78 (11) then I want to create fake data with age that has the same mean and S.D. but also want the min and max age to be the same in the fake data as the real data set.  I want to duplicate this for other variables of interest from the original data set.


And median? Or are you dropping that part? Or do you expect the result to match other summary statistics like Skewness, Kurtosis, or some other moment?

 

Did your original data pass any test for coming from a normal distribution? If not, why did you start with normal data simulation?

MichelleR0
Fluorite | Level 6

Thank you for the question. I would like the variables in the simulated data set to match the summary statistics (and moments? not sure what the difference is?) for each variable from the original data set. So, I would like the simulated data set to have the same mean, S.D., median, skewness, etc., for the continuous variables as the reference data set. I would also like to do the same for the categorical variables. If the reference data set had 60% male and 40% female, I would like the simulated data set to have the same proportions for gender, and likewise for the other categorical variables.

 

I am new to data simulation so I am not sure what my first step should be when considering which procedure or method to use.  I am not familiar with Monte Carlo, but I know it is used often for data simulation.  Proc Surveyselect was recommended, but I'm not sure I understand the differences in these methods to know where to start.
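For the categorical case described above, one possible starting point is to draw each value from a discrete distribution with the reported proportions. The sketch below uses the 60% male / 40% female example. The data set name fake_cat and the variables id and gender are hypothetical.

```sas
data fake_cat;
   call streaminit(123);            /* fixed seed for reproducibility */
   length gender $6;
   do id = 1 to 200;
      /* rand('table', p1, p2) returns 1 with probability p1 and 2 with probability p2 */
      if rand('table', 0.6, 0.4) = 1 then gender = 'Male';
      else gender = 'Female';
      output;
   end;
run;

/* compare the simulated proportions with the 60/40 target */
proc freq data=fake_cat;
   tables gender;
run;
```

The simulated proportions vary randomly around 60/40 rather than matching them exactly. If exact counts are required, the values would have to be assigned deterministically instead.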

PaigeMiller
Diamond | Level 26

You still have not told us what distribution to use. We need to know this. Saying you want a specific mean, standard deviation, etc. is not enough.


The best solution comes from @PGStats who recommended sampling from the original data, using the distribution of your original data; but you stated you don't have the original data available. So we need to know more than you have told us so far.

--
Paige Miller
MichelleR0
Fluorite | Level 6
I would like to use normal distribution for age. For weight, a nearly normal distribution, slight skewness to the right. Does that help? If I have a data set to use as a model, I am assuming I will apply different distributions as each variable would not necessarily have the same distribution. Is that correct?
PGStats
Opal | Level 21

The simplest way is to sample from the empirical distribution that you are trying to match. You can use proc surveyselect with options method=urs and sampsize=200. Statistically, the new sample will have the same moments as the original sample.

PG
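A minimal sketch of the resampling approach described above, assuming the original data is available as a data set named original (a hypothetical name). The OUTHITS option and the SEED value are additions not mentioned in the post: OUTHITS writes one output row per selection, so the result has 200 rows even when an observation is drawn more than once.

```sas
/* sample 200 rows with replacement (unrestricted random sampling) */
proc surveyselect data=original out=fake
                  method=urs sampsize=200
                  outhits seed=123;
run;
```

The resampled rows are copies of real observations, so the moments of the new sample match those of the original only on average, not exactly.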
MichelleR0
Fluorite | Level 6

I don't have access to the original data set.  I need to create (simulate) a data set that will have the same moments (same results for distributions of characteristics, exposure, outcomes, etc) as the results reported from the original data set.  I hope that helps clarify?  Thanks.

PaigeMiller
Diamond | Level 26

Okay, then we return to the issue: what distribution do you want to generate, with a given median, min, and max?

--
Paige Miller
MichelleR0
Fluorite | Level 6
Thank you, PG. Per your suggestion, is there a way to keep the exposure fixed to 100 exposed and 100 unexposed? When I run proc surveyselect, it is assigning the appropriate moments to the independent variables, however, the proportion of exposed versus unexposed has changed so the moments are not the same when stratifying by exposure.
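One way to hold the group sizes fixed is to sample within each exposure group by using a STRATA statement. The sketch below assumes a source data set named original with an exposure indicator named exposed. Both names are hypothetical.

```sas
/* PROC SURVEYSELECT expects the input sorted by the strata variable */
proc sort data=original out=original_sorted;
   by exposed;
run;

/* draw 100 rows with replacement from each exposure group */
proc surveyselect data=original_sorted out=fake
                  method=urs sampsize=100
                  outhits seed=123;
   strata exposed;
run;
```

With a STRATA statement, SAMPSIZE=100 applies to each stratum separately, so the output has 100 exposed and 100 unexposed rows.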
MichelleR0
Fluorite | Level 6

@PGStats. Thank you for your help with proc surveyselect. I am wondering, in what situations is proc surveyselect most often used? One of the other community members responded that this procedure does not in fact 'simulate' data - I now understand what was meant by that. I also better understand that my goal is to make sure the new simulated data set cannot be recognized as the original data set (I hope that makes sense?). However, I have read in many discussions on this forum that it is difficult to simulate data for a large number of variables.

Thanks,
MichelleR0
