> I want to create fake data with age that has the same mean and S.D. but also want the
> min and max age to be the same in the fake data as the real data set.
Without getting too technical, let me briefly say that you don't actually want to generate new data that have the same mean/SD/min/max as the DATA. Rather, you want to assume that the moments of the data are good estimates for the underlying data-generating process that produced the data in the first place. You then simulate from a DISTRIBUTION that has those moments. Due to sampling variability, the simulated data will not have exactly the same mean/SD as the data, and this is GOOD for various reasons.
That said, there are several ways to accomplish what you want:
1. Resample from the data. This is called the bootstrap method and is equivalent to sampling from the empirical distribution of the data. Unfortunately, you say that you do not have access to the original data, so this method is not available.
2. If you have a table of percentiles for the original data, you can sample from the approximate empirical distribution.
3. If you want to find a distribution that matches the moments of the data, you can perform a moment-matching computation. However, this requires that you choose a distribution to simulate from.
4. You can use a flexible system of distributions to match the data. For example, you can use the Johnson SB and Johnson SU systems. However, to fit these systems, you need the data or at least percentiles.
5. You can use the PERT distribution, which requires only estimates of the min, mode, and max. (The PERT distribution is a special beta distribution; modeling a general beta distribution is also possible.)
If you don't have data or a table of percentiles, your options are limited. You might try the PERT distribution for a variable such as AGE, where you know the min, max, and mean/median.
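As a minimal sketch of the PERT option, the DATA step below simulates ages by rescaling a beta variate to the interval [min, max]. It uses the standard PERT shape parameters, a = 1 + 4(mode - min)/(max - min) and b = 1 + 4(max - mode)/(max - min). The values of minAge, modeAge, and maxAge, the seed, the sample size, and the data set name SimAge are all hypothetical placeholders, not values from your data. The mean of this distribution is (min + 4*mode + max)/6, so if you know the mean rather than the mode, you could solve that equation for the mode. Note that this form of the PERT distribution does not let you set the SD independently.

```sas
/* Sketch: simulate ages from a PERT distribution.
   minAge, modeAge, and maxAge are hypothetical; substitute your own estimates. */
data SimAge;
   call streaminit(12345);                /* arbitrary seed */
   minAge = 18; modeAge = 45; maxAge = 90;
   range = maxAge - minAge;
   a = 1 + 4*(modeAge - minAge)/range;    /* first beta shape parameter  */
   b = 1 + 4*(maxAge - modeAge)/range;    /* second beta shape parameter */
   do i = 1 to 1000;                      /* arbitrary sample size */
      age = minAge + range*rand("Beta", a, b);   /* rescale from (0,1) to (min,max) */
      output;
   end;
   keep age;
run;

/* Compare the sample statistics with the values you were targeting */
proc means data=SimAge n mean std min max;
   var age;
run;
```

Every simulated age lies strictly between minAge and maxAge, but the sample min and max will usually not equal those bounds exactly.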
1. There are two kinds of simulation: parametric and nonparametric. In a parametric simulation, you fit a model to the data to obtain parameter estimates. You then assume that the estimates are the actual parameters and simulate from the model that has those parameters. For example, you might fit a normal distribution to data and find that the estimates are mu=1.23 and sigma=4.56. You would then simulate data from N(1.23, 4.56). In a nonparametric simulation, you use the bootstrap method to sample directly from the empirical distribution.
2. No. Your understanding is not correct. The input to PROC SURVEYSELECT is the original data. Please read about the Basic Bootstrap in "The essential guide to bootstrapping in SAS."
3. Regarding: "is it appropriate to place these restrictions on the simulated data?" You need to decide on the model FIRST, then simulate from that model. If you decide that the model is an unbounded distribution (for example, normal or exponential), then you might get values that are outside the range of the data. In many cases that is fine. In other situations (negative ages, extreme heights,...) that is not okay. If it is not okay, then you should choose a different model, such as a bounded distribution.
4. You can do stratified sampling with PROC SURVEYSELECT. Whether you should get "the same number of patients in the exposed and unexposed groups as the original data set" depends on the original sampling design. For example, if the original design is "select 100 people at random," I might get 52 males and 48 females in the original data. If I simulate that process, it is okay for each sample to have a different proportion. On the other hand, if the original design is "select 50 males and 50 females," then you would want each simulated sample to have the same proportions.
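To make items 1 and 3 concrete, here is a small sketch of a parametric simulation from the N(1.23, 4.56) example. It also counts how many simulated values are negative, which is the kind of out-of-range value an unbounded model can produce. The seed, the sample size, and the names SimNormal, x, and isNegative are arbitrary choices for illustration.

```sas
/* Sketch: parametric simulation from a fitted normal model */
data SimNormal;
   call streaminit(54321);                /* arbitrary seed */
   do i = 1 to 1000;                      /* arbitrary sample size */
      x = rand("Normal", 1.23, 4.56);     /* arguments are the mean and the standard deviation */
      isNegative = (x < 0);               /* 1 if the value would be invalid for a variable such as age */
      output;
   end;
run;

/* The sample mean and SD are close to, but not equal to, 1.23 and 4.56 */
proc means data=SimNormal n mean std min max;
   var x;
run;

/* Proportion of simulated values that fall below zero */
proc freq data=SimNormal;
   tables isNegative;
run;
```

If the proportion of invalid values is not acceptable for your application, that is a sign to choose a bounded model rather than to discard or clip values after the fact.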
IMHO, your questions go beyond what can easily be handled on this forum. They are conceptual questions about how to construct a simulation, not SAS programming questions. You might consider consulting with a statistician, reading a book about simulation, or otherwise learning more about how simulation needs to reflect the data-generating mechanism for the data.
Good luck!
If you post your SAS code, we will be able to help you.
I think your code is okay, but perhaps you are unaware that, to save space, the default behavior of SURVEYSELECT is to create a frequency variable (named NumberHits). So you can check that each group has the correct frequency by using
proc freq data=mi1;
where Replicate<5;
weight NumberHits;
tables trt*Replicate / norow nocol nopercent;
run;
If you don't want the frequency variable, use the OUTHITS option on PROC SURVEYSELECT, like this:
proc surveyselect data = one method = urs seed = 3579 out = mi1 reps=1000 n=100 OUTHITS; /* USE OUTHITS */
strata trt;
run;
proc freq data=mi1;
where Replicate<5;
tables trt*Replicate / norow nocol nopercent;
run;
In the output data set, the variable NumberHits contains the number of times each observation was selected (remember METHOD=URS is sampling with replacement). If you want to get multiple copies of observations that were selected more than once, use the OUTHITS option in the PROC SURVEYSELECT statement.
Please read comments above by Rick and myself about option OUTHITS.
> I get the desired 300 unique observations per exposure group. However, how do I resample and output a data set so that my new simulated data set has the same moments, but does not have the identical observations as the original data set?
Just to clarify, when you use PROC SURVEYSELECT, you are resampling from your data. (Technically, this is not a simulation, it is a bootstrap.) Each resample is from the empirical distribution of the data, so ON AVERAGE the samples have the same moments as the original data. (You don't want each sample to have identical moments because that would destroy the sampling variability, which is an essential part of resampling and simulation.)
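One way to see the "ON AVERAGE" behavior is to compute the mean and SD of each resample and then look at how those statistics are distributed. The sketch below assumes that mi1 was created WITHOUT the OUTHITS option (so the NumberHits variable exists) and that the analysis variable is named age; the name age is a guess, and BootStats, BootMean, and BootSD are names chosen only for this illustration. The statistics are pooled over both trt groups.

```sas
/* Sketch: mean and SD of each bootstrap resample */
proc means data=mi1 noprint nway;
   class Replicate;          /* one output row per resample */
   freq NumberHits;          /* count each observation as often as it was selected */
   var age;                  /* hypothetical variable name */
   output out=BootStats mean=BootMean std=BootSD;
run;

/* Distribution of the per-resample statistics */
proc means data=BootStats n mean std min max;
   var BootMean BootSD;
run;
```

The average of BootMean should be close to the mean of the original data, while the individual values of BootMean vary from resample to resample. If you created mi1 with OUTHITS, omit the FREQ statement.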
None of your samples have "identical observations" as the original data, but all of them are resamples (with replacement), which means that each observation in the resample is also an original observation.
If you need to review the properties of bootstrap samples, as well as how to perform bootstrap analyses in SAS, see the article "The essential guide to bootstrapping in SAS."