Rick_SAS
SAS Super FREQ

 

> I want to create fake data with age that has the same mean and S.D. but also want the

> min and max age to be the same in the fake data as the real data set.

 

Without getting too technical, let me briefly say that you don't actually want to generate new data that have the same mean/SD/min/max as the DATA. Rather, you want to assume that the moments of the data are good estimates for the underlying data-generating process that produced the data in the first place. You then simulate from a DISTRIBUTION that has those moments. Due to sampling variability, the simulated data will not have exactly the same mean/SD as the data, and this is GOOD for various reasons.

 

That said, there are several ways to accomplish what you want:

1. Resample from the data. This is called the bootstrap method and is equivalent to sampling from the empirical distribution of the data. Unfortunately, you say that you do not have access to the original data, so this method is not available.

2. If you have a table of percentiles for the original data, you can sample from the approximate empirical distribution.

3. If you want to find a distribution that matches the moments of the data, you can perform a moment-matching computation. However, this requires that you choose a distribution to simulate from.

4. You can use a flexible system of distributions to match the data. For example, you can use the Johnson SB and Johnson SU systems. However, to fit these systems, you need the data or at least percentiles.

5. You can use the PERT distribution, which requires only estimates of the min, mode, and max. (The PERT distribution is a special beta distribution; modeling a general beta distribution is also possible.)

 

If you don't have data or a table of percentiles, your options are limited. You might try the PERT distribution for a variable such as AGE, where you know the min, max, and mean/median.  
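
As a rough illustration of the PERT option, here is a minimal sketch that simulates AGE as a rescaled beta variate. It assumes the standard PERT shape parameters, 1 + 4(mode - min)/(max - min) and 1 + 4(max - mode)/(max - min). The min, mode, max, sample size, seed, and data set name are all hypothetical placeholders, not values from your study.

```sas
/* Sketch: simulate AGE from a PERT distribution (a rescaled beta).   */
/* All numeric values below are hypothetical placeholders.            */
data SimPERT;
   call streaminit(12345);                /* arbitrary seed */
   a = 18;  m = 55;  b = 70;              /* hypothetical min, mode, max */
   shape1 = 1 + 4*(m - a)/(b - a);        /* first beta shape parameter  */
   shape2 = 1 + 4*(b - m)/(b - a);        /* second beta shape parameter */
   do i = 1 to 600;                       /* hypothetical sample size */
      age = a + (b - a)*rand("Beta", shape1, shape2);
      output;
   end;
   keep age;
run;

proc means data=SimPERT n mean std min max;
   var age;
run;
```

The simulated values always lie between the chosen min and max, but the sample min and max will usually not equal those bounds exactly, and the sample mean and SD will vary from run to run.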

 

 

MichelleR0
Fluorite | Level 6
Thank you for your reply and for providing additional details in response to my questions. I was informed this morning that at a future date, I will have access to the original data. With that, could you please provide clarity on the following:

1. Can you explain further what you meant by, 'You then simulate from a DISTRIBUTION that has those moments.'?
2. Given that I will have access to the original data, I understand that I could use the bootstrap method in Proc Surveyselect. Can you confirm my understanding that I 1) simulate data (i.e. create a new data set with fake data) based on the moments from the original data set and then 2) run Proc Surveyselect with method=bootstrap on the simulated/fake data? Do I understand correctly that the original data is only used to inform the simulation process?
3) If the simulated data has values that are beyond the original data (and inclusion criteria), is there a way to restrict this? For example, if the original data included patients >18 and <70 years of age with a normally distributed mean age of 55, and the simulated data included patients <18 or >70 years of age, is it appropriate to place these restrictions on the simulated data?
4) Also, my understanding is that I should have the same number of patients in the exposed and unexposed groups as the original data set. When I simulate the data and then use Proc Surveyselect to draw a sample of 200 (out of 20,000 in the simulated data set), I no longer get the same number of patients in the exposed and unexposed groups. This is causing large variations in the variable distributions when stratifying by exposure group.
Rick_SAS
SAS Super FREQ

1. There are two kinds of simulation: parametric and nonparametric. In a parametric simulation, you fit a model to the data to obtain parameter estimates. You then assume that the estimates are the actual parameters and simulate from the model that has those parameters. For example, you might fit a normal distribution to data, find that the estimates are mu=1.23 and sigma=4.56. You would then simulate data from N(1.23, 4.56).  In a nonparametric simulation, you use the bootstrap method to sample directly from the empirical distribution.

 

2. No. Your understanding is not correct. The input to PROC SURVEYSELECT is the original data. Please read about the Basic Bootstrap in "The essential guide to bootstrapping in SAS."


3) Regarding: "is it appropriate to place these restrictions on the simulated data?" You need to decide on the model FIRST, then simulate from that model. If you decide that the model is an unbounded distribution (for example, normal or exponential), then you might get values that are outside the range of the data. In many cases that is fine. In other situations (negative ages, extreme heights,...) that is not okay. If it is not okay, then you should choose a different model, such as a bounded distribution.

 

4) You can do stratified sampling with PROC SURVEYSELECT. It is not necessarily true that you should get "the same number of patients in the exposed and unexposed groups as the original data set." It depends on the original sampling design. For example, if the original design is "select 100 people at random," I might get 52 males and 48 females in the original data. If I simulate that process, it is okay that each sample has a different proportion. On the other hand, if the original design is "select 50 males and 50 females," then you would want each simulated sample to have the same proportions.
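
To make points 1 and 3 concrete, here is a minimal sketch of a parametric simulation that draws from an unbounded model and from a bounded alternative. The mean of 55 and the bounds 18 and 70 come from the question. The standard deviation, sample size, seed, and variable names are hypothetical. The bounded model here is a truncated normal obtained by redrawing until the value is in range, which is only one of several possible bounded choices.

```sas
/* Sketch: parametric simulation from an unbounded and a bounded model. */
data SimAge;
   call streaminit(54321);            /* arbitrary seed */
   mu = 55;  sigma = 10;              /* mu from the question; sigma is hypothetical */
   lower = 18;  upper = 70;           /* inclusion criteria from the question */
   do i = 1 to 200;                   /* hypothetical sample size */
      /* unbounded model: values can fall outside (lower, upper) */
      ageNormal = rand("Normal", mu, sigma);
      /* bounded model: truncated normal by rejection (redraw until in range) */
      do until (lower < ageTrunc < upper);
         ageTrunc = rand("Normal", mu, sigma);
      end;
      output;
   end;
   keep ageNormal ageTrunc;
run;

proc means data=SimAge n mean std min max;
   var ageNormal ageTrunc;
run;
```

Note that truncation changes the model: mu and sigma are the parameters of the untruncated normal, so the mean and SD of ageTrunc will generally differ from them. That is a consequence of choosing a different model, not a bug in the simulation.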

 

IMHO, your questions go beyond what can easily be handled on this forum. They are conceptual questions about how to construct a simulation, not SAS programming questions. You might consider consulting with a statistician, reading a book about simulation, or otherwise learning more about how simulation needs to reflect the data-generating mechanism for the data.

 

Good luck!

MichelleR0
Fluorite | Level 6
The original design was to select 100 exposed and 100 unexposed. Using method=urs, I tried using the strata option for the variable that assigns exposure status. It returns a sample with an equal amount of subjects in each group, but less than n=100.
Rick_SAS
SAS Super FREQ

If you post your SAS code, we will be able to help you.

MichelleR0
Fluorite | Level 6
Let me know if you need additional information. Thank you.

proc surveyselect data = one method = urs seed = 3579 out = mi1 reps=1000 n=100;
strata trt;
run;
Rick_SAS
SAS Super FREQ

I think your code is okay, but perhaps you are unaware that, to save space, the default behavior of SURVEYSELECT is to write each selected observation once and to create a frequency variable (named NumberHits) that counts how many times it was selected. So you can check that each group has the correct frequency by using

 

proc freq data=mi1;
where Replicate<5;
weight NumberHits;
tables trt*Replicate / norow nocol nopercent;
run;

If you don't want the frequency variable, use the OUTHITS option on PROC SURVEYSELECT, like this:

 

proc surveyselect data = one method = urs seed = 3579 out = mi1 reps=1000 n=100 OUTHITS;  /* USE OUTHITS */
strata trt;
run; 

proc freq data=mi1;
where Replicate<5;
tables trt*Replicate / norow nocol nopercent;
run;
PGStats
Opal | Level 21

In the output data set, the variable NumberHits contains the number of times each observation was selected (remember that METHOD=URS is sampling with replacement). If you want to get multiple copies of observations that were selected more than once, use the OUTHITS option in the PROC SURVEYSELECT statement.

PG
MichelleR0
Fluorite | Level 6
My original data has the following moments for age:

N Mean Std Dev Minimum Maximum
600 75.6240017 11.7483702 36.6307132 112.3859522

I used the following SAS code to simulate data, wanting a sample size of n=600.
proc surveyselect data=one method=urs reps=10000 n=600 seed=40070 out=mi;
run;

proc means data=mi;
var age;
where replicate = 2;
run;

The output is as follows:
Selection Method Unrestricted Random Sampling
Input Data Set ONE
Random Number Seed 40070
Sample Size 600
Expected Number of Hits 1
Sampling Weight 1
Number of Replicates 10000
Total Sample Size 6000000
Output Data Set MI

N Mean Std Dev Minimum Maximum
375 75.4457998 12.0702549 36.6307132 112.3859522

Can you explain why N=375 and not 600? Thank you.

PGStats
Opal | Level 21

Please read comments above by Rick and myself about option OUTHITS.
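
As a minimal sketch of the alternative to OUTHITS: the N=375 is the number of distinct observations selected in that replicate, and each row stands for NumberHits selections. When you draw 600 with replacement from 600, you expect roughly 1 - 1/e, or about 63%, of the observations to appear at least once, which is about 379 rows. Adding a FREQ statement makes PROC MEANS count each row NumberHits times. The data set and variable names are the ones from the post above.

```sas
/* Sketch: summarize one replicate, counting each row NumberHits times */
proc means data=mi n mean std min max;
   where Replicate = 2;
   freq NumberHits;     /* each row represents NumberHits selections */
   var age;
run;
```

With the FREQ statement, the reported N should be 600 for every replicate.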

PG
MichelleR0
Fluorite | Level 6
Using the following code:
proc surveyselect data = one method = urs seed = 35 out = one_sim reps=1000 n = 300 OUTHITS;
strata trt;
run;
I get the desired 300 unique observations per exposure group. However, how do I resample and output a data set so that my new simulated data set has the same moments, but does not have the identical observations as the original data set?
Rick_SAS
SAS Super FREQ

> I get the desired 300 unique observations per exposure group. However, how do I resample and output a data set so that my new simulated data set has the same moments, but does not have the identical observations as the original data set?

 

Just to clarify, when you use PROC SURVEYSELECT, you are resampling from your data. (Technically, this is not a simulation, it is a bootstrap.)  Each resample is from the empirical distribution of the data, so ON AVERAGE the samples have the same moments as the original data. (You don't want each sample to have identical moments because that would destroy the sampling variability, which is an essential part of resampling and simulation.)

 

None of your samples have "identical observations" as the original data, but all of them are resamples (with replacement), which means that each observation in the resample is also an original observation.

 

If you need to review the properties of bootstrap samples, as well as how to perform bootstrap analyses in SAS, see the article "The essential guide to bootstrapping in SAS."
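
To illustrate the "on average" point, here is a minimal sketch of a basic bootstrap for the mean of age. It resamples the original data, computes the statistic for each replicate, and then summarizes the distribution of those statistics. The output data set names, the seed, and the number of replicates are hypothetical choices.

```sas
/* Sketch: bootstrap distribution of the mean of AGE */
proc surveyselect data=one method=urs samprate=1 reps=1000
                  seed=12345 out=boot noprint;
run;

/* one mean per bootstrap resample; NumberHits gives the multiplicity */
proc means data=boot noprint;
   by Replicate;
   freq NumberHits;
   var age;
   output out=bootstats mean=MeanAge;
run;

/* 2.5th and 97.5th percentiles of the bootstrap means */
proc univariate data=bootstats noprint;
   var MeanAge;
   output out=ci pctlpts=2.5 97.5 pctlpre=MeanAge_;
run;

proc print data=ci noobs;
run;
```

The individual values of MeanAge differ from replicate to replicate, which is the sampling variability described above. Their average should be close to the mean of the original data.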

MichelleR0
Fluorite | Level 6
I think I misunderstood the use of proc surveyselect. My goal was to create a fake data set that would have similar aggregated distributions of independent and dependent variables to use for analyses. Based on previous posts I thought I could use proc surveyselect to do that instead of a data step to make the process easier.

From my understanding of the link you shared, bootstrapping is used when you want to see the possible range of estimates for a specific measure (e.g. relative risk) for the empirical data.
