Hello all,
I have known parameter values (mean and std of the population distribution) and one particular year's data (data1) from that population. I am trying to use bootstrap sampling to draw a sample from the data 1 to meet the population parameter values of mean and std. data1 is fairly large, about n = 60,000. Is this possible?
thank you
leex1514
Start here:
https://blogs.sas.com/content/iml/2018/12/12/essential-guide-bootstrapping-sas.html
This part doesn't sound like bootstrapping at all:
to draw a sample from the data 1 to meet the population parameter values of mean and std
and maybe you should explain further.
Interesting question.
Calling @Rick_SAS
By "bootstrap sample" I assume you mean "sample with replacement" with size n.
In general, for small samples, you shouldn't expect to be able to get the exact values. For example, in a sample that has two observations with values 0 and 1, the only possible means are 0, 0.5, and 1.
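To spell out the two-observation example: sampling with replacement and keeping track of order gives four equally likely resamples, which collapse to three distinct means.

```latex
\[
\begin{array}{c|c}
\text{resample} & \text{mean} \\ \hline
(0,0) & 0 \\
(0,1) & 0.5 \\
(1,0) & 0.5 \\
(1,1) & 1
\end{array}
\]
```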
However, there are n^n ordered "bootstrap samples" of size n from a sample of size n (and C(2n-1, n) distinct ones if order is ignored), so that's a lot of combinations. So, yes, you should be able to get reasonably close to the parameter values, assuming that the sample is representative of the population.
But before we talk about possible ways to make this happen, may I ask what you are trying to achieve and why? What is the purpose of manufacturing a new set of data that has exactly the same mean and SD as some parameters? The field of statistics was developed to analyze the data that you have and make inferences about the population parameters. Modifying the data is not required or recommended.
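If a matched resample is still wanted, one brute-force possibility is to draw many resamples with replacement and keep the one whose mean and SD land closest to the targets. The following is only a sketch under stated assumptions: the variable name theta, the target values, and the number of replicates are all hypothetical placeholders, and this is not necessarily the approach anyone in this thread had in mind.

```sas
/* Assumptions: data1 contains a numeric variable named theta (hypothetical).
   The targets and replicate count below are placeholders. */
%let target_mean = 0;
%let target_sd   = 1;
%let n_reps      = 200;

/* Draw &n_reps resamples with replacement, each the same size as data1 */
proc surveyselect data=data1 out=boot seed=123
                  method=urs samprate=1 outhits reps=&n_reps noprint;
run;

/* Mean and SD of each resample */
proc means data=boot noprint;
   by Replicate;
   var theta;
   output out=boot_stats mean=boot_mean std=boot_sd;
run;

/* Distance of each resample from the targets */
data boot_stats;
   set boot_stats;
   dist = sqrt((boot_mean - &target_mean)**2 + (boot_sd - &target_sd)**2);
run;

/* Smallest distance first */
proc sort data=boot_stats;
   by dist;
run;

proc print data=boot_stats(obs=5);
   var Replicate boot_mean boot_sd dist;
run;
```

Keep in mind that with n near 60,000 the resample means vary only by roughly SD/sqrt(n) around the mean of data1, so this search can succeed only when data1 already sits very close to the target values.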
@leex1514 wrote:
I am trying to use this sample to calibrate items (of students of certain ability distribution). The purpose is to compare item parameter estimates of my data1 and of matched sample to the population distribution of ability.
Compare these two things to learn what? Why do you need a sample if you have the entire population?
Okay, I know nothing about Item Response Theory, but I do know that SAS has PROC IRT. Does that help?
The IRT procedure enables you to estimate various item response theory models. The following list summarizes some of the basic features of the IRT procedure:
- uses the Rasch model; one-, two-, three-, and four-parameter models; graded response model with logistic or probit link; and generalized partial credit model
- enables different items to have different response models
- performs multidimensional exploratory and confirmatory analysis
- performs multiple-group analysis, with fixed values and equality constraints within and between groups
- estimates factor scores by using maximum likelihood (ML), maximum a posteriori (MAP), and expected a posteriori (EAP) methods
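A minimal call might look like the sketch below. It assumes that data1 stores binary item responses in variables named item1-item10 (hypothetical names) and that a two-parameter model is wanted; check the PROC IRT documentation before relying on the options shown.

```sas
/* Assumption: data1 holds binary responses in item1-item10 (hypothetical names) */
proc irt data=data1 resfunc=twop;   /* two-parameter response function for all items */
   var item1-item10;
run;
```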
This sounds something like a simulation of a dataset with known mean and sd, but with an unknown distribution. If that is the case, then you can rely on the central limit theorem and do something like:
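data sample(keep=X);
   /* placeholder values: replace with your known parameters */
   known_mean = 0;
   known_sd = 1;
   call streaminit(123);            /* fix the seed so the draw is reproducible */
   do j=1 to 60000;
      X = rand('Normal', known_mean, known_sd);
      output;
   end;
run;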
where known_mean and known_sd are the parameter values you have.
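A simulated sample will only match the parameters approximately. If the sample mean and SD need to equal the targets exactly, one option is to rescale the draw with PROC STANDARD. This is a sketch, not necessarily what was intended above; the macro variable values are hypothetical placeholders.

```sas
/* Hypothetical placeholder targets; replace with the known parameters */
%let known_mean = 50;
%let known_sd   = 10;

data sample(keep=X);
   call streaminit(123);
   do j = 1 to 60000;
      X = rand('Normal', &known_mean, &known_sd);
      output;
   end;
run;

/* Rescale X so that its sample mean and SD equal the targets */
proc standard data=sample out=sample_exact mean=&known_mean std=&known_sd;
   var X;
run;

/* Confirm the result */
proc means data=sample_exact n mean std;
   var X;
run;
```

The rescaling is a linear transformation, so it preserves the shape of the simulated distribution while shifting its location and scale.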
I would also recommend looking at @Rick_SAS's blog, and especially this paper, should you decide on a simulation approach:
https://support.sas.com/resources/papers/proceedings15/SAS1387-2015.pdf
If your data1 looks like a mixture of distributions or something unusual, this paper gives you some approaches.
SteveDenham
Then the means and SDs are for each item?
SteveDenham