Hi, there.
I have large data containing 20,000 people's total test score.
ex) Total Score
member1 727
member2 566
member3 661
member4 525
member5 609
.......
I want to extract exactly 2,818 people's information.
And I also want to 2,818 people's average total score is between 560~565.
Any help and tips will be much appreciated !
Thanks, Jamie.
You could balance your random sampling like this:
/* Generate some fake data */
data examplePop;
do member = 1 to 20000;
score = round(1000*rand("uniform"));
output;
end;
run;
%let mid=562.5;
%let size=2818;
/* Split the data into two subsets, lower and higher than the target value.
Add random value for next step */
data lower higher;
set examplePop;
rnd = rand("uniform");
if score < &mid then output lower;
else output higher;
run;
/* Put both subsets into a random order */
proc sort data=lower out=lower(drop=rnd); by rnd; run;
proc sort data=higher out=higher(drop=rnd); by rnd; run;
/* Pick values from the subset that is on the opposite side from the target. If
the current mean is lower than the target, pick a value at random from the
the higher subset, and vice-versa. */
data want;
if sum < &mid * n then set higher;
else set lower;
output;
n + 1;
sum + score;
if n = &size then stop;
drop n sum;
run;
/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std
from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std
from want;
quit;
EDIT: Added comments.
That's not the definition of a random sample 🙂
I think one of the quickest ways is to randomly sample, calculate the average, ensure it meets the criteria and then stop or repeat.
Or limit the range of scores allowed. Is it ok for one person to have a score of 825 and another 300 as long as the average is between 560-565? That's a tight range as well....
EDIT: PROC SURVEYSELECT is typically the method to implement a random sample in SAS.
I think score between 560-565 sampling will be much better.
Then, how can I limit the range of score of sampling...?
Your help will be much appreciated!
Thanks, Jamie.
I was also facing similar kind of issue(even though not exactly same) . Do you have anything in mind as such if is is okay for me to have one person with a score of 825 and another 300 as long as the average is between 560-565. Since same thing only will happen with me in most of the cases.
You could balance your random sampling like this:
/* Generate some fake data */
data examplePop;
do member = 1 to 20000;
score = round(1000*rand("uniform"));
output;
end;
run;
%let mid=562.5;
%let size=2818;
/* Split the data into two subsets, lower and higher than the target value.
Add random value for next step */
data lower higher;
set examplePop;
rnd = rand("uniform");
if score < &mid then output lower;
else output higher;
run;
/* Put both subsets into a random order */
proc sort data=lower out=lower(drop=rnd); by rnd; run;
proc sort data=higher out=higher(drop=rnd); by rnd; run;
/* Pick values from the subset that is on the opposite side from the target. If
the current mean is lower than the target, pick a value at random from the
the higher subset, and vice-versa. */
data want;
if sum < &mid * n then set higher;
else set lower;
output;
n + 1;
sum + score;
if n = &size then stop;
drop n sum;
run;
/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std
from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std
from want;
quit;
EDIT: Added comments.
PGStats presents an interesting algorithm, and I give him credit for originality. But let's be clear: the sampling scheme is a heuristic method that will often give good results, but is not guaranteed. At the very last step the algorithm might select an extreme outlier that renders the average value out of range. Still, it probably works for many real-life measurements, especially if the values are distributed symmetrically about the mean.
Better post it at IML forum, it is about data simulation. @Rick_SAS is there .
Mine looks like PG's code.
But neither mine nor PG's code can guarantee you to get what you want.
It is highly depended on your data.
data examplePop;
do member = 1 to 20000;
score = round(1000*rand("uniform"));
output;
end;
run;
proc sort data=examplepop;
by score;
run;
data want;
set examplepop;
array m{0:2817} _temporary_;
array s{0:2817} _temporary_;
i=mod(_n_,2818);
m{i}=member;
s{i}=score;
if 560<=mean(of s{*})<=565 and _n_ gt 2817 then do;
do j=0 to 2817;
member=m{j};
score=s{j};
output;
end;
stop;
end;
drop i j ;
run;
/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std
from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std
from want;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.