Solved: How to random sample with desired aggregate statistics

jamie0111 · Posted 03-23-2017 09:35 PM

Hi, there.

I have large data containing 20,000 people's total test score.

ex) Total Score

member1 727

member2 566

member3 661

member4 525

member5 609

.......

I want to extract exactly 2,818 people's information.

And I also want to 2,818 people's average total score is between 560~565.

Any help and tips will be much appreciated !

Thanks, Jamie.

PGStats · Posted 03-23-2017 11:10 PM

You could balance your random sampling like this:

/* Generate some fake data */
data examplePop;
do member = 1 to 20000;
    score = round(1000*rand("uniform"));
    output;
    end;
run;

%let mid=562.5;
%let size=2818;

/* Split the data into two subsets, lower and higher than the target value.
 Add random value for next step */
data lower higher;
set examplePop;
rnd = rand("uniform");
if score < &mid then output lower;
else output higher;
run;

/* Put both subsets into a random order */
proc sort data=lower out=lower(drop=rnd); by rnd; run;
proc sort data=higher out=higher(drop=rnd); by rnd; run;

/* Pick values from the subset that is on the opposite side from the target. If
 the current mean is lower than the target, pick a value at random from the 
 the higher subset, and vice-versa. */
data want;
if sum < &mid * n then set higher;
else set lower;
output;
n + 1;
sum + score;
if n = &size then stop;
drop n sum;
run;

/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std 
    from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std 
    from want;
quit;

EDIT: Added comments.

PG

View solution in original post

Reeza · Posted 03-23-2017 10:42 PM

That's not the definition of a random sample 🙂

I think one of the quickest ways is to randomly sample, calculate the average, ensure it meets the criteria and then stop or repeat.

Or limit the range of scores allowed. Is it ok for one person to have a score of 825 and another 300 as long as the average is between 560-565? That's a tight range as well....

EDIT: PROC SURVEYSELECT is typically the method to implement a random sample in SAS.

jamie0111 · Posted 03-24-2017 03:59 AM

I think score between 560-565 sampling will be much better.

Then, how can I limit the range of score of sampling...?

Your help will be much appreciated!

Thanks, Jamie.

abc_xyz · Posted 08-01-2018 02:40 PM

I was also facing similar kind of issue(even though not exactly same) . Do you have anything in mind as such if is is okay for me to have one person with a score of 825 and another 300 as long as the average is between 560-565. Since same thing only will happen with me in most of the cases.

PGStats · Posted 03-23-2017 11:10 PM

You could balance your random sampling like this:

/* Generate some fake data */
data examplePop;
do member = 1 to 20000;
    score = round(1000*rand("uniform"));
    output;
    end;
run;

%let mid=562.5;
%let size=2818;

/* Split the data into two subsets, lower and higher than the target value.
 Add random value for next step */
data lower higher;
set examplePop;
rnd = rand("uniform");
if score < &mid then output lower;
else output higher;
run;

/* Put both subsets into a random order */
proc sort data=lower out=lower(drop=rnd); by rnd; run;
proc sort data=higher out=higher(drop=rnd); by rnd; run;

/* Pick values from the subset that is on the opposite side from the target. If
 the current mean is lower than the target, pick a value at random from the 
 the higher subset, and vice-versa. */
data want;
if sum < &mid * n then set higher;
else set lower;
output;
n + 1;
sum + score;
if n = &size then stop;
drop n sum;
run;

/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std 
    from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std 
    from want;
quit;

EDIT: Added comments.

PG

Rick_SAS · Posted 03-24-2017 08:26 AM

PGStats presents an interesting algorithm, and I give him credit for originality. But let's be clear: the sampling scheme is a heuristic method that will often give good results, but is not guaranteed. At the very last step the algorithm might select an extreme outlier that renders the average value out of range. Still, it probably works for many real-life measurements, especially if the values are distributed symmetrically about the mean.

Ksharp · Posted 03-23-2017 11:48 PM

Better post it at IML forum, it is about data simulation. @Rick_SAS is there .

Ksharp · Posted 03-24-2017 06:16 AM

Mine looks like PG's code.

But neither mine nor PG's code can guarantee you to get what you want.

It is highly depended on your data.

data examplePop;
do member = 1 to 20000;
    score = round(1000*rand("uniform"));
    output;
    end;
run;

proc sort data=examplepop;
 by score;
run;

data want;
 set examplepop;
 array m{0:2817} _temporary_;
 array s{0:2817} _temporary_;
 i=mod(_n_,2818);
 m{i}=member;
 s{i}=score;
 if 560<=mean(of s{*})<=565 and _n_ gt 2817 then do;
  do j=0 to 2817;
   member=m{j};
   score=s{j};
   output;
  end;
  stop;
 end;
 drop i j ;
run;



/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std 
    from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std 
    from want;
quit;

How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

Re: How to random sample with desired aggregate statistics

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away