turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- How to random sample with desired aggregate statis...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

03-23-2017 09:35 PM - last edited on 03-23-2017 10:41 PM by Reeza

Hi, there.

I have large data containing 20,000 people's total test score.

ex) Total Score

member1 727

member2 566

member3 661

member4 525

member5 609

.......

I want to extract exactly 2,818 people's information.

And I also want to 2,818 people's average total score is between 560~565.

Any help and tips will be much appreciated !

Thanks, Jamie.

Accepted Solutions

Solution

03-24-2017
03:59 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to jamie0111

03-23-2017 11:10 PM - edited 03-24-2017 12:20 AM

You could balance your random sampling like this:

```
/* Generate some fake data */
data examplePop;
do member = 1 to 20000;
score = round(1000*rand("uniform"));
output;
end;
run;
%let mid=562.5;
%let size=2818;
/* Split the data into two subsets, lower and higher than the target value.
Add random value for next step */
data lower higher;
set examplePop;
rnd = rand("uniform");
if score < &mid then output lower;
else output higher;
run;
/* Put both subsets into a random order */
proc sort data=lower out=lower(drop=rnd); by rnd; run;
proc sort data=higher out=higher(drop=rnd); by rnd; run;
/* Pick values from the subset that is on the opposite side from the target. If
the current mean is lower than the target, pick a value at random from the
the higher subset, and vice-versa. */
data want;
if sum < &mid * n then set higher;
else set lower;
output;
n + 1;
sum + score;
if n = &size then stop;
drop n sum;
run;
/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std
from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std
from want;
quit;
```

EDIT: Added comments.

PG

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to jamie0111

03-23-2017 10:42 PM - edited 03-23-2017 10:44 PM

That's not the definition of a random sample

I think one of the quickest ways is to randomly sample, calculate the average, ensure it meets the criteria and then stop or repeat.

Or limit the range of scores allowed. Is it ok for one person to have a score of 825 and another 300 as long as the average is between 560-565? That's a tight range as well....

EDIT: PROC SURVEYSELECT is typically the method to implement a random sample in SAS.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Reeza

03-24-2017 03:59 AM

I think score between 560-565 sampling will be much better.

Then, how can I limit the range of score of sampling...?

Your help will be much appreciated!

Thanks, Jamie.

Solution

03-24-2017
03:59 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to jamie0111

03-23-2017 11:10 PM - edited 03-24-2017 12:20 AM

You could balance your random sampling like this:

```
/* Generate some fake data */
data examplePop;
do member = 1 to 20000;
score = round(1000*rand("uniform"));
output;
end;
run;
%let mid=562.5;
%let size=2818;
/* Split the data into two subsets, lower and higher than the target value.
Add random value for next step */
data lower higher;
set examplePop;
rnd = rand("uniform");
if score < &mid then output lower;
else output higher;
run;
/* Put both subsets into a random order */
proc sort data=lower out=lower(drop=rnd); by rnd; run;
proc sort data=higher out=higher(drop=rnd); by rnd; run;
/* Pick values from the subset that is on the opposite side from the target. If
the current mean is lower than the target, pick a value at random from the
the higher subset, and vice-versa. */
data want;
if sum < &mid * n then set higher;
else set lower;
output;
n + 1;
sum + score;
if n = &size then stop;
drop n sum;
run;
/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std
from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std
from want;
quit;
```

EDIT: Added comments.

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

03-24-2017 08:26 AM

PGStats presents an interesting algorithm, and I give him credit for originality. But let's be clear: the sampling scheme is a heuristic method that will often give good results, but is not guaranteed. At the very last step the algorithm might select an extreme outlier that renders the average value out of range. Still, it probably works for many real-life measurements, especially if the values are distributed symmetrically about the mean.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to jamie0111

03-23-2017 11:48 PM

Better post it at IML forum, it is about data simulation. @Rick_SAS is there .

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to jamie0111

03-24-2017 06:16 AM - edited 03-24-2017 06:33 AM

Mine looks like PG's code.

But neither mine nor PG's code can guarantee you to get what you want.

It is highly depended on your data.

```
data examplePop;
do member = 1 to 20000;
score = round(1000*rand("uniform"));
output;
end;
run;
proc sort data=examplepop;
by score;
run;
data want;
set examplepop;
array m{0:2817} _temporary_;
array s{0:2817} _temporary_;
i=mod(_n_,2818);
m{i}=member;
s{i}=score;
if 560<=mean(of s{*})<=565 and _n_ gt 2817 then do;
do j=0 to 2817;
member=m{j};
score=s{j};
output;
end;
stop;
end;
drop i j ;
run;
/* Check the mean and std dev from the initial population and the selected sample */
proc sql;
select "examplePop" , count(*) as n, mean(score) as mean, std(score) as std
from examplePop
union all
select "want", count(*) as n, mean(score) as mean, std(score) as std
from want;
quit;
```