This is the problem, I have a list of numbers and I want to randomly put them into three buckets. This will be a no replacement. I want 50% of the numbers to go in bucket A, 30% to go into bucket B, and 20% to go in bucket C. From this I want to know the mean of each bucket, then run it 100 times and have a sheet that says the mean of each bucket for each run. I have no idea where to start. Please help.
Thank you!
I'm using SAS 9.4. This would certainly work in 9.2 :
data test;
call streaminit(876857);
do id = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
proc sql;
select
round(0.5*count(*)) as n50,
round(0.3*count(*)) as n30
into :n50, :n30
from test;
quit;
data sample;
set test;
do replicate = 1 to 100;
rnd = rand("UNIFORM");
output;
end;
run;
proc sort data=sample; by replicate rnd; run;
data sampleGroups;
set sample; by replicate;
if first.replicate then n = 0;
n + 1;
if n <= &n50 then bucket = "A";
else if n <= &n50 + &n30 then bucket = "B";
else bucket = "C";
drop n rnd;
run;
proc sql;
create table stats as
select replicate, bucket, mean(x) as meanX
from sampleGroups
group by replicate, bucket;
quit;
PG
Homework?
You can do this in Base SAS (DATA step + PROC MEANS) or in SAS/IML. By coincidence, those are the same percentages I use in my book Simulating Data with SAS.
To do the experiment one time, look at the article "Simulate categorical data in SAS".
To write this as a Monte Carlo simulation, look at the article "Simulation in SAS: The slow way or the BY way"
Anyone who has a copy of my book can look on p. 130-131 for an interesting twist to this problem.
Not homework, for actual work. I guess I'm not following on how to create multiple sets. I see how to create one set. Can you elaborate?
A stub of an example;
Data one two three;
/* get values from somewhere*/
if category=1 then output one;
else if category=2 then output two;
else if category=3 then output three;
run;
Sorry for the confusion. Somehow I misread the phrase "I have a list of numbers." You are doing resampling rather than pure simulation.
There is always an issue with questions like this: Do you want EXACTLY 50%, 30%, 20%, or do you want to sample with probability 0.5, 0.3, and 0.2. I assume the latter.
As PG says, the general tool for resampling is PROC SURVEYSELECT. For this simple case you could also do the sampling yourself by assigning the "buckets" in a DATA step:
data test;
keep x;
call streaminit(876857);
do i = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
/* randomly assign numbers to bins; repeat 100 times */
%let NumSamples = 100;
data Many;
set test;
do SampleID = 1 to &NumSamples; /* 1. duplicate sample */
bin = rand("Table", 0.5, 0.3, 0.2); /* assign to bin */
output;
end;
run;
/* prepare for BY-group analysis */
proc sort data=Many;
by SampleID;
run;
/* create "sheet that says the mean of each bucket for each run" */
proc means data=Many noprint;
by SampleID; /* 2. compute many statistics */
class bin;
var x;
output out=OutStats(where=(_TYPE_=1)) mean=SampleMean;
run;
proc print data=OutStats;
var SampleID bin _FREQ_ SampleMean;
run;
Oh, I did not mean probability. I meant if my list was 10 then A would have 5 of the numbers, B would have 3 of the numbers, and C would have the last two numbers in it.
I'm very sorry for the confusion.
If this was my list
DRG |
2 |
2 |
2 |
3 |
5 |
7 |
13 |
19 |
22 |
54 |
For example one run would have
A | B | C |
2 | 2 | 2 |
3 | 5 | 7 |
13 | 19 | |
22 | ||
54 |
Then I want to know the mean of each and say
Run Average | A | B | C |
Run 1 Average | 18.8 | 8.666667 | 4.5 |
I think Ricks code is correct, what if you had 20 observations, you'd want 10 in the first group, 6 in the second and 4 in the third. That's what the probability does in his code.
Dr. Wicklin's text, "Simulating Data Using SAS", is phenomenal!
I highly recommend it.
Brian
Use proc surveyselect with the GROUPS=() option :
/* Test data */
data test;
call streaminit(876857);
do id = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
/* Calculate the bucket sizes */
proc sql;
select
round(0.5*count(*)) as n50,
round(0.3*count(*)) as n30,
count(*) - calculated n50 - calculated n30 as n20
into :n50, :n30, :n20
from test;
quit;
/* Generate random groups, 100 times */
proc surveyselect data=test out=sample rep=100 groups=(&n50 &n30 &n20); run;
/* Calculate group means */
proc sql;
create table stats as
select replicate, groupID as bucket, mean(x) as meanX
from sample
group by replicate, groupID;
quit;
PG
PGStats
The groups option is causing a syntax error. It seems groups is not an option for this proc SAS/STAT(R) 9.2 User's Guide, Second Edition
It is possible it is another option?
Thank you so much.
I'm using SAS 9.4. This would certainly work in 9.2 :
data test;
call streaminit(876857);
do id = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
proc sql;
select
round(0.5*count(*)) as n50,
round(0.3*count(*)) as n30
into :n50, :n30
from test;
quit;
data sample;
set test;
do replicate = 1 to 100;
rnd = rand("UNIFORM");
output;
end;
run;
proc sort data=sample; by replicate rnd; run;
data sampleGroups;
set sample; by replicate;
if first.replicate then n = 0;
n + 1;
if n <= &n50 then bucket = "A";
else if n <= &n50 + &n30 then bucket = "B";
else bucket = "C";
drop n rnd;
run;
proc sql;
create table stats as
select replicate, bucket, mean(x) as meanX
from sampleGroups
group by replicate, bucket;
quit;
PG
I am considering using proc rank .
data class; set sashelp.class; ran=ranuni(0); keep ran; run; proc rank data=class out=temp groups=10; var ran; ranks rank; run; data want; merge temp(where=(rank in (0:4)) rename=(ran=A)) temp(where=(rank in (5:7)) rename=(ran=B)) temp(where=(rank in (8 9)) rename=(ran=C)) ; drop rank; run;
Xia Keshan
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.