This is the problem, I have a list of numbers and I want to randomly put them into three buckets. This will be a no replacement. I want 50% of the numbers to go in bucket A, 30% to go into bucket B, and 20% to go in bucket C. From this I want to know the mean of each bucket, then run it 100 times and have a sheet that says the mean of each bucket for each run. I have no idea where to start. Please help.
Thank you!
I'm using SAS 9.4. This would certainly work in 9.2 :
data test;
call streaminit(876857);
do id = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
proc sql;
select
round(0.5*count(*)) as n50,
round(0.3*count(*)) as n30
into :n50, :n30
from test;
quit;
data sample;
set test;
do replicate = 1 to 100;
rnd = rand("UNIFORM");
output;
end;
run;
proc sort data=sample; by replicate rnd; run;
data sampleGroups;
set sample; by replicate;
if first.replicate then n = 0;
n + 1;
if n <= &n50 then bucket = "A";
else if n <= &n50 + &n30 then bucket = "B";
else bucket = "C";
drop n rnd;
run;
proc sql;
create table stats as
select replicate, bucket, mean(x) as meanX
from sampleGroups
group by replicate, bucket;
quit;
PG
Homework?
You can do this in Base SAS (DATA step + PROC MEANS) or in SAS/IML. By coincidence, those are the same percentages I use in my book Simulating Data with SAS.
To do the experiment one time, look at the article "Simulate categorical data in SAS".
To write this as a Monte Carlo simulation, look at the article "Simulation in SAS: The slow way or the BY way"
Anyone who has a copy of my book can look on p. 130-131 for an interesting twist to this problem.
Not homework, for actual work. I guess I'm not following on how to create multiple sets. I see how to create one set. Can you elaborate?
A stub of an example;
Data one two three;
/* get values from somewhere*/
if category=1 then output one;
else if category=2 then output two;
else if category=3 then output three;
run;
Sorry for the confusion. Somehow I misread the phrase "I have a list of numbers." You are doing resampling rather than pure simulation.
There is always an issue with questions like this: Do you want EXACTLY 50%, 30%, 20%, or do you want to sample with probability 0.5, 0.3, and 0.2. I assume the latter.
As PG says, the general tool for resampling is PROC SURVEYSELECT. For this simple case you could also do the sampling yourself by assigning the "buckets" in a DATA step:
data test;
keep x;
call streaminit(876857);
do i = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
/* randomly assign numbers to bins; repeat 100 times */
%let NumSamples = 100;
data Many;
set test;
do SampleID = 1 to &NumSamples; /* 1. duplicate sample */
bin = rand("Table", 0.5, 0.3, 0.2); /* assign to bin */
output;
end;
run;
/* prepare for BY-group analysis */
proc sort data=Many;
by SampleID;
run;
/* create "sheet that says the mean of each bucket for each run" */
proc means data=Many noprint;
by SampleID; /* 2. compute many statistics */
class bin;
var x;
output out=OutStats(where=(_TYPE_=1)) mean=SampleMean;
run;
proc print data=OutStats;
var SampleID bin _FREQ_ SampleMean;
run;
Oh, I did not mean probability. I meant if my list was 10 then A would have 5 of the numbers, B would have 3 of the numbers, and C would have the last two numbers in it.
I'm very sorry for the confusion.
If this was my list
DRG |
2 |
2 |
2 |
3 |
5 |
7 |
13 |
19 |
22 |
54 |
For example one run would have
A | B | C |
2 | 2 | 2 |
3 | 5 | 7 |
13 | 19 | |
22 | ||
54 |
Then I want to know the mean of each and say
Run Average | A | B | C |
Run 1 Average | 18.8 | 8.666667 | 4.5 |
I think Ricks code is correct, what if you had 20 observations, you'd want 10 in the first group, 6 in the second and 4 in the third. That's what the probability does in his code.
Dr. Wicklin's text, "Simulating Data Using SAS", is phenomenal!
I highly recommend it.
Brian
Use proc surveyselect with the GROUPS=() option :
/* Test data */
data test;
call streaminit(876857);
do id = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
/* Calculate the bucket sizes */
proc sql;
select
round(0.5*count(*)) as n50,
round(0.3*count(*)) as n30,
count(*) - calculated n50 - calculated n30 as n20
into :n50, :n30, :n20
from test;
quit;
/* Generate random groups, 100 times */
proc surveyselect data=test out=sample rep=100 groups=(&n50 &n30 &n20); run;
/* Calculate group means */
proc sql;
create table stats as
select replicate, groupID as bucket, mean(x) as meanX
from sample
group by replicate, groupID;
quit;
PG
PGStats
The groups option is causing a syntax error. It seems groups is not an option for this proc SAS/STAT(R) 9.2 User's Guide, Second Edition
It is possible it is another option?
Thank you so much.
I'm using SAS 9.4. This would certainly work in 9.2 :
data test;
call streaminit(876857);
do id = 1 to 64;
x = rand("NORMAL");
output;
end;
run;
proc sql;
select
round(0.5*count(*)) as n50,
round(0.3*count(*)) as n30
into :n50, :n30
from test;
quit;
data sample;
set test;
do replicate = 1 to 100;
rnd = rand("UNIFORM");
output;
end;
run;
proc sort data=sample; by replicate rnd; run;
data sampleGroups;
set sample; by replicate;
if first.replicate then n = 0;
n + 1;
if n <= &n50 then bucket = "A";
else if n <= &n50 + &n30 then bucket = "B";
else bucket = "C";
drop n rnd;
run;
proc sql;
create table stats as
select replicate, bucket, mean(x) as meanX
from sampleGroups
group by replicate, bucket;
quit;
PG
I am considering using proc rank .
data class; set sashelp.class; ran=ranuni(0); keep ran; run; proc rank data=class out=temp groups=10; var ran; ranks rank; run; data want; merge temp(where=(rank in (0:4)) rename=(ran=A)) temp(where=(rank in (5:7)) rename=(ran=B)) temp(where=(rank in (8 9)) rename=(ran=C)) ; drop rank; run;
Xia Keshan
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.