First of all , it is NOT a question for a help.
Last night, I watched an intesting vedio .
A magician hold a black package in which there are 24 balls, 8 is red, 8 is blue, 8 is yellow.
He asked an audience to pull out 12 balls from his black package.
If the color balls conform to 3:4:5 ,then audience should pay him 10 dollars .
If it was others, he pay this audience 10 dollars .
The intesting thing happened, most of audience obtained 3:4:5 ,and this magician earn many money.
Why? In general, I would expect that probability should be 4:4:4 ,since it was randomly picked up .
Just to confirm this result , I wrote some IML code to simulate many times to see how the proability distributed. This code is motivated by @Rick_SAS blog(which conform to Hypergeometric distribution):
https://blogs.sas.com/content/iml/2015/09/30/balls-and-urns1.html
https://blogs.sas.com/content/iml/2015/10/14/sim-2x2-model.html
proc iml;
call randseed(123);
/* simulate 1000000 times */
n=1000000;
want=j(n,3,.);
do i=1 to n;
/*24 balls in the black package.8 is red,8 is blue,8 is yellow*/
expect={8 8 8}; sum=sum(expect);
/*pull out 12 balls*/
want_sum=12;
/*get the red balls*/
if want_sum=0 then want[i,1]=0;
else do;
call randgen(x, "Hyper", sum, expect[1], want_sum);
want[i,1]= x;
sum=sum-expect[1];
want_sum=want_sum-x;
end;
/*get the blue balls*/
if want_sum=0 then want[i,2]=0;
else do;
call randgen(x, "Hyper", sum, expect[2], want_sum);
want[i,2]= x;
sum=sum-expect[2];
want_sum=want_sum-x;
end;
/*get the yellow balls*/
if want_sum=0 then want[i,3]=0;
else do;
call randgen(x, "Hyper", sum, expect[3], want_sum);
want[i,3]= x;
sum=sum-expect[3];
want_sum=want_sum-x;
end;
end;
create want from want;
append from want;
close;
quit;
data want2;
set want;
length flag $ 8;
call sort(of _numeric_);
flag=catx(':',of _numeric_);
run;
proc freq data=want2 order=freq;
table flag/plots=freqplot(orient=horizontal scale=percent);
run;
After simulating one million times , I got this graph. You can see 3:4:5 occurred almost 50 percent ,that is a big huge hit.
Hello @Ksharp,
I think the key point here is that the outcome "3:4:5" is actually "3:4:5 or 3:5:4 or 4:3:5 or 4:5:3 or 5:3:4 or 5:4:3". While each of these six events is less likely to occur than "4:4:4" -- which is the most probable combination (so your intuition is right) --, the sum of those six probabilities (hence the probability of obtaining a 3:4:5 ratio, regardless of which colors correspond to "3", "4" and "5"), 6*0.0811787...=0.487072..., is much greater than the probability of the single event "4:4:4", 0.126841....
The probabilities can be computed easily from the multivariate hypergeometric distribution:
%let K1=8;
%let K2=8;
%let K3=8;
%let n=12;
data prob(drop=i j);
length ratio $11;
do k1=0 to &K1;
do k2=0 to &K2;
do k3=0 to &K3;
if sum(of k:)=&n then do;
i=min(of k:);
j=max(of k:);
ratio=catx(':',i,&n-i-j,j);
p=comb(&K1,k1)*comb(&K2,k2)*comb(&K3,k3)/comb(&K1+&K2+&K3,&n);
output;
end;
end;
end;
end;
run;
proc summary data=prob nway;
class ratio;
var p;
output out=want(drop=_:) sum=;
run;
proc print data=want;
sum p;
run;
Result:
Obs ratio p 1 0:4:8 0.00016 2 0:5:7 0.00099 3 0:6:6 0.00087 4 1:3:8 0.00099 5 1:4:7 0.00994 6 1:5:6 0.02783 7 2:2:8 0.00087 8 2:3:7 0.02783 9 2:4:6 0.12177 10 2:5:5 0.09741 11 3:3:6 0.09741 12 3:4:5 0.48707 13 4:4:4 0.12684 ======= 1.00000
Hello @Ksharp,
I think the key point here is that the outcome "3:4:5" is actually "3:4:5 or 3:5:4 or 4:3:5 or 4:5:3 or 5:3:4 or 5:4:3". While each of these six events is less likely to occur than "4:4:4" -- which is the most probable combination (so your intuition is right) --, the sum of those six probabilities (hence the probability of obtaining a 3:4:5 ratio, regardless of which colors correspond to "3", "4" and "5"), 6*0.0811787...=0.487072..., is much greater than the probability of the single event "4:4:4", 0.126841....
The probabilities can be computed easily from the multivariate hypergeometric distribution:
%let K1=8;
%let K2=8;
%let K3=8;
%let n=12;
data prob(drop=i j);
length ratio $11;
do k1=0 to &K1;
do k2=0 to &K2;
do k3=0 to &K3;
if sum(of k:)=&n then do;
i=min(of k:);
j=max(of k:);
ratio=catx(':',i,&n-i-j,j);
p=comb(&K1,k1)*comb(&K2,k2)*comb(&K3,k3)/comb(&K1+&K2+&K3,&n);
output;
end;
end;
end;
end;
run;
proc summary data=prob nway;
class ratio;
var p;
output out=want(drop=_:) sum=;
run;
proc print data=want;
sum p;
run;
Result:
Obs ratio p 1 0:4:8 0.00016 2 0:5:7 0.00099 3 0:6:6 0.00087 4 1:3:8 0.00099 5 1:4:7 0.00994 6 1:5:6 0.02783 7 2:2:8 0.00087 8 2:3:7 0.02783 9 2:4:6 0.12177 10 2:5:5 0.09741 11 3:3:6 0.09741 12 3:4:5 0.48707 13 4:4:4 0.12684 ======= 1.00000
Looks like we went the same path 😄
Just to be clear: probability is not my super-power but:
Ain't that case of multivariate hypergeometric distribution?
https://en.wikipedia.org/wiki/Hypergeometric_distribution#Multivariate_hypergeometric_distribution
and the fact that:
P(3,4,5) is independent from P(3,5,4), etc,
P(3,4,5 or 3,5,4) = P(3,4,5) + P(3,5,4) - P(3,4,5 and 3,5,4) = P(3,4,5) + P(3,5,4)
so probability P("I have 3:4:5 sequence") would be just sum of P(3,4,5) + P(3,5,4) + ...+ P(5,4,3).
data have;
input grp $ r g b;
cards;
a 4 4 4
b 3 4 5
b 3 5 4
b 4 3 5
b 4 5 3
b 5 3 4
b 5 4 3
;
run;
/*
a = get all 4
b = get 3:4:5
*/
data want;
rn=8;
gn=8;
bn=8;
N=rn+gn+bn;
set have;
k=r+g+b;
p = /* https://en.wikipedia.org/wiki/Hypergeometric_distribution#Multivariate_hypergeometric_distribution */
(comb(rn, r)
*comb(gn, g)
*comb(bn, b)
)
/comb(n, k)
;
put (_ALL_) (=/);
if _N_=2 then sum=0;
sum+p;
run;
proc print;
run;
[EDIT:] proc print:
Bart
The key is that you are pulling WITHOUT REPLACEMENT, so the conditional probabilities change as you pull. The expected ratio is only 4:4:4 is you pull with replacement. You can do a simpler analysis of 6 balls (3R, 3B, 3Y) and pulling 3 balls to see that the ratios are not even. If I did the analysis correctly, in that situation there is a 70% chance that the ratio is 2:1:0 and only a 30% chance that the ratio is 1:1:1.
I think you can perform the simulation more concisely if you use the SAMPLE function to pull the samples and use the TABULATE subroutine to analyze the ratio:
proc iml;
call randseed(123);
/* simulate 1000000 times */
n=1000000;
/* create binary variable equal to 1 if the ratio is 3:4:5 */
is_345 = j(n,1,0);
set = repeat('R',8) // repeat('G',8) // repeat('Y',8);
draws = sample(set, 12 //n, "NoReplace"); /* draw n samples of size 12 */
do i = 1 to n;
call tabulate(labels, freq, draws[i,]);
if ncol(freq)=3 then
is_345[i] = all(freq={3 4 5}) | all(freq={3 5 4}) |
all(freq={4 3 5}) | all(freq={5 3 4}) |
all(freq={4 5 3}) | all(freq={5 4 3}) ;
end;
/* estimate proportion of draws that result in 3:4:5 */
prop_345 = mean(is_345);
print prop_345;
In this simulation, the result is 0.4867 of the draws have ratio 3:4:5. Since this is less than 50%, the magician must have been using "magic" (that is, cheating) to win money.
If someone doesn't have IML, just for fun, 4GL version (there is a bunch of code commented out from my try-and-errors for optimisation, for those who like to test):
/* arrays simulation */
options mprint;
data _null_;
array bwb[24] _temporary_ (8*(1 2 3)); /* template box with balls */
array b[24] _temporary_; /* box from which we are selecting */
array r[3] ; /* couter for selected color 1, 2, 3 */
call streaminit(42); /* set seed = 42 */
array x[0:12,0:12,0:12] _temporary_; /* array for results, 0=color was never selected */
do iter=0 to 1e6-1; /* iterations */
/* populate b with balls */
/*
do n=1 to 24;
b[n]=bwb[n];
end;
*//*
call pokelong (
put (PEEKCLONG (ADDRLONG(bwb[1]), 192), $192.)
, ADDRLONG(b[1])
, 192);
*//**/
%macro loop(n);
%do n=1 %to &n.;
b[&n.]=bwb[&n.];
%end;
%mend loop;
%loop(24)
/**/
r1=0;r2=0;r3=0; /* initiate counters */
/*
do j=24 to 13 by -1;
i = RAND("integer",1,j);
r[b[i]]+1;
b[i]=b[j];
end;
*/
%macro loop2(n,m);
%do n=&n. %to %eval(&n.-&m.+1) %by -1; /* select 12 from 24 with no repetitions */
i = RAND("integer",1,&n.); /* get random number from 1-24, 1-23, 1-22, etc. */
r[b[i]]+1; /* increase count of selected ball */
b[i]=b[&n.]; /* fill gap in "i" (after selected ball) with last ball ("n") */
%end;
%mend loop2;
%loop2(24,12)
call sortn(of r[*]); /* order counters */
x[r1,r2,r3]+1; /* increase selected "set" */
end;
/* output generated data */
do r1=0 to 12;
do r2=0 to 12;
do r3=0 to 12;
if x[r1,r2,r3] then /* print only "materialized" cases */
do;
n=x[r1,r2,r3]/iter;
output;
put @2 r1= @12 r2= @22 r3= @32 n= best32.20;
end;
end;
end;
end;
keep r: n;
run;
Result for seed=42 and 1e9 iterations:
/* 1e9 r1=0 r2=4 r3=8 n=0.000155304 r1=0 r2=5 r3=7 n=0.000993489 r1=0 r2=6 r3=6 n=0.000869773 r1=1 r2=3 r3=8 n=0.000994072 r1=1 r2=4 r3=7 n=0.009935598 r1=1 r2=5 r3=6 n=0.027825111 r1=2 r2=2 r3=8 n=0.000869628 r1=2 r2=3 r3=7 n=0.027822129 r1=2 r2=4 r3=6 n=0.121775441 r1=2 r2=5 r3=5 n=0.097433405 r1=3 r2=3 r3=6 n=0.097416671 r1=3 r2=4 r3=5 n=0.487068081 r1=4 r2=4 r3=4 n=0.126841298 NOTE: The data set WORK.FREQCOUNT has 13 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 11:56.18 cpu time 11:56.14 */
Bart
So looking at the computed probability for 3:4:5 which is 0.487, in the long run the magician should lose money looking for 3:4:5, as the rest of the possibilities add to greater than 0.5 and so are more likely.
Right. You can see this effect even for two classes. and even for sampling with replacement. Flip a fair coin 4 times and look at the ratios of heads to tails.
So, even in that simple probability space, the unequal ratio is higher than you might initially think.
Here I would like to use PROC SURVEYSELECT to mimic SAMPLE() funtion in SAS/IML for sombody was not familiar with IML syntax.
data have;
do i=1 to 8;
ball='Y';output;
ball='B';output;
ball='R';output;
end;
drop i;
run;
proc surveyselect data=have out=temp seed=123 sampsize=12 reps=1000000;
run;
proc freq data=temp noprint;
table Replicate*ball/out=temp2 sparse list;
run;
proc sort data=temp2;
by Replicate count;
run;
data want;
length flag $ 8;
do until(last.Replicate);
set temp2;
by Replicate;
flag=catx(':',flag,count);
end;
keep Replicate flag;
run;
proc freq data=want order=freq;
table flag/plots=freqplot(orient=horizontal scale=percent);
run;
Not the same thing but, it's my favorite "audience bet".
Playig backgammon, and rolling two dice, rolling "doubles" is advantageious (1:1, 2:2, ... 6:6).
Chances of doubles on a single roll is 1 out of 6. I claim that my opponent is exceedingly lucky and if he rolls the dice only 5 times, he will roll doubles at some point.
The math is much easier than the question with three colors of balls.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!