Solved: Re: Simulating conditional probabilities in SAS

SimRock · Posted 05-06-2017 04:58 PM

Hi,

I would like to simulate two events A and B knowing their conditional probabilities as follow

P(A) =0.2

P(B)=0.3

P(A and B) =0.1

I remember that P(A or B) = P(A) + P(B) - P(A and B)

How do I incorporate the probability of the joint event (A and B) in the simulation?

Thank you

Rick_SAS · Posted 05-08-2017 09:29 AM

Sorry @Astounding, your code is fine. I missed it because I was studying @mkeintz's simulation from the joint probability.

My preference would be to modify Astounding's code slightly to use RAND, but the logic is the same:

data Cond;
do i = 1 to 1000;
   A = rand("Bernoulli", 0.2);
   if A then   B = rand("Bernoulli", 0.5);
   else        B = rand("Bernoulli", 0.25);
   output;
end;

proc freq data=Cond; /* check empirical distribution */
table A B A*B;
run;

View solution in original post

Astounding · Posted 05-06-2017 07:15 PM

You don't. You just assign A and B randomly, and the distribution should work itself out. For example:

data want;

set have;

A_flag = (ranuni(12345) < 0.2);

B_flag = (ranuni(67893) < 0.3);

run;

Since the flags are assigned randomly, their joint distribution should match your expectations. And you can change the initial seeds if you need additional simulations. To check the distributions:

proc freq data=want;

tables a_flag * b_flag / list;

run;

Ksharp · Posted 05-06-2017 10:47 PM

Your question is ambiguous.

Give an example to explain your data simulation question .

mkeintz · Posted 05-07-2017 04:51 PM

Your formula is not exactly relevant to the task your describe. From the infor you provided, you want to determine the probabiliy of (1) A only, (2) B only, (3) neither.

Given P(A)=.2, P(B)=.3, and P(A&B)=.1, you know how to fill out a 2*2 crosstab (A=0 or 1 by B=0 or 1). Let's say you have 100 observations then your table has the following starting points, given the probabilities you cite:

| A=0 A=1 | total

------------------------------------

B=0 | |

B=1 | 10 | 30

-----------------------------------

total | 20 | 100

So you can conclude that

P(A=1,B=0) = P(A) - P(A&B) = .2 - .1 = .1 (N=10)
P(B=0,B=1) = P(B) - P(A&B) = .3 - .1 = .2 (N=20)

P(A=0,B=0) = 1 - P(A&B) - P(A=1,B=0) - P(A=0,B=1) = 1 - .1 - .1 - .2 = .6

generating this table with the italicized values calculated per above

| A=0 A=1 | total

------------------------------------

B=0 | 60 10 | 70

B=1 | 20 10 | 30

-----------------------------------

total | 80 20 | 100

You can now map the range of 0-1 to the above 4 cell, as in

[0.0-0.6] ==> A=0,B=0
(0.6-0.7] ==> A=1,B=0

(0.7-0.9] ==> A=0,B=1

(0.9-1.0] ==> A=1,B=1

This program uses that non-independent table to generate 100 observations using the association you describe:

data t;
  retain P_AandB .1  P_A .2  P_B .3
         P_Aonly P_Bonly P_neither . ;

  if _N_=1 then do;
    P_Aonly =P_A - P_AandB;
    P_Bonly =P_B - P_AandB;
    P_neither  = 1 - P_AandB - P_Aonly - P_Bonly;	
	put (p_:) (=);
  end;

  samp_size=100;

  do N=1 to samp_size;
    A=0; B=0;
    u=ranuni(10598156);
	if      U <= P_neither then;
	else if U >P_neither + P_Aonly + P_Bonly then do; A=1; B=1; end;
	else if U <= P_neither + P_Aonly then A=1;
	else B=1;
	output;
  end;
run;

proc freq; table A*B;run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Astounding · Posted 05-07-2017 07:59 PM

I'm going to amend my earlier answer. Although much of the problem is still not clear (do you even have a data set to work with?), the basic principles and math apply in any case.

If pA=0.2 and pB=0.3, their joint A&B distribution would be 0.06 if selected randomly. Therefore, to raise the hit rate to 0.1 from 0.06, pB needs to be higher when A=1. Since pB is fixed at 0.3, however, pB needs to be lower when A=0. The math is actually pretty easy ... you can see the probabilities that I used here:

data want;

set have;

do replication = 1 to 10;

A_flag = (ranuni(12345) < 0.2);

if A_flag=1 then B_flag = (ranuni(67893) < 0.5);

else B_flag = (ranuni(97531) < 0.25);

output;

end;

run;

If you have an original data set (and it's not clear that you do), this gives you 10 versions of the data set with different A and B values, where each version conforms to the conditions you specified. "Conforms" means on average ... each replication may be slightly off.

s_lassen · Posted 05-08-2017 04:42 AM

Astounding,

A comment to your second answer, using 3 different calls of RANUNI with 3 different seeds:

It is only possible to set the seed for the RANUNI (and other functions in the same family) once in each datastep. The seed is set in the first call to RANUNI, and the compiler disregards the parameters for the later calls. You can try for yourself:

data test;
  do _N_=1 to 10;
  a=ranuni(12345);
  b=ranuni(45678);
  output;
  end;
run;
data test2;
  set test;
  a2=ranuni(12345);
  b2=ranuni(4);
run;

With this program you will see that not only are all the A2 values equal to the A values, the B2 values are allso all equal to the B values. On the other hand, if you switch the two lines in the final datastep:

data test2;
  set test;
  b2=ranuni(4);
  a2=ranuni(12345);
run;

You will see that not only are all the B2 values different from the B values, the A2 values differ from the A values as well.

Regards,

Søren

Rick_SAS · Posted 05-08-2017 07:46 AM

For more information about Soren's response. see "Random number seeds: Only the first seed matters."

Astounding · Posted 05-08-2017 08:10 AM

Thanks. You learn something new every day (at least on a good day).

Regarding my solution, it should work regardless ... as long as the random number generated changes each time RANUNI is called.

Regarding the original problem, it's still not clear whether it is a programming problem or an algebra problem (why did I use 0.5 and 0.25, for example). We'll have to wait to find out what is really needed here.

Rick_SAS · Posted 05-08-2017 08:48 AM

The title of this post uses the term "conditional probability," but your question (which needs to be clarified) does not use the conditional probability. If you are interested in conditional probabilities, use

P(B|A) = P(A and B) / P(A) = 0.1 / 0.2 = 0.5

But if the simulation process is "draw A, then draw B conditional on A," you need knowledge of P(B| ^A).

You can review the rules of probability at

http://stattrek.com/probability/probability-rules.aspx?Tutorial=AP

and the rules of conditional probability at

https://en.wikipedia.org/wiki/Conditional_probability

Astounding · Posted 05-08-2017 09:09 AM

Rick, while I agree, I thought it was reasonable to back into the p(B|^A) by looking at the condition that the overall p(B) was 0.3. That's how I came up with the 0.25. If I'm proven wrong, it won't be the first time.

Rick_SAS · Posted 05-08-2017 09:29 AM

Sorry @Astounding, your code is fine. I missed it because I was studying @mkeintz's simulation from the joint probability.

My preference would be to modify Astounding's code slightly to use RAND, but the logic is the same:

data Cond;
do i = 1 to 1000;
   A = rand("Bernoulli", 0.2);
   if A then   B = rand("Bernoulli", 0.5);
   else        B = rand("Bernoulli", 0.25);
   output;
end;

proc freq data=Cond; /* check empirical distribution */
table A B A*B;
run;

Classroom Training Available!