Solved: Re: My poisson probabilities are not summing up to 1

john1111 · Posted 12-11-2018 08:40 AM

I have written a procedure that uses a specific estimate to compute Poisson probabilities for values between 0 to 80.

But at specific values (5, 12, 20, 30, 45, 50, 60), the procedure for computing the probability is a little different as in the code.

DATA xxxx2;
  SET final_sim_data;
  estimate=3.99793;
  IF x1~=1 or x2~=1 or x3~=1 then DELETE;
run;

data pred; 
set xxxx2;
  array v[11];
  array n[11];
  array sf[11];
  array t[11];
  array f[11];
  array a[11];
  array w[11];
  do i = 1 to 11;
	n[i] = i + 14;
   end;
  do i = 1 to 11;
	sf[i] = i + 44;
   end;
  do i = 1 to 11;
     v[i] = i + 54;
   end;
  do i = 1 to 11;
     t[i] = i + 24;
   end;
  do i = 1 to 11;
     f[i] = i + 39;
   end;
  do i = 1 to 11;
     a[i] = i - 1;
   end;
  do i = 1 to 11;
     w[i] = i + 6;
   end;
do count_y= 0 to 80;
if  count_y ~= 5 and count_y ~=12 and count_y ~=20 and count_y ~=30 and count_y ~=45 and count_y ~=50 and count_y ~=60 then 
probability_FP= pdf("poisson",count_y,exp(estimate));
else if count_y= 20 then do;
    p = 0;
     do i = 1 to 6;
      p = p + ((n[i]-15)/5)*(pdf("poisson",n[i],exp(estimate)));
     end;
     do i = 7 to 11;
      p = p + ((25-n[i])/5)*(pdf("poisson",n[i],exp(estimate)));
     end;
     probability_FP=p;
    end;
else if count_y= 60 then do;
     sum = 0;
     do i = 1 to 6;
      sum = sum + ((v[i]-55)/5)*(pdf("poisson",v[i],exp(estimate)));
     end;
     do i = 7 to 11;
      sum = sum + ((65-v[i])/5)*(pdf("poisson",v[i],exp(estimate)));
     end;
     probability_FP=sum;
  end;
else if count_y= 30 then do;
    th = 0;
     do i = 1 to 6;
      th = th + ((t[i]-25)/5)*(pdf("poisson",t[i],exp(estimate)));
     end;
     do i = 7 to 11;
      th = th + ((35-t[i])/5)*(pdf("poisson",t[i],exp(estimate)));
     end;
     probability_FP=th;
    end;
else if count_y= 50 then do;
    sd = 0;
     do i = 1 to 6;
      sd = sd + ((sf[i]-45)/5)*(pdf("poisson",sf[i],exp(estimate)));
     end;
     do i = 7 to 11;
      sd = sd + ((55-sf[i])/5)*(pdf("poisson",sf[i],exp(estimate)));
     end;
     probability_FP=sd;
    end;
else if count_y= 12 then do;
    bb = 0;
     do i = 1 to 6;
      bb = bb + ((w[i]-7)/5)*(pdf("poisson",w[i],exp(estimate)));
     end;
     do i = 7 to 11;
      bb = bb + ((17-w[i])/5)*(pdf("poisson",w[i],exp(estimate)));
     end;
     probability_FP=bb;
    end;
else if count_y= 5 then do;
    aa = 0;
     do i = 1 to 6;
      aa = aa + ((a[i]-0)/5)*(pdf("poisson",a[i],exp(estimate)));
     end;
     do i = 7 to 11;
      aa = aa + ((10-a[i])/5)*(pdf("poisson",a[i],exp(estimate)));
     end;
     probability_FP=aa;
    end;
else if count_y= 45 then do;
    fr = 0;
     do i = 1 to 6;
      fr = fr + ((f[i]-40)/5)*(pdf("poisson",f[i],exp(estimate)));
     end;
     do i = 7 to 11;
      fr = fr + ((50-f[i])/5)*(pdf("poisson",f[i],exp(estimate)));
     end;
     probability_FP=fr;
    end;
output; 
end;
keep count_y probability_FP;
run;



/********************/
proc means data=pred;
class count_y;
output out=means mean(probability_FP)=;
run;

1. I am sure the code is so manual (I mean it could be shortened) especially because I am averaging the same value in the last proc code. Is there a way I can shorten everything so that I avoid the proc means code?

2. After getting the probabilities for count_y = 0,1,2...80. I notice that the probabilities are not summing to 1. What could be the problem, and Is there a way out.

FreelanceReinh · Posted 12-11-2018 11:50 AM

Hello @john1111,

You received good advice from @RW9 as to how you could simplify your code. It turns out that dataset PRED (the new version without redundant duplicate observations) can be created with this much shorter code:

data pred; 
set xxxx2(obs=1 keep=estimate);
do count_y=0 to 80;
  if count_y in (5, 12, 20, 30, 45, 50, 60) then do;
    probability_FP=0;
    do i=-4 to 4;
      probability_FP+(5-abs(i))/5*pdf("poisson",count_y+i,exp(estimate));
    end;
  end;
  else probability_FP=pdf("poisson",count_y,exp(estimate));
  output; 
end;
keep count_y probability_FP;
run;

Note that I do use the SET statement from your first approach (in order to avoid hardcoding the value of ESTIMATE), but enhanced with the appropriate dataset options so that only variable ESTIMATE from one observation of dataset XXXX2 is used.

Now it's fairly obvious why the probabilities are not summing up to 1: For the "exceptional" COUNT_Y values (5, 12, etc.) the probabilities that would have been used otherwise (see the case i=0 in my code!) are substantially increased by eight additional terms (see cases i=-4, ..., -1, 1, ..., 4). For example, for COUNT_Y=50 the calculation amounts to:

data _null_;
estimate=3.99793;
x+0.2*pdf("poisson",46,exp(estimate))
 +0.4*pdf("poisson",47,exp(estimate))
 +0.6*pdf("poisson",48,exp(estimate))
 +0.8*pdf("poisson",49,exp(estimate))
 +1.0*pdf("poisson",50,exp(estimate))
 +0.8*pdf("poisson",51,exp(estimate))
 +0.6*pdf("poisson",52,exp(estimate))
 +0.4*pdf("poisson",53,exp(estimate))
 +0.2*pdf("poisson",54,exp(estimate));
put x=;
run;

The result is approximately 0.2266, which replaces pdf("poisson",50,exp(3.99793))=0.0465... in the overall sum, which would be cdf('poisson',80,exp(3.99793))=0.999532... without the exceptions.

Look at the bar plot created by the code below to see what happened:

proc sgplot data=pred;
vbar count_y / response=probability_FP;
run;

View solution in original post

RW9 · Posted 12-11-2018 08:56 AM

Well, a brief look at the code, and I can see a lot which can be narrowed down. Then first set of loops for instance, can be replaced with;

do i = 1 to 11;
  n[i] = i + 14;
  sf[i] = i + 44;
  v[i] = i + 54;
  t[i] = i + 24;
  f[i] = i + 39;
  a[i] = i - 1;
  w[i] = i + 6;
end;

The if could be resolved to:

select(county);
  when (20) do;
    ...;
    end;
  when (60) do;
    ...;
    end;
  otherwise probability_fp=pdf(...);
end;

The do loop blocks within the if statements are repeating. Put your values in another array, then do the loop once, e.g:

array vals{5,2} (15,25,
                          55,65,
                           ...);        /* First column is lower, second upper */
array p_res{5,2} 8.;         /* Array of results for p based

select(county);
  when (20) do;
    p=0;
    do i=1 to 6;
      p{i,1}=p{i,1}+((n{i}-vals{i,1})/5)*pdf...;
    end;
    do i=1 to 7;
      p{i,1}=p{i,1}+((vals{i,2}-n{i})/5)*pdf...;
    end;
   end;
...

Do note however that a simple change to your data structure will simplify your coding a lot. Have one row per set of tests, so all v is on one row, all n on another etc. This will simplify down to be one array, for each observation. Also note I have nothing to work with in terms of test data in the form of a datastep, or what its supposed to do, so just generaralising.

I don't know about he stat question.

john1111 · Posted 12-11-2018 10:09 AM

I just realized that there is no need for setting any data and therefore there is no need for the data;

I have also set a specific estimate value as below;

data pred; 
  array v[11];
  array n[11];
  array sf[11];
  array t[11];
  array f[11];
  array a[11];
  array w[11];
  do i = 1 to 11;
	n[i] = i + 14;
   end;
  do i = 1 to 11;
	sf[i] = i + 44;
   end;
  do i = 1 to 11;
     v[i] = i + 54;
   end;
  do i = 1 to 11;
     t[i] = i + 24;
   end;
  do i = 1 to 11;
     f[i] = i + 39;
   end;
  do i = 1 to 11;
     a[i] = i - 1;
   end;
  do i = 1 to 11;
     w[i] = i + 6;
   end;
do count_y= 0 to 80;
if  count_y ~= 5 and count_y ~=12 and count_y ~=20 and count_y ~=30 and count_y ~=45 and count_y ~=50 and count_y ~=60 then 
probability_FP= pdf("poisson",count_y,exp(3.99793));
else if count_y= 20 then do;
    p = 0;
     do i = 1 to 6;
      p = p + ((n[i]-15)/5)*(pdf("poisson",n[i],exp(3.99793)));
     end;
     do i = 7 to 11;
      p = p + ((25-n[i])/5)*(pdf("poisson",n[i],exp(3.99793)));
     end;
     probability_FP=p;
    end;
else if count_y= 60 then do;
     sum = 0;
     do i = 1 to 6;
      sum = sum + ((v[i]-55)/5)*(pdf("poisson",v[i],exp(3.99793)));
     end;
     do i = 7 to 11;
      sum = sum + ((65-v[i])/5)*(pdf("poisson",v[i],exp(3.99793)));
     end;
     probability_FP=sum;
  end;
else if count_y= 30 then do;
    th = 0;
     do i = 1 to 6;
      th = th + ((t[i]-25)/5)*(pdf("poisson",t[i],exp(3.99793)));
     end;
     do i = 7 to 11;
      th = th + ((35-t[i])/5)*(pdf("poisson",t[i],exp(3.99793)));
     end;
     probability_FP=th;
    end;
else if count_y= 50 then do;
    sd = 0;
     do i = 1 to 6;
      sd = sd + ((sf[i]-45)/5)*(pdf("poisson",sf[i],exp(3.99793)));
     end;
     do i = 7 to 11;
      sd = sd + ((55-sf[i])/5)*(pdf("poisson",sf[i],exp(3.99793)));
     end;
     probability_FP=sd;
    end;
else if count_y= 12 then do;
    bb = 0;
     do i = 1 to 6;
      bb = bb + ((w[i]-7)/5)*(pdf("poisson",w[i],exp(3.99793)));
     end;
     do i = 7 to 11;
      bb = bb + ((17-w[i])/5)*(pdf("poisson",w[i],exp(3.99793)));
     end;
     probability_FP=bb;
    end;
else if count_y= 5 then do;
    aa = 0;
     do i = 1 to 6;
      aa = aa + ((a[i]-0)/5)*(pdf("poisson",a[i],exp(3.99793)));
     end;
     do i = 7 to 11;
      aa = aa + ((10-a[i])/5)*(pdf("poisson",a[i],exp(3.99793)));
     end;
     probability_FP=aa;
    end;
else if count_y= 45 then do;
    fr = 0;
     do i = 1 to 6;
      fr = fr + ((f[i]-40)/5)*(pdf("poisson",f[i],exp(3.99793)));
     end;
     do i = 7 to 11;
      fr = fr + ((50-f[i])/5)*(pdf("poisson",f[i],exp(3.99793)));
     end;
     probability_FP=fr;
    end;
output; 
end;
keep count_y probability_FP;
run;

So the idea is that I should get probabilities from 0 to 80 but at points (5,12,20,30,45,50,60) the probability is computed as below;

As an example for 20 (estimate in the code is the theta below);

I applied the same idea to 5,12,30,45,50,60 as shown in the code

FreelanceReinh · Posted 12-11-2018 11:50 AM

Hello @john1111,

You received good advice from @RW9 as to how you could simplify your code. It turns out that dataset PRED (the new version without redundant duplicate observations) can be created with this much shorter code:

data pred; 
set xxxx2(obs=1 keep=estimate);
do count_y=0 to 80;
  if count_y in (5, 12, 20, 30, 45, 50, 60) then do;
    probability_FP=0;
    do i=-4 to 4;
      probability_FP+(5-abs(i))/5*pdf("poisson",count_y+i,exp(estimate));
    end;
  end;
  else probability_FP=pdf("poisson",count_y,exp(estimate));
  output; 
end;
keep count_y probability_FP;
run;

Note that I do use the SET statement from your first approach (in order to avoid hardcoding the value of ESTIMATE), but enhanced with the appropriate dataset options so that only variable ESTIMATE from one observation of dataset XXXX2 is used.

Now it's fairly obvious why the probabilities are not summing up to 1: For the "exceptional" COUNT_Y values (5, 12, etc.) the probabilities that would have been used otherwise (see the case i=0 in my code!) are substantially increased by eight additional terms (see cases i=-4, ..., -1, 1, ..., 4). For example, for COUNT_Y=50 the calculation amounts to:

data _null_;
estimate=3.99793;
x+0.2*pdf("poisson",46,exp(estimate))
 +0.4*pdf("poisson",47,exp(estimate))
 +0.6*pdf("poisson",48,exp(estimate))
 +0.8*pdf("poisson",49,exp(estimate))
 +1.0*pdf("poisson",50,exp(estimate))
 +0.8*pdf("poisson",51,exp(estimate))
 +0.6*pdf("poisson",52,exp(estimate))
 +0.4*pdf("poisson",53,exp(estimate))
 +0.2*pdf("poisson",54,exp(estimate));
put x=;
run;

The result is approximately 0.2266, which replaces pdf("poisson",50,exp(3.99793))=0.0465... in the overall sum, which would be cdf('poisson',80,exp(3.99793))=0.999532... without the exceptions.

Look at the bar plot created by the code below to see what happened:

proc sgplot data=pred;
vbar count_y / response=probability_FP;
run;

My poisson probabilities are not summing up to 1

Re: My poisson probabilities are not summing up to 1

Re: My poisson probabilities are not summing up to 1

Re: My poisson probabilities are not summing up to 1

Re: My poisson probabilities are not summing up to 1

Click image to register for webinar

Classroom Training Available!