Solved: Re: Calculate the confidence intervals for the average proportion

dr2014 · Posted 05-24-2017 10:51 AM

Hi all,

I need your advice urgently to calculate the ‘confidence intervals’ of an average proportion. Please refer to example Table 1 for the original data. My final table has to look like Table 2.

Table 1. displays the example of the calculation of the ‘average proportion’ of the population for ‘event 1’. The numerator is the count of ‘event1_of interest’ across the time periods (period 1 to period 4) divided by the ‘total count of the time periods’ for the id excluding the period with ‘missing value’ for the event1. After identifying the numerator and denominator for each id, I have to calculate the proportion for each id and then the ‘average proportion’ for the population which is displayed in Table 1.

I have to then estimate the ‘confidence intervals’ for the ‘average proportion’ for each event for the population and create Table 2.

I am looking to calculate the counts(numerator, denominator) and proportion using proc sql. How can I calculate the confidence intervals for the average proportion? Please let me know as soon as possible. Thanks.

Table 1
ID	Period1	Period2	Period3	Period4	Proportion of Event1_Time in target	Proportion Calculation
ID1	Event1_of interest	Missing	Event1_not of interest	Event_of interest	= 2 / 3	=0.66
ID2	Event1_not of interest	Event1_not of interest	Missing	missing	= 0 / 2	=0
ID3	Event1_not of interest	Event1_of interest	Event1_of interest	Event1_of interest	= 3 / 4	=0.75
Average_Proportion						=(0.66+0+0.75)/3=0.47

Table 2
	Average_Proportion	95% Confidence Limit
Event
Event 1	0.47
Event 2
Event 3
Event 4

ballardw · Posted 05-24-2017 11:49 AM

Are you sure you are calculating your "mean" correctly? Since each of the ID values has a different N of measurements, 3, 2 and 4 then averaging the id value means is similar to calculating overall mpg when you have 1 mile at 50mpg and 500 miles at 10 mpg and saying the average mpg is 30.

Plus you round too early.

Here's an approach that implements what you have requested.

data have;
   input id period value;
   label value='Event 1';
datalines;
1 1 1
1 2 .
1 3 0
1 4 1
2 1 0
2 2 0
2 3 .
2 4 .
3 1 0
3 2 1
3 3 1
3 4 1
;
run;

proc summary data=have nway;
   class id;
   var value;
   output out=idsum mean=;
run;

proc means data=idsum mean lclm uclm;
   var value;
run;

A value of 1 = 'of interest' 0='not of interest'.

Note the example of how to post data that someone can acually use to test code with.

I would not actually use this code due to the concern about your "mean" per Id. Also the sample size is so low (3) and range of responses so large (0 to .75) the confidence limits will be outside of plausible.

And since you didn't provide any idea of what event 2 or 3 or 4 may be dont' expect a result.

View solution in original post

Reeza · Posted 05-24-2017 10:55 AM

Try PROC FREQ instead. Look at BINOMIAL option.

ballardw · Posted 05-24-2017 11:49 AM

Are you sure you are calculating your "mean" correctly? Since each of the ID values has a different N of measurements, 3, 2 and 4 then averaging the id value means is similar to calculating overall mpg when you have 1 mile at 50mpg and 500 miles at 10 mpg and saying the average mpg is 30.

Plus you round too early.

Here's an approach that implements what you have requested.

data have;
   input id period value;
   label value='Event 1';
datalines;
1 1 1
1 2 .
1 3 0
1 4 1
2 1 0
2 2 0
2 3 .
2 4 .
3 1 0
3 2 1
3 3 1
3 4 1
;
run;

proc summary data=have nway;
   class id;
   var value;
   output out=idsum mean=;
run;

proc means data=idsum mean lclm uclm;
   var value;
run;

A value of 1 = 'of interest' 0='not of interest'.

Note the example of how to post data that someone can acually use to test code with.

I would not actually use this code due to the concern about your "mean" per Id. Also the sample size is so low (3) and range of responses so large (0 to .75) the confidence limits will be outside of plausible.

And since you didn't provide any idea of what event 2 or 3 or 4 may be dont' expect a result.

dr2014 · Posted 05-24-2017 01:32 PM

Hi @ballardw. Thanks for your reply.Yes, I should have known to display my data in a form that can be tested. Also, I realized after that proc sql will require me to transpose the data as count will not be applicable to multiple columns.

I wanted to avoid that. If possible, I would like to use the sum function. I anyways have to convert the values of period1 to period5 to numeric. I also have blanks in the data that have to be converted into missing numeric values. So on a clean slate here is my data and a data manipulation question as well. The values of interest to me are 'in_target'. The other values are 'low' and 'high' which are the 'event vaues not of interest'. 'missing' and blank values both need to be treated as 'missing'.

data have;
input id period1 period2 period3 period4 period5

datalines;

1 in_target

2 in_target missing low in_target

3 high high missing missing missing

4 low in_target in_target in_target in_target

run;

I am looking to convert the values of in_target to 1 , low and high as '0' ,missing and blank with a 'period'.

I used the array

array ch(*) $ period:;

array nu(*) periodnew:;

do i = 1 to dim(ch);

if ch(i)='in_target' then do;

nu(i)=1;

end;

else if ch(i)='low' then do;

nu(i)=0;

end;

else if ch(i)='high' then do;

nu(i)=0;

end;

else if ch(i)='missing' then do;

nu(i)=.;

end;

else if ch(i)=' ' then do;

nu(i)=.;

end;

run;

I am getting an error stating

ERROR: Array subscript out of range at line # column #

Could you first please tell me what is wrong with my array code?

ballardw · Posted 05-24-2017 04:11 PM

Depending on what I wanted my data to look like I might try this to have the ID Period Eventvalue:

proc format library=work;
invalue event
'in_target'=1
'low','high'=0
other=.
;
run;

data have;
   input id @;
   do period = 1 to 5 ;
      input event @;
      output;
   end;
   input;
   informat event event.;
datalines;
1  in_target  .           .          .         .            
2  in_target  missing     .          low       in_target    
3  high       high         missing   missing   missing             
4  low        in_target   in_target  in_target in_target    
;
run;

Note that I placed . in the example data to simplify reading the datalines. Missing values are a real problem in datalines if no delimiter or fixed column read is done. Note that posting code in the code box using the forum {i} or run box works better. Also the main message box area will reformat text and has been know to convert some stuff to hidden (html tag perhaps) that interferes with datastep code.

Custom informats are a VERY handy tool for reading and converting known values early on in a process. I routinely use about 80 of these and often develop them for one shot projects to reduce horrific number of If/then/else or select blocks.

Or if you must maintain the multiple period variables:

data have;
  informat id best5. period1  period2  period3 period4 period5 event.;
  input id period1  period2  period3 period4 period5  ;
datalines;
1 in_target .         .         .         .            
2 in_target missing   .         low       in_target    
3 high      high      missing   missing   missing             
4 low       in_target in_target in_target in_target    
;
run;

dr2014 · Posted 05-24-2017 05:52 PM

Thanks @ballardw for the tips. Actually the array worked for me and I converted the character values to numeric after tweaking it.

I gave a value of :

'1' to the event1 value of interest (which is "in_target")

' 2' to those of event1 values of no interest and

a period for missing and blank values.

I am now concerned about calculating the average proportion for each event and then its confidence interval.

@Reeza , I am looking to use the proc freq with the binomial option. At present I have transposed my dataset and it looks like this:

Value '1' for variable "Event" is the event value fo interest.

ID	Period	Event
1	period1_event1	1
1	period2_event1	.
1	period3_event1	.
1	period4_event1	.
2	period1_event1	2
2	period2_event1	1
2	period3_event1	.
2	period4_event1	.
3	period1_event1	1
3	period2_event1	1
3	period3_event1	.
3	period4_event1	.

Is this how I should be executing the proc freq with the binomial option?

proc freq data=have;

ods output Binomial=ci;

by id;

tables event / binomial;

run;

If 'yes' then I will be getting proportion and Upper Conf Limit and Lower Conf Limit by each id. Do I then extract all these estimates for each id from output dataset 'ci' and take its average for the population? Please advice.

Also, another way I apporached this is to calculate the numerator,denominator and proportion for each distint id and then calculate the average proportion for the population.

This is how my output looks for the population for event 1:

Table 1
ID	Period1	Period2	Period3	Period4	Numerator	Denominator	Proportion Calculation	Average_proportion
ID1	in_target	Missing	low	in_target	2	3	=0.66	(0.66+0+0.75)/3=0.47
ID2	low	low	Missing	missing	0	2	=0
ID3	low	in_target	in_target	in_target	3	4	=0.75

Should I calculate the confidence intervals manually using

P = PERCENT/100 ;

N = COUNT + LAG(COUNT) ;

LB = P - ( 1.96*SQRT( P*(1-P)/N ) ) ;

* reset lower bound to 0 if <0 ;

IF LB < 0 THEN LB = 0 ;

UB = P + ( 1.96*SQRT( P*(1-P)/N ) ) ;

* reset upper bound to 1 if >1 ;

IF UB > 1 Then UB = 1 ;

What would be the most efficiat way to do this? Please reaply as soon as possible. Much thanks for your help.

dr2014 · Posted 05-31-2017 12:18 PM

@ballardw Thanks much for the solution! I realized you had provided it already and it was vey helpful.