05242017 10:51 AM
Hi all,
I need your advice urgently to calculate the ‘confidence intervals’ of an average proportion. Please refer to example Table 1 for the original data. My final table has to look like Table 2.
Table 1. displays the example of the calculation of the ‘average proportion’ of the population for ‘event 1’. The numerator is the count of ‘event1_of interest’ across the time periods (period 1 to period 4) divided by the ‘total count of the time periods’ for the id excluding the period with ‘missing value’ for the event1. After identifying the numerator and denominator for each id, I have to calculate the proportion for each id and then the ‘average proportion’ for the population which is displayed in Table 1.
I have to then estimate the ‘confidence intervals’ for the ‘average proportion’ for each event for the population and create Table 2.
I am looking to calculate the counts(numerator, denominator) and proportion using proc sql. How can I calculate the confidence intervals for the average proportion? Please let me know as soon as possible. Thanks.
Table 1 






ID 
Period1 
Period2 
Period3 
Period4 
Proportion of Event1_Time in target 
Proportion Calculation 
ID1 
Event1_of interest 
Missing 
Event1_not of interest 
Event_of interest 
= 2 / 3 
=0.66 
ID2 
Event1_not of interest 
Event1_not of interest 
Missing 
missing 
= 0 / 2 
=0 
ID3 
Event1_not of interest 
Event1_of interest 
Event1_of interest 
Event1_of interest 
= 3 / 4 
=0.75 
Average_Proportion 





=(0.66+0+0.75)/3=0.47

Table 2 



Average_Proportion 
95% Confidence Limit 
Event 

Event 1 
0.47 

Event 2 


Event 3 


Event 4 


05242017 11:49 AM
Are you sure you are calculating your "mean" correctly? Since each of the ID values has a different N of measurements, 3, 2 and 4 then averaging the id value means is similar to calculating overall mpg when you have 1 mile at 50mpg and 500 miles at 10 mpg and saying the average mpg is 30.
Plus you round too early.
Here's an approach that implements what you have requested.
data have; input id period value; label value='Event 1'; datalines; 1 1 1 1 2 . 1 3 0 1 4 1 2 1 0 2 2 0 2 3 . 2 4 . 3 1 0 3 2 1 3 3 1 3 4 1 ; run; proc summary data=have nway; class id; var value; output out=idsum mean=; run; proc means data=idsum mean lclm uclm; var value; run;
A value of 1 = 'of interest' 0='not of interest'.
Note the example of how to post data that someone can acually use to test code with.
I would not actually use this code due to the concern about your "mean" per Id. Also the sample size is so low (3) and range of responses so large (0 to .75) the confidence limits will be outside of plausible.
And since you didn't provide any idea of what event 2 or 3 or 4 may be dont' expect a result.
05242017 10:55 AM
Try PROC FREQ instead. Look at BINOMIAL option.
05242017 11:49 AM
Are you sure you are calculating your "mean" correctly? Since each of the ID values has a different N of measurements, 3, 2 and 4 then averaging the id value means is similar to calculating overall mpg when you have 1 mile at 50mpg and 500 miles at 10 mpg and saying the average mpg is 30.
Plus you round too early.
Here's an approach that implements what you have requested.
data have; input id period value; label value='Event 1'; datalines; 1 1 1 1 2 . 1 3 0 1 4 1 2 1 0 2 2 0 2 3 . 2 4 . 3 1 0 3 2 1 3 3 1 3 4 1 ; run; proc summary data=have nway; class id; var value; output out=idsum mean=; run; proc means data=idsum mean lclm uclm; var value; run;
A value of 1 = 'of interest' 0='not of interest'.
Note the example of how to post data that someone can acually use to test code with.
I would not actually use this code due to the concern about your "mean" per Id. Also the sample size is so low (3) and range of responses so large (0 to .75) the confidence limits will be outside of plausible.
And since you didn't provide any idea of what event 2 or 3 or 4 may be dont' expect a result.
05242017 01:32 PM
Hi @ballardw. Thanks for your reply.Yes, I should have known to display my data in a form that can be tested. Also, I realized after that proc sql will require me to transpose the data as count will not be applicable to multiple columns.
I wanted to avoid that. If possible, I would like to use the sum function. I anyways have to convert the values of period1 to period5 to numeric. I also have blanks in the data that have to be converted into missing numeric values. So on a clean slate here is my data and a data manipulation question as well. The values of interest to me are 'in_target'. The other values are 'low' and 'high' which are the 'event vaues not of interest'. 'missing' and blank values both need to be treated as 'missing'.
data have;
input id period1 period2 period3 period4 period5
datalines;
1 in_target
2 in_target missing low in_target
3 high high missing missing missing
4 low in_target in_target in_target in_target
run;
I am looking to convert the values of in_target to 1 , low and high as '0' ,missing and blank with a 'period'.
I used the array
array ch(*) $ period:;
array nu(*) periodnew:;
do i = 1 to dim(ch);
if ch(i)='in_target' then do;
nu(i)=1;
end;
else if ch(i)='low' then do;
nu(i)=0;
end;
else if ch(i)='high' then do;
nu(i)=0;
end;
else if ch(i)='missing' then do;
nu(i)=.;
end;
else if ch(i)=' ' then do;
nu(i)=.;
end;
end;
run;
I am getting an error stating
ERROR: Array subscript out of range at line # column #
Could you first please tell me what is wrong with my array code?
05242017 04:11 PM
Depending on what I wanted my data to look like I might try this to have the ID Period Eventvalue:
proc format library=work;
invalue event
'in_target'=1
'low','high'=0
other=.
;
run;
data have;
input id @;
do period = 1 to 5 ;
input event @;
output;
end;
input;
informat event event.;
datalines;
1 in_target . . . .
2 in_target missing . low in_target
3 high high missing missing missing
4 low in_target in_target in_target in_target
;
run;
Note that I placed . in the example data to simplify reading the datalines. Missing values are a real problem in datalines if no delimiter or fixed column read is done. Note that posting code in the code box using the forum {i} or run box works better. Also the main message box area will reformat text and has been know to convert some stuff to hidden (html tag perhaps) that interferes with datastep code.
Custom informats are a VERY handy tool for reading and converting known values early on in a process. I routinely use about 80 of these and often develop them for one shot projects to reduce horrific number of If/then/else or select blocks.
Or if you must maintain the multiple period variables:
data have; informat id best5. period1 period2 period3 period4 period5 event.; input id period1 period2 period3 period4 period5 ; datalines; 1 in_target . . . . 2 in_target missing . low in_target 3 high high missing missing missing 4 low in_target in_target in_target in_target ; run;
05242017 05:52 PM
Thanks @ballardw for the tips. Actually the array worked for me and I converted the character values to numeric after tweaking it.
I gave a value of :
'1' to the event1 value of interest (which is "in_target")
' 2' to those of event1 values of no interest and
a period for missing and blank values.
I am now concerned about calculating the average proportion for each event and then its confidence interval.
@Reeza , I am looking to use the proc freq with the binomial option. At present I have transposed my dataset and it looks like this:
Value '1' for variable "Event" is the event value fo interest.
ID 
Period 
Event 
1 
period1_event1 
1 
1 
period2_event1 
. 
1 
period3_event1 
. 
1 
period4_event1 
. 
2 
period1_event1 
2 
2 
period2_event1 
1 
2 
period3_event1 
. 
2 
period4_event1 
. 
3 
period1_event1 
1 
3 
period2_event1 
1 
3 
period3_event1 
. 
3 
period4_event1 
. 
Is this how I should be executing the proc freq with the binomial option?
proc freq data=have;
ods output Binomial=ci;
by id;
tables event / binomial;
run;
If 'yes' then I will be getting proportion and Upper Conf Limit and Lower Conf Limit by each id. Do I then extract all these estimates for each id from output dataset 'ci' and take its average for the population? Please advice.
Also, another way I apporached this is to calculate the numerator,denominator and proportion for each distint id and then calculate the average proportion for the population.
This is how my output looks for the population for event 1:
Table 1 








ID 
Period1 
Period2 
Period3 
Period4 
Numerator 
Denominator 
Proportion Calculation 
Average_proportion 
ID1 
in_target 
Missing 
low 
in_target 
2 
3 
=0.66 
(0.66+0+0.75)/3=0.47 
ID2 
low 
low 
Missing 
missing 
0 
2 
=0 

ID3 
low 
in_target 
in_target 
in_target 
3 
4 
=0.75 

Should I calculate the confidence intervals manually using
P = PERCENT/100 ;
N = COUNT + LAG(COUNT) ;
LB = P  ( 1.96*SQRT( P*(1P)/N ) ) ;
* reset lower bound to 0 if <0 ;
IF LB < 0 THEN LB = 0 ;
UB = P + ( 1.96*SQRT( P*(1P)/N ) ) ;
* reset upper bound to 1 if >1 ;
IF UB > 1 Then UB = 1 ;
What would be the most efficiat way to do this? Please reaply as soon as possible. Much thanks for your help.
05312017 12:18 PM
@ballardw Thanks much for the solution! I realized you had provided it already and it was vey helpful.
Need further help from the community? Please ask a new question.