BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dr2014
Quartz | Level 8

Hi all,

I need your advice urgently to calculate the ‘confidence intervals’ of an average proportion. Please refer to example Table 1 for the original data. My final table has to look like Table 2.

 

Table 1. displays the example of the calculation of the ‘average proportion’ of the population for ‘event 1’. The numerator is the count of ‘event1_of interest’ across the time periods (period 1 to period 4) divided by the ‘total count of the time periods’ for the id excluding the period with ‘missing value’ for the event1. After identifying the numerator and denominator for each id, I have to calculate the proportion for each id and then the ‘average proportion’ for the population which is displayed in Table 1.

 

I have to then estimate the ‘confidence intervals’ for the ‘average proportion’ for each event for the population and create Table 2.

 

I am looking to calculate the counts(numerator, denominator) and proportion using proc sql. How can I calculate the confidence intervals for the average proportion? Please let me know as soon as possible. Thanks.

 

Table 1

 

 

 

 

 

 

ID

Period1

Period2

Period3

Period4

Proportion of Event1_Time in target

Proportion Calculation

ID1

Event1_of interest

Missing

Event1_not of interest

Event_of interest

= 2 / 3

=0.66

ID2

Event1_not of interest

Event1_not of interest

Missing

missing

= 0 / 2

=0

ID3

Event1_not of interest

Event1_of interest

Event1_of interest

Event1_of interest

= 3 / 4

=0.75

Average_Proportion

 

 

 

 

 

=(0.66+0+0.75)/3=0.47

 

Table 2

 

 

 

Average_Proportion

95% Confidence Limit

Event

Event 1

0.47

 

Event 2

 

 

Event 3

 

 

Event 4

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Are you sure you are calculating your "mean" correctly? Since each of the ID values has a different N of measurements, 3, 2  and 4 then averaging the id value means is similar to calculating overall mpg when you have 1 mile at 50mpg and 500 miles at 10 mpg and saying the average mpg is 30.

 

Plus you round too early.

 

Here's an approach that implements what you have requested.

data have;
   input id period value;
   label value='Event 1';
datalines;
1 1 1
1 2 .
1 3 0
1 4 1
2 1 0
2 2 0
2 3 .
2 4 .
3 1 0
3 2 1
3 3 1
3 4 1
;
run;

proc summary data=have nway;
   class id;
   var value;
   output out=idsum mean=;
run;

proc means data=idsum mean lclm uclm;
   var value;
run;

A value of 1 = 'of interest' 0='not of interest'.

 

Note the example of how to post data that someone can acually use to test code with.

I would not actually use this code due to the concern about your "mean" per Id. Also the sample size is so low (3) and range of responses so large (0 to .75) the confidence limits will be outside of plausible.

 

And since you didn't provide any idea of what event 2 or 3 or 4 may be dont' expect a result.

View solution in original post

6 REPLIES 6
Reeza
Super User

Try PROC FREQ instead. Look at BINOMIAL option. 

 

 

ballardw
Super User

Are you sure you are calculating your "mean" correctly? Since each of the ID values has a different N of measurements, 3, 2  and 4 then averaging the id value means is similar to calculating overall mpg when you have 1 mile at 50mpg and 500 miles at 10 mpg and saying the average mpg is 30.

 

Plus you round too early.

 

Here's an approach that implements what you have requested.

data have;
   input id period value;
   label value='Event 1';
datalines;
1 1 1
1 2 .
1 3 0
1 4 1
2 1 0
2 2 0
2 3 .
2 4 .
3 1 0
3 2 1
3 3 1
3 4 1
;
run;

proc summary data=have nway;
   class id;
   var value;
   output out=idsum mean=;
run;

proc means data=idsum mean lclm uclm;
   var value;
run;

A value of 1 = 'of interest' 0='not of interest'.

 

Note the example of how to post data that someone can acually use to test code with.

I would not actually use this code due to the concern about your "mean" per Id. Also the sample size is so low (3) and range of responses so large (0 to .75) the confidence limits will be outside of plausible.

 

And since you didn't provide any idea of what event 2 or 3 or 4 may be dont' expect a result.

dr2014
Quartz | Level 8

Hi @ballardw. Thanks for your reply.Yes, I should have known to display my data in a form that can be tested. Also, I realized after that proc sql will require me to transpose the data as count will not be applicable to multiple columns.

 

I wanted to avoid that. If possible, I would like to use the sum function. I anyways have to convert the values of period1 to period5 to numeric. I also have blanks in the data that have to be converted into missing numeric values.  So on a clean slate here is my data and a data manipulation question as well. The values of interest to me are 'in_target'. The other values are 'low' and 'high' which are the 'event vaues not of interest'. 'missing' and blank values both need to be treated as 'missing'.

 

data have;
   input id period1  period2  period3 period4 period5

datalines;

1 in_target

2  in_target  missing                    low         in_target

3  high         high          missing   missing  missing             

4  low          in_target   in_target in_target in_target

run;

 

I am looking to convert the values of in_target to 1 , low and high as '0' ,missing and blank with a 'period'.

 

I used the array

array ch(*) $ period:;

array nu(*) periodnew:;

do i = 1 to dim(ch);

if ch(i)='in_target' then do;

nu(i)=1;

end;

else if ch(i)='low' then do;

nu(i)=0;

end;

else if ch(i)='high' then do;

nu(i)=0;

end;

else if ch(i)='missing' then do;

nu(i)=.;

end;

else if ch(i)=' ' then do;

nu(i)=.;

end;

end;

run;

 

I am getting an error stating

ERROR: Array subscript out of range at line # column #

 

Could you first please tell me what is wrong with my array code?

 

 

 

ballardw
Super User

Depending on what I wanted my data to look like I might try this to have the ID Period Eventvalue:

proc format library=work;
invalue event
'in_target'=1
'low','high'=0
other=.
;
run;

data have;
   input id @;
   do period = 1 to 5 ;
      input event @;
      output;
   end;
   input;
   informat event event.;
datalines;
1  in_target  .           .          .         .            
2  in_target  missing     .          low       in_target    
3  high       high         missing   missing   missing             
4  low        in_target   in_target  in_target in_target    
;
run;

Note that I placed . in the example data to simplify reading the datalines. Missing values are a real problem in datalines if no delimiter or fixed column read is done. Note that posting code in the code box using the forum {i} or run box works better. Also the main message box area will reformat text and has been know to convert some stuff to hidden (html tag perhaps) that interferes with datastep code.

 

 

Custom informats are a VERY handy tool for reading and converting known values early on in a process. I routinely use about 80 of these and often develop them for one shot projects to reduce horrific number of If/then/else or select blocks.

Or if you must maintain the multiple period variables:

data have;
  informat id best5. period1  period2  period3 period4 period5 event.;
  input id period1  period2  period3 period4 period5  ;
datalines;
1 in_target .         .         .         .            
2 in_target missing   .         low       in_target    
3 high      high      missing   missing   missing             
4 low       in_target in_target in_target in_target    
;
run;
dr2014
Quartz | Level 8

Thanks @ballardw for the tips. Actually the array worked for me and I converted the character values to numeric after tweaking it.

I gave a value of :

'1' to the event1 value of interest (which is "in_target")

' 2' to those of event1 values of no interest and

a period for missing and blank values. 

 

I am now concerned about calculating the average proportion for each event and then its confidence interval.

 

@Reeza , I  am looking to use the proc freq with the binomial option. At present I have transposed my dataset and it looks like this:

Value '1' for variable "Event" is the event value fo interest.

 

ID

Period

Event

1

period1_event1

1

1

period2_event1

.

1

period3_event1

.

1

period4_event1

.

2

period1_event1

2

2

period2_event1

1

2

period3_event1

.

2

period4_event1

.

3

period1_event1

1

3

period2_event1

1

3

period3_event1

.

3

period4_event1

.

 

Is this how I should be executing the proc freq with the binomial option?

 

proc freq data=have;

     ods output Binomial=ci;

     by id;

     tables event / binomial;

run;



If 'yes' then I will be getting proportion and Upper Conf Limit and Lower Conf Limit by each id. Do I then extract all these estimates for each id from output dataset 'ci' and take its average for the population? Please advice.



Also, another way I apporached this is to calculate the numerator,denominator and proportion for each distint id and then calculate the average proportion for the population. 



This is how my output looks for the population for event 1:

Table 1

 

 

 

 

 

 

 

 

ID

Period1

Period2

Period3

Period4

Numerator

Denominator

Proportion Calculation

Average_proportion

ID1

in_target

Missing

low

in_target

2

3

=0.66

(0.66+0+0.75)/3=0.47

ID2

low

low

Missing

missing

0

2

=0

 

ID3

low

in_target

in_target

in_target

3

4

=0.75

 

Should I calculate the confidence intervals manually using

P = PERCENT/100 ;

N = COUNT + LAG(COUNT) ;

LB = P - ( 1.96*SQRT( P*(1-P)/N ) ) ;

* reset lower bound to 0 if <0 ;

IF LB < 0 THEN LB = 0 ;

UB = P + ( 1.96*SQRT( P*(1-P)/N ) ) ;

* reset upper bound to 1 if >1 ;

IF UB > 1 Then UB = 1 ;

 

What would be the most efficiat way to do this? Please reaply as soon as possible. Much thanks for your help.

dr2014
Quartz | Level 8

@ballardw Thanks much for the solution!  I realized you had provided it already and it was vey helpful.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 2670 views
  • 0 likes
  • 3 in conversation