Tpham
Quartz | Level 8

Hi there,

I am hoping to get some guidance from my fellow statisticians, because I can't seem to get this code to work correctly.

I am working on a project where I am looking at predictors of missed appointments. My data is in long format, where each record is one year in which a subject appeared in our clinic. So if a subject has been seen in our clinic for 5 years, he will have 5 records. This accounts for time-varying variables, such as lab values and medication usage. My outcome variable (y) is a proportion, calculated as the number of missed appointments in a year / total appointments that year (missed + not missed).

So my code is as follows (using sample data; there are a lot more records than this):

data fakedata;
input id y x year;
datalines;
1 0.50 1 2007
1 0 0 2008
1 1 1 2009
1 0.65 1 2010
1 0.856 1 2011
2 0.24 0 2008
2 1 0 2009
2 0.89 0 2010
3 0.36 1 2006
3 0 0 2007
3 0 1 2008
3 0 1 2009
3 0 1 2010
;
run;

proc genmod data=fakedata;
class year id ;
model y = x / dist=poisson link=log  offset=ln;
repeated subject=id;
run;

My first question: from reading up on modeling rates (http://support.sas.com/kb/24/188.html), it looks like I need to take the log of a variable and use it in the offset option. In the sample documentation I've linked, they took the log of the size of the population, which is not the case for my analysis. Therefore, I am unsure which variable I am supposed to take the log of to account for the rate/proportion in my outcome variable. Am I supposed to take the log of my outcome variable and include that in the offset option?

Secondly, I think I did the nesting/clustering correctly in PROC GENMOD. I took this from the SAS documentation (http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_genmod_sect0...). The documentation states it's for GEE models, so I am not 100% sure this is the right statement for me. I know there is an option to use a Poisson family with GEE, but I don't think I am looking at a population-average effect (which is what GEE estimates) in my case. So I am unsure whether using this option to account for nesting is correct. I only need to account for the within-individual correlation due to the nesting.


I am hoping someone can provide me with some guidance, so I can run this analysis correctly. I understand how to run it in STATA, but sadly I don't have that option.

Thank you so much in advance for your help.

9 REPLIES
1zmm
Quartz | Level 8

Modelling a rate with the Poisson distribution in PROC GENMOD requires both the numerator and the denominator of the proportion.

In your example, the variable, Y, already has been calculated as a proportion [=# of missed appointments/ (# of total appointments)].  You should take the natural logarithm of the denominator, the number of total appointments, as your offset and the number of missed appointments as your dependent variable.

However, because Poisson regression usually assumes a small numerator relative to a large denominator [and thus a small proportion close to zero], you might instead consider binomial regression with MODEL statement options of DIST=BINOMIAL and LINK=LOG without the need for an offset.
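To spell out why the log of the denominator serves as the offset, the rate model can be written out as below. The notation is my own and not taken from the SAS note: $\mu_{it}$ is the expected number of missed appointments for subject $i$ in year $t$, $n_{it}$ is that subject's total number of appointments in that year, and $x_{it}$ is the covariate.

```latex
\begin{aligned}
\log\!\left(\frac{\mu_{it}}{n_{it}}\right) &= \beta_0 + \beta_1 x_{it} \\
\log(\mu_{it}) &= \log(n_{it}) + \beta_0 + \beta_1 x_{it}
\end{aligned}
```

The term $\log(n_{it})$ enters the linear predictor with its coefficient fixed at 1, which is what the OFFSET= option does, so $\exp(\beta_1)$ is interpreted as a rate ratio.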

Tpham
Quartz | Level 8

Hello!

Thank you so much for your reply. I think I did it right now. Thanks for your suggestion on using binomial regression. I was planning on using negative binomial regression to account for the difference between the conditional mean and the conditional variance. If I were to do negative binomial regression (DIST=negbin), I assume I would offset it the same way as I do with Poisson (below) and use the same nesting command, right?

Is my code correct in terms of accounting for the rate and clustering now?

data fakedata;
input id x year missed total;
datalines;
1 1 2007 1 5
1 0 2008 0 2
1 1 2009 2 3
1 1 2010 4 5
1 1 2011 0 3
2 0 2008 4 9
2 0 2009 3 5
2 0 2010 0 9
3 1 2006 5 8
3 0 2007 7 5
3 1 2008 3 6
3 1 2009 7 12
3 1 2010 0 3
;
run;

data fake1data1;
set fakedata;
y=missed;
ln=log(total);
run;

proc genmod data=fake1data1;
class year id ;
model missed = x / dist=poisson link=log  offset=ln;
repeated subject=id;
run;

1zmm
Quartz | Level 8

Yes.

You can include the same MODEL statement option, OFFSET, and change the MODEL statement option, DIST, from DIST=POISSON to DIST=NEGBIN:

  model missed = x / dist=negbin link=log offset=ln;

However, your denominator, TOTAL, is not much larger than your numerator, MISSED, so the small-proportion assumption behind Poisson regression is doubtful.  I would also try binomial regression, without an offset:

   model missed/total = x / dist=binomial link=log;

Why have you omitted the classification variable, YEAR, from your model?
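If YEAR is meant to enter the model, a minimal sketch might look like the following. It assumes the data set fake1data1 and the offset variable ln from the post above, and it keeps the same REPEATED statement; treating YEAR as categorical rather than as a linear trend is my assumption, not something stated in the thread.

```sas
/* Sketch only: the Poisson rate model from the post above, with YEAR   */
/* added as a categorical covariate. Assumes fake1data1 already exists  */
/* with variables id, x, year, missed, and ln = log(total).             */
proc genmod data=fake1data1;
  class id year;          /* YEAR listed in CLASS, so it is categorical */
  model missed = x year / dist=poisson link=log offset=ln;
  repeated subject=id;    /* same clustering statement as before        */
run;
```

With only three subjects and six distinct years, the fake data are too sparse to give meaningful estimates for this model, so treat it as a template for the full data set.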

SteveDenham
Jade | Level 19

Why the log link rather than the canonical logit for a binomial?  I have seen this a couple of times lately, and figure there is something else I need to learn about generalized models.  Is it because the bulk of the observations are near the lower boundary?

Thanks,

Steve Denham


1zmm
Quartz | Level 8

Binomial regression with a log link (LINK=LOG) differs from logistic regression (LINK=LOGIT) in its measure of effect: binomial regression estimates the relative risk, the ratio of two probabilities, whereas logistic regression estimates the odds ratio, the ratio of two odds.  When the reference probability [=denominator of the relative risk] is large (say > 0.20), the odds ratio can be much larger or much smaller than the relative risk (lying further from the null value of 1.00).  For comparison of prevalences, this exaggeration of the odds ratio is somewhat disconcerting.  For example, if the prevalence in one group is 0.7, and the prevalence in another group is 0.4,

    the relative risk = 0.7/0.4 = 1.75, and

    the odds ratio = (0.7/0.3) / (0.4/0.6) = 3.50.

For the general public (who are not gamblers), it is also easier to explain probabilities and relative risks than odds and odds ratios.

The disadvantage of binomial regression is that it may sometimes produce estimated probabilities that exceed 1.00 for some observations, but methods have been developed to constrain these estimated probabilities so that they do not exceed 1.00.  With logistic regression, estimated probabilities can never exceed 1.00.
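The gap between the two measures in the 0.7 versus 0.4 example above follows from a simple identity. The symbols are my own notation: $p_1$ and $p_0$ are the probabilities in the comparison and reference groups.

```latex
\begin{aligned}
\mathrm{OR} &= \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
             = \frac{p_1}{p_0}\cdot\frac{1-p_0}{1-p_1}
             = \mathrm{RR}\cdot\frac{1-p_0}{1-p_1} \\
\text{with } p_1 = 0.7,\; p_0 = 0.4:\quad
\mathrm{OR} &= 1.75 \cdot \frac{0.6}{0.3} = 3.50
\end{aligned}
```

When both probabilities are small, the factor $(1-p_0)/(1-p_1)$ is close to 1 and the odds ratio approximates the relative risk; as the probabilities grow, the factor moves away from 1 and the two measures diverge.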

SteveDenham
Jade | Level 19

THANK YOU!  I have been stuck on relative risk estimation (as opposed to odds ratio) and how to get from one to the other since the late '90s.  Do you have a good reference on this (and especially on the constrained regression)?

Steve Denham

Rick_SAS
SAS Super FREQ

Based on 1zmm's explanation, I found some lecture notes from the biostats dept at U. MN (my former employer, Go Gophers!) that go into more detail and have some good examples that you can run in PROC GENMOD: http://www.biostat.umn.edu/~will/6470stuff/Fall-2008/Lect20/lecture20H.pdf

1zmm
Quartz | Level 8

Ref:  Deddens JA, Petersen MR, Lei X.  Estimation of prevalence ratios when PROC GENMOD does not converge.  SAS Users Group International 28, paper 270 at http://www2.sas.com/proceedings/sugi28/270-28.pdf.

This reference provides a method to estimate prevalence ratios and predicted probabilities within proper bounds with PROC GENMOD.  A later citation to this reference suggested that one could achieve the same results by using a FREQ statement instead of copying the observations multiple times.

Tpham
Quartz | Level 8

Thanks 1zmm.. The data I showed is fake data (I made it up on the fly). I included year to show the differences in the nesting I guess.

I am debating whether to add year to the model. I most likely will, since the work I am doing is in the HIV clinic. The guidelines on appointments remained the same throughout the study period, but I know that medication prescribing has changed over the years, which I would account for using my medication covariates.

Rick: thank you so much for sharing those notes.  Just skimming them real quick, they look helpful :)
