05-14-2013 10:16 PM
I am hoping to get some guidance from my fellow statisticians, because I can't seem to get this code to work correctly.
I am working on a project where I am looking at predictors of missed appointments. My data is set up in long format, where each record is one year in which a subject appeared in our clinic. So if a subject has been seen in our clinic for 5 years, he will have 5 records. This accounts for time-varying variables, such as lab values and medication usage. My outcome variable (y) is a proportion, calculated as the number of missed appointments in a year / total appointments that year (missed + not missed).
My code is as follows (using sample data; there are a lot more records than this):
data fakedata;
  input id y x year;
  datalines;
1 0.50 1 2007
1 0 0 2008
1 1 1 2009
1 0.65 1 2010
1 0.856 1 2011
2 0.24 0 2008
2 1 0 2009
2 0.89 0 2010
3 0.36 1 2006
3 0 0 2007
3 0 1 2008
3 0 1 2009
3 0 1 2010
;
run;
proc genmod data=fakedata;
  class year id;
  model y = x / dist=poisson link=log offset=ln;
run;
My first question: from reading up on modeling rates (http://support.sas.com/kb/24/188.html), it looks like I need to take the log of a variable and use it in the offset option. In the sample documentation I've linked, they took the log of the size of the population, which is not the case for my analysis. Therefore, I am unsure which variable I am supposed to take the log of to account for the rate/proportion in my outcome variable. Am I supposed to take the log of my outcome variable and include that in the offset option?
Secondly, I think I did the nesting/clustering correctly in the proc genmod procedure. I took this from the SAS documentation (http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_genmod_sect0...). The documentation states it's for GEE models, so I am not 100% sure this is the right statement for me. I know there is an option to use a Poisson family with GEE, but I don't think I am looking at a population average (which is what GEE gives) in my case. So I am unsure whether using this option to account for nesting is correct. I only want to account for the within-individual correlation due to the nesting.
I am hoping someone can provide me with some guidance, so I can run this analysis correctly. I understand how to run it in STATA, but sadly I don't have that option.
Thank you so much in advance for your help.
05-15-2013 09:42 AM
Modelling the Poisson distribution in PROC GENMOD requires both the numerator and the denominator of the proportion.
In your example, the variable, Y, already has been calculated as a proportion [=# of missed appointments/ (# of total appointments)]. You should take the natural logarithm of the denominator, the number of total appointments, as your offset and the number of missed appointments as your dependent variable.
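A minimal sketch of that setup, assuming one row per subject-year. The data set name APPTS and the variable names MISSED, TOTAL, and LN_TOTAL are hypothetical, and the counts are made up purely for illustration:

```sas
/* Illustrative data: one row per subject-year (made-up counts) */
data appts;
  input id x year missed total;
  ln_total = log(total);   /* natural log of the denominator, used as the offset */
  datalines;
1 1 2007 1 5
1 0 2008 0 2
2 0 2008 4 9
2 0 2009 3 5
3 1 2006 5 8
3 1 2009 7 12
;
run;

/* Poisson rate model: the count is the response, log(total) is the offset */
proc genmod data=appts;
  model missed = x / dist=poisson link=log offset=ln_total;
run;
```

With the log link and this offset, the exponentiated coefficient of X is interpreted as a rate ratio for missed appointments per scheduled appointment.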
However, because Poisson regression usually assumes a small numerator relative to a large denominator [and thus a small proportion close to zero], you might instead consider binomial regression with MODEL statement options of DIST=BINOMIAL and LINK=LOG without the need for an offset.
05-15-2013 02:30 PM
Thank you so much for your reply. I think I did it right now. Thanks for your suggestion on using binomial regression. I was planning on using negative binomial regression to account for the difference between the conditional mean and the conditional variance. If I were to do negative binomial regression (DIST=negbin), I assume I would offset it the same way as I do with Poisson (below) and use the same nesting command, right?
Is my code correct in terms of accounting for the rate and clustering now?
data fake1data1;
  input id x year missed total;
  ln = log(total);  /* offset: natural log of total appointments */
  datalines;
1 1 2007 1 5
1 0 2008 0 2
1 1 2009 2 3
1 1 2010 4 5
1 1 2011 0 3
2 0 2008 4 9
2 0 2009 3 5
2 0 2010 0 9
3 1 2006 5 8
3 0 2007 7 5
3 1 2008 3 6
3 1 2009 7 12
3 1 2010 0 3
;
run;
proc genmod data=fake1data1;
  class year id;
  model missed = x / dist=poisson link=log offset=ln;
run;
05-15-2013 04:42 PM
You can include the same MODEL statement option, OFFSET, and change the MODEL statement option, DIST, from DIST=POISSON to DIST=NEGBIN:
model missed = x / dist=negbin link=log offset=ln;
However, your denominator, TOTAL, is not large relative to your numerator, MISSED, so the proportions are not small. I would also try binomial regression, without an offset:
model missed/total = x / dist=binomial link=log;
Why have you omitted the classification variable, YEAR, from your model?
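If YEAR is meant to enter the model and the within-subject correlation is to be handled through GEE, one possible sketch follows. It assumes the data set FAKE1DATA1 posted above is available; the derived data set name FAKE1DATA1_LN, the REPEATED statement, and the exchangeable working correlation (TYPE=EXCH) are assumptions for illustration, not something confirmed in this thread. Note that GEE yields population-averaged estimates.

```sas
/* Assumes FAKE1DATA1 (id, x, year, missed, total) already exists */
data fake1data1_ln;
  set fake1data1;
  ln = log(total);   /* offset: natural log of total appointments */
run;

proc genmod data=fake1data1_ln;
  class id year;
  /* YEAR enters as a categorical covariate */
  model missed = x year / dist=poisson link=log offset=ln;
  /* GEE: repeated records clustered within subject,
     exchangeable working correlation (an assumption) */
  repeated subject=id / type=exch;
run;
```

With only a handful of records per year, as in the sample data, the YEAR effects may be poorly estimated; the full data set would be needed to judge whether this specification is reasonable.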
05-16-2013 10:04 AM
Why the log link rather than the canonical logit for a binomial? I have seen this a couple of times lately, and figure there is something else I need to learn about generalized models. Is it because the bulk of the observations are near the lower boundary?
Message was edited by: Steve Denham
05-16-2013 10:31 AM
Binomial regression (LINK=LOG) differs from logistic regression (LINK=LOGIT) in that binomial regression uses as its measure of effect the relative risk, the ratio of two probabilities, and that logistic regression uses as its measure of effect the odds ratio, the ratio of two odds. When the reference probability [=denominator of the relative risk] is larger (say > 0.20), then the odds ratio can be much larger or much smaller than the relative risk (deviating away from the null value of 1.00). For comparison of prevalences, this exaggeration of the odds ratio is somewhat disconcerting. For example, if the prevalence in one group is 0.7, and the prevalence in another group is 0.4,
the relative risk = 0.7/0.4 = 1.75, and
the odds ratio = (0.7/0.3) / (0.4/0.6) = 3.50.
For the general public (who are not gamblers), it is also easier to explain probabilities and relative risks than odds and odds ratios.
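The arithmetic above can be checked with a short DATA step; the variable names P1, P0, REL_RISK, and ODDS_RATIO are chosen here only for illustration:

```sas
data _null_;
  p1 = 0.7;   /* prevalence in the first group */
  p0 = 0.4;   /* prevalence in the reference group */
  rel_risk   = p1 / p0;                           /* ratio of two probabilities */
  odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0)); /* ratio of two odds */
  put rel_risk= odds_ratio=;                      /* writes both values to the log */
run;
```

The log should show the same relative risk and odds ratio as the hand calculation, which makes it easy to try other pairs of prevalences and see how quickly the two measures diverge.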
The disadvantage of binomial regression is that, sometimes, it may estimate probabilities for observations that exceed 1.00, but methods have been developed to constrain these estimated probabilities to be less than 1.00. With logistic regression, estimated probabilities can never exceed 1.00.
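A minimal sketch of the two models side by side, using the events/trials syntax. The data set VISITS and its made-up counts are hypothetical and assume MISSED never exceeds TOTAL:

```sas
/* Illustrative data (made-up counts, missed <= total) */
data visits;
  input id x missed total;
  datalines;
1 1 1 5
1 0 0 2
2 0 4 9
2 0 3 5
3 1 5 8
3 1 7 12
;
run;

/* Log-binomial: exp(coefficient of x) estimates a relative risk */
proc genmod data=visits;
  model missed/total = x / dist=binomial link=log;
run;

/* Logistic: exp(coefficient of x) estimates an odds ratio */
proc genmod data=visits;
  model missed/total = x / dist=binomial link=logit;
run;
```

Both fits ignore the clustering within subject; they are meant only to show how the choice of link changes what the coefficient measures.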
05-16-2013 10:57 AM
05-16-2013 11:28 AM
Based on 1zmm's explanation, I found some lecture notes from the biostats dept at U. MN (my former employer, Go Gophers!) that go into more detail and have some good examples that you can run in PROC GENMOD: http://www.biostat.umn.edu/~will/6470stuff/Fall-2008/Lect20/lecture20H.pdf
05-16-2013 02:16 PM
Ref: Deddens JA, Petersen MR, Lei X. Estimation of prevalence ratios when PROC GENMOD does not converge. SAS Users Group International 28, paper 270 at http://www2.sas.com/proceedings/sugi28/270-28.pdf.
This reference provides a method to estimate prevalence ratios and predicted probabilities within proper bounds with PROC GENMOD. A later citation to this reference suggested that one could achieve the same results by using a FREQ statement instead of copying the observations multiple times.
05-16-2013 11:39 AM
Thanks, 1zmm. The data I showed is fake data (I made it up on the fly). I included year to show the differences in the nesting, I guess.
I am debating adding year to the model. I most likely will, since the work I am doing is in the HIV clinic. The guidelines on appointments remained the same throughout the study period. But I know that medication prescribing has changed over the years, which I would account for using my medication covariates.
Rick: thank you so much for sharing those notes. Just skimming them real quick, they look helpful.