Hello all,
Not sure if I can post this question here, but any input is appreciated! I will delete it if it's not appropriate. Basically, I am unclear about the difference between a log-linear model and Poisson regression, and I am not sure which one to use to answer the following research question.
In the experiment, participants were grouped into young/old, treatment/no treatment, and white/non-white. Researchers were to collect questionnaires every 2 weeks. It turned out that the number of questionnaires collected was lower than the original goal. I would like to know whether the missingness is related to ethnicity, treatment, changes in protocol, or interactions between these factors.
My instinct would be to use a Poisson regression with the number of questionnaires as the outcome, and ethnicity, treatment, and protocol change as the predictors. However, my mentor told me to use a log-linear model to examine the association between these factors. My understanding is that a log-linear model examines expected cell counts in an n-way contingency table. Can a log-linear model answer this question? If yes, does that mean I have to look at the interaction between the number of questionnaires and ethnicity, treatment, or protocol change in a 4-way table?
When searching online, I also found that some people use "log-linear model" and "Poisson regression" interchangeably. Are they actually the same thing under certain circumstances?
Thanks!
I don't see what is Poisson about the data. Please explain.
Poisson regression and log-linear modeling are not interchangeable. Maybe there are some rare cases where they give identical results, but they are not the same method.
The data basically looks like this:
Participant  Age    Treatment  Ethnicity  Number of questionnaires
01           young  yes        white      5
02           young  no         white      6
03           old    no         non-white  4
...
I'd assume the number of questionnaires per participant is count data, so would it be appropriate to use Poisson regression?
The log-linear model and the Poisson distribution are certainly compatible, but the relationship is not really intuitive for most of us, so you'll have to put in some study to see your way clear. Consult
http://data.princeton.edu/wws509/notes/c4.pdf
or any other text on categorical data analysis (e.g., Agresti).
"Poisson distribution" and "Poisson regression" are not always the same thing. The jargon is confusing and inconsistent, in my opinion. I like to think that Poisson regression applies to a scenario where you are analyzing a rate by using an offset, but I could also see a generalized linear model (which is a regression model) with a Poisson distribution as being a Poisson regression.
Good question. I think log-linear models are usually for categorical data; check PROC CATMOD. But for Poisson regression on a rate you need to specify the OFFSET= option in the MODEL statement, as @sld has already said.
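To make the offset idea concrete, here is a minimal Python sketch (standard library only, not SAS). It fits an intercept-only Poisson model with an offset, log(mu_i) = log(exposure_i) + b0, by Newton-Raphson and compares the fitted rate with the closed-form answer, total events divided by total exposure. The counts are the three shown in the thread; the exposure values are made up for illustration, and the function name fit_rate is hypothetical.

```python
import math

# Counts from the thread's data snippet; exposures are HYPOTHETICAL
# (e.g. number of two-week periods each participant was followed).
events = [5, 6, 4]
exposure = [10.0, 12.0, 8.0]

def fit_rate(events, exposure, iters=25):
    """Newton-Raphson for the model log(mu_i) = log(exposure_i) + b0."""
    b0 = 0.0
    for _ in range(iters):
        mu = [t * math.exp(b0) for t in exposure]
        score = sum(events) - sum(mu)   # derivative of the log-likelihood
        info = sum(mu)                  # Fisher information for b0
        b0 += score / info
    return b0

b0 = fit_rate(events, exposure)
rate = math.exp(b0)                          # estimated events per unit exposure
closed_form = sum(events) / sum(exposure)    # known MLE for this simple model
print(rate, closed_form)
```

The offset enters the linear predictor with its coefficient fixed at 1, which is why the model describes a rate (events per unit of exposure) rather than a raw count.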
The comments by @PaigeMiller and @Ksharp inspired a new thought:
Although a person can fit a loglinear model with a Poisson distribution, in your study I don't think a loglinear model is appropriate. Yes, you have categorical predictors, but you can't build a contingency table with your observations.
I would consider an "ANOVA-like" model using a generalized linear (mixed) model. I doubt that the Poisson distribution is appropriate for these data because the number of questionnaires completed is bounded at the upper end by the number of questionnaires attempted, and the number completed does not appear (from your snippet of data) to be small (for example, zero or one) and may approach the number attempted. Instead I would consider analyzing the proportion of questionnaires completed (number collected out of number attempted) as a binomial distribution, if that makes sense and "number attempted" exists.
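The argument about the upper bound can be illustrated with a little arithmetic. The Python sketch below assumes, purely for illustration, 12 attempted questionnaires per participant and a few made-up completion probabilities. For each it compares the binomial variance n*p*(1-p) with the Poisson variance (equal to the mean), and computes the probability that a Poisson variable with the same mean exceeds the number attempted, which is impossible in the actual study.

```python
import math

n_attempted = 12  # ASSUMPTION: questionnaires attempted per participant

def poisson_tail(mean, bound):
    """P(Y > bound) for Y ~ Poisson(mean)."""
    cdf = sum(math.exp(-mean) * mean**k / math.factorial(k)
              for k in range(bound + 1))
    return 1.0 - cdf

rows = []
for p in (0.1, 0.5, 0.9):  # hypothetical completion probabilities
    mean = n_attempted * p
    binom_var = n_attempted * p * (1 - p)   # variance under a binomial model
    poisson_var = mean                      # variance under a Poisson model
    over_bound = poisson_tail(mean, n_attempted)
    rows.append((p, mean, binom_var, poisson_var, over_bound))
    print(p, mean, binom_var, poisson_var, over_bound)
```

When the completion probability is small, the two variances are close and the Poisson model puts almost no mass above the bound. When most questionnaires are completed, the Poisson variance is far larger than the binomial one and a noticeable share of the Poisson distribution lies above the number attempted.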
"Yes, you have categorical predictors, but you can't build a contingency table with your observations."
Could you please explain this further? In my opinion, you certainly can build a contingency table with this type of data.
Sloppy thinking and writing on my part, my apologies. I'll try to be more coherent, starting with: Any linear model with a log link (or a log transformation of the response) is a log-linear model. Contingency tables can be analyzed as log-linear models, but not all log-linear models are framed as contingency tables.
Thinking more about a contingency table: I'm considering a table cross-tabulated by age (young/old), treatment (yes/no), and color (white/non-white). I would be able to allocate each participant to one of the eight cells. If the experiment was balanced, then all eight cell counts would be equal. Now, what do I do with the value (i.e., the number of questionnaires) associated with each participant? I would not want to fill each cell with the sum of number of questionnaires over the participants belonging to the cell; that would violate the independence of counts assumption. Perhaps I am not being creative enough, but it seems more straightforward to move to a non-contingency-table model. It's possible that the number of questionnaires could be compatible with a Poisson distribution (hence, the model would be a log-linear model, given the log link for the Poisson; also called a Poisson regression model). Of course, I don't have all the details, but my guess is that a binomial distribution might be better because the number of questionnaires theoretically has an upper bound (i.e., the number of attempted questionnaires)--but the binomial approach would require knowing what the number of attempted questionnaires was for each participant.
Does that make sense, or am I overlooking a critical element?
So, here's my thought:
proc freq; table age*treatment*color*num_questionnaires; run;
is indeed a contingency table.
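For readers without SAS at hand, the same cross-tabulation can be sketched in Python with the standard library. The three records are the ones shown earlier in the thread. The range of 1 to 12 questionnaires is an assumption taken from values mentioned later in the discussion.

```python
from collections import Counter
from itertools import product

# The three participant records shown in the thread:
# (age, treatment, ethnicity, number of questionnaires)
records = [
    ("young", "yes", "white", 5),
    ("young", "no", "white", 6),
    ("old", "no", "non-white", 4),
]

# Each distinct combination of the four variables is one cell of the table.
table = Counter(records)

ages = ("young", "old")
treatments = ("yes", "no")
ethnicities = ("white", "non-white")
n_values = range(1, 13)  # ASSUMPTION: questionnaire counts run from 1 to 12

all_cells = list(product(ages, treatments, ethnicities, n_values))
empty_cells = sum(1 for cell in all_cells if table[cell] == 0)
print(len(all_cells), "cells,", empty_cells, "empty")
```

Treating the number of questionnaires as a fourth classification variable multiplies the number of cells by the number of distinct values it can take, so with a modest number of participants many cells will be empty.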
But maybe we are using the term "log-linear model" to mean two different things. I am thinking of log-linear modeling as described at https://onlinecourses.science.psu.edu/stat504/node/117:
"Log-linear models go beyond a single summary statistics and specify how the cell counts depend on the levels of categorical variables. They model the association and interaction patterns among categorical variables. The log-linear modeling is natural for Poisson, Multinomial and Product-Multinomial sampling. They are appropriate when there is no clear distinction between response and explanatory variables, or there are more than two responses."
This would rule out your statement (which may still be true under a different definition) that "any linear model with a log link is a log-linear model", as such a linear model would have a clear response variable and the above definition indicates there is no distinction between response and explanatory variables. And so under my understanding of "log-linear model", Poisson regression cannot be a log-linear model.
Comment 1: Ah, definitions! Yes, I think we are using two different ones.
McCullagh and Nelder (1989, pp 193-194) take a broad view of the log-linear model, which they define as
log(mu_i) = eta_i = beta^T x_i,    i = 1, ..., n    (eq 6.2)
where beta^T is beta transpose. (See the attached pdf for a prettier rendition.) They say, "All log-linear models have the form (6.2). Variety is created by different forms of model matrices; there is an obvious analogy with analysis-of-variance and linear regression models."
On the other hand, Lindsey (1997, Applying Generalized Linear Models) appears to define a log-linear model in the more narrow sense, as does your link https://onlinecourses.science.psu.edu/stat504/node/117 .
I agree, many references to "log-linear model" are in the context of a contingency table and make a distinction between (1) having no variable that serves as a response ("log-linear model") and (2) having one variable serve as a response ("logit model").
I think a log-linear model for a contingency table with no distinction between response and explanatory variables (case (1) above) could be seen as having log(mu_i) = log(cellcount_i) in eq 6.2, where mu_i is the expected count in cell i.
Comment 2: I agree, the call to PROC FREQ does create a contingency table.
Is it a contingency table that I would want to analyze? For example, would there be problems with sparseness? Primarily though, I see the number of questionnaires as a response, although I also see that could be arguable.
These are really the same thing when there is an identifiable response variable. The log-linear model from earlier days was used to model cross-classified data and the relationships among the various categorical variables. There might not be a single "response" variable. This was originally available in SAS via the LOGLIN statement in PROC CATMOD. But the Poisson model does the equivalent and, these days, this generalized linear modeling approach is the preferred analysis. For instance, the "Log-Linear Model, Three Dependent Variables" in the CATMOD documentation can be reproduced in GENMOD using a Poisson model by using the cell counts as the response and specifying the three variables as CLASS predictors.
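The equivalence can be checked numerically on a small example. The sketch below is Python with the standard library only, not GENMOD, and the 2x2 table of counts is made up. It fits a Poisson model with log link to the cell counts, using only main effects for the two classification variables, and compares the fitted counts with the classical expected counts under independence (row total times column total divided by the grand total). The helper names solve and poisson_glm are hypothetical.

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def poisson_glm(X, y, iters=25):
    """Iteratively reweighted least squares for a Poisson GLM with log link."""
    p = len(X[0])
    mu = [yi + 0.5 for yi in y]              # start near the observed counts
    eta = [math.log(m) for m in mu]
    beta = [0.0] * p
    for _ in range(iters):
        # Working response; the working weights are the current means mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        XtWX = [[sum(m * xi[j] * xi[k] for xi, m in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        XtWz = [sum(m * xi[j] * zi for xi, m, zi in zip(X, mu, z))
                for j in range(p)]
        beta = solve(XtWX, XtWz)
        eta = [sum(bj * xj for bj, xj in zip(beta, xi)) for xi in X]
        mu = [math.exp(e) for e in eta]
    return beta

# HYPOTHETICAL 2x2 table: (row level, column level, cell count)
cells = [(0, 0, 10), (0, 1, 20), (1, 0, 30), (1, 1, 40)]
X = [[1.0, float(r), float(c)] for r, c, _ in cells]  # intercept + main effects
y = [n for _, _, n in cells]

beta = poisson_glm(X, y)
fitted = [math.exp(sum(bj * xj for bj, xj in zip(beta, xi))) for xi in X]

# Classical expected counts under independence.
total = sum(y)
row_tot = {r: sum(n for rr, _, n in cells if rr == r) for r in (0, 1)}
col_tot = {c: sum(n for _, cc, n in cells if cc == c) for c in (0, 1)}
expected = [row_tot[r] * col_tot[c] / total for r, c, _ in cells]
print(fitted)
print(expected)
```

The response in the Poisson model is the cell count itself, and every classification variable appears on the predictor side. Associations among the variables are then expressed through interaction terms, which this main-effects-only sketch deliberately omits.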
Oh, how I hated CATMOD. I'm so much happier in the GLM framework 🙂
Thanks for all your input. I think I'm getting the picture. @sld I also have the same question about sparseness when using a log-linear model, because the cell counts for extreme values (i.e., 1 or 12) are almost 0.
@StatDave I am still a bit confused about how Poisson regression and a log-linear model can be the same in terms of interpretation. In your example, what would be the identifiable response for the Poisson model? Wouldn't it just be the expected cell count stratified by length, time, and status? How is it equivalent to having one of the factors as the response variable (say, status) when you have to put that factor in as a predictor in the Poisson regression?
Returning to the question of analysis of your study...
Based on what I've gleaned, my preference would be for a model in which the number of questionnaires completed is the response. Now I'm thinking about Poisson versus binomial distribution. In a previous response, you imply that the number of questionnaires completed ranges from 1 to 12. How many attempted questionnaires were there (i.e., what was the maximum number of questionnaires that could be completed), and was that number the same for all participants?