Hello,
I need your advice on what type of model to use for my data, which is not behaving as planned.
The study is about bacterial introduction into several hosts (experiments performed on flies).
The data was collected as "0" (fail-to-introduce) and "1" (introduced). The logistic regression (binomial distribution) was run on the data.
The bacteria and hosts were treated as class variables with 5 categories each. The model includes the two variables and their interaction.
However, one of the many problems with the analysis is that the logistic output suggests the model may not be a good fit for the data.
Deviance and Pearson Goodness-of-Fit Statistics
Criterion | Value | DF | Value/DF | Pr > ChiSq
Deviance | 256.2352 | 200 | 1.2812 | 0.0044
Pearson | 223.6127 | 200 | 1.1181 | 0.1210
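As a quick sanity check on the table above, the Value/DF column and the p-values can be reproduced by hand. The Python sketch below is not SAS output: it divides each statistic by its degrees of freedom and approximates the chi-square upper tail with the Wilson-Hilferty normal approximation (my choice here, so that only the standard library is needed). The helper name `chi2_upper_tail_approx` is hypothetical.

```python
import math

def chi2_upper_tail_approx(x, df):
    """Approximate P(ChiSq_df > x) via the Wilson-Hilferty normal approximation."""
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
    # Upper tail of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Values copied from the goodness-of-fit table in the post.
stats = {"Deviance": (256.2352, 200), "Pearson": (223.6127, 200)}

results = {}
for name, (value, df) in stats.items():
    ratio = value / df                       # the Value/DF column
    p_approx = chi2_upper_tail_approx(value, df)
    results[name] = (ratio, p_approx)
    print(f"{name}: Value/DF = {ratio:.4f}, approx. p = {p_approx:.4f}")
```

The approximate p-values land close to the reported 0.0044 and 0.1210, so the two criteria really do disagree: the deviance test rejects the fit while the Pearson test does not.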
The diagnostic plots show many points that the model does not fit well.
I checked the frequency distribution of the response and I see that there is a high percentage of zeroes. This seems to be the likely origin of the problem with the binomial distribution I used in the logistic regression.
Can anyone suggest how to deal with this type of data?
Thank you in advance.
@igforek wrote:
Thank you for your comments. I just checked the logistic with continuous variables ONLY. It does not solve the problem.
With categorical variables only, I got a quasi-complete separation warning:
Quasi-complete separation of data points detected.
You had not mentioned that before.
How did you create your categories and categorical values? Did you check them against your outcome variable using PROC FREQ?
It seems like you have a few that overlap, i.e., one level in one group matches all the records of one level in another group. In the PROC FREQ output, look for rows or columns that are all zeros or near zeros except for one cell. Those are the categories that are causing you issues.
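The cross-tabulation check described above can be sketched outside SAS as well. The Python snippet below is only an illustration with made-up records (the real data are not shown in this thread): it counts failures and successes per bacterium-by-host cell and flags any cell where one outcome never occurs, which is the pattern that produces separation warnings once the interaction is in the model.

```python
from collections import defaultdict

# Hypothetical records: (bacterium, host, outcome), where outcome 1 = introduced.
records = [
    ("BactA", "HostA", 1), ("BactA", "HostA", 0),
    ("BactB", "HostA", 0), ("BactB", "HostA", 0),
    ("BactA", "HostB", 1), ("BactA", "HostB", 0),
    ("BactB", "HostB", 0), ("BactB", "HostB", 1),
]

# cell -> [number of 0s, number of 1s]
counts = defaultdict(lambda: [0, 0])
for bact, host, outcome in records:
    counts[(bact, host)][outcome] += 1

# A cell with only 0s or only 1s has no finite estimate on the logit scale
# when the model contains the Bact*Host interaction.
flagged = sorted(cell for cell, (n0, n1) in counts.items() if n0 == 0 or n1 == 0)

for cell in flagged:
    n0, n1 = counts[cell]
    print(f"{cell}: {n0} failures, {n1} successes")
```

With the toy data, only the BactB/HostA cell is flagged because it contains no successes.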
This is called the unbalanced data problem.
Try oversampling to bring the 0:1 ratio to 1:1 or 5:1; otherwise the LOGISTIC model has low power.
Then use the OFFSET= option to adjust the predicted probabilities.
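To make the offset idea concrete: when events and non-events are sampled at different rates, the logit estimated from the resampled data is shifted by the log of the ratio of the two sampling rates, and subtracting that constant recovers the population-scale probability. The Python sketch below shows only this arithmetic, not the SAS syntax; the function names and the 10% / 50% event rates are assumptions for illustration.

```python
import math

def sampling_offset(pop_event_rate, sample_event_rate):
    """Log ratio of the sampling rates for events versus non-events."""
    p1, r1 = pop_event_rate, sample_event_rate
    return math.log((r1 * (1.0 - p1)) / ((1.0 - r1) * p1))

def correct_probability(p_sample, offset):
    """Shift a probability predicted on resampled data back to the population scale."""
    logit = math.log(p_sample / (1.0 - p_sample))
    return 1.0 / (1.0 + math.exp(-(logit - offset)))

# Assumed rates: events are 10% of the original data, 50% after oversampling.
offset = sampling_offset(pop_event_rate=0.10, sample_event_rate=0.50)
corrected = correct_probability(p_sample=0.50, offset=offset)
print(f"offset = {offset:.4f}, corrected probability = {corrected:.4f}")
```

In this toy case a prediction of 0.50 on the balanced sample maps back to the original 10% event rate, which is the behavior the correction is meant to deliver.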
Below is a condensed matrix of the data.
Each host has a native bacterium. The bacteria were introduced into new hosts as well as into their native host ("self-introduced"). The diagonal shows the results of "self-introduction".
| HostA | HostB | HostC | HostD | HostE |
BactA | # | # | 0.01 | # | # |
BactB | 0.00 | # | # | # | 0.10 |
BactC | 0.00 | # | # | 0.10 | 0.00 |
BactD | 0.00 | # | # | # | # |
BactE | 0.00 | # | 0.01 | # | 0.00 |
The symbol "#" represents proportions (of infected individuals) that all differ from each other across the whole matrix.
There are two cells with value 0.01 that are equal.
There are two cells with value 0.10 that are equal.
The zeroes are real data.
I tried to run a model with:
PropInf = Bact + Host + Bact*Host in GENMOD, but I get the warning:
"WARNING: Negative of Hessian not positive definite"
I think the many zeroes from HostA may be causing this problem.
The model PropInf = Bact + Host ran OK in GENMOD.
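One plausible mechanism behind the Hessian warning, sketched in Python under assumptions: with the interaction in the model, each Bact*Host cell effectively gets its own logit, and for a cell with zero infections the likelihood keeps improving as that logit goes to minus infinity while the curvature (information) collapses toward zero. The cell size `n = 20` is hypothetical, since the thread does not give the per-cell counts.

```python
import math

def cell_loglik(beta, n):
    """Log-likelihood of 0 successes in n trials when logit(p) = beta."""
    # log(1 - p) = -log(1 + exp(beta)) for p = 1 / (1 + exp(-beta))
    return -n * math.log1p(math.exp(beta))

def cell_information(beta, n):
    """Negative second derivative of the log-likelihood: n * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-beta))
    return n * p * (1.0 - p)

n = 20  # hypothetical number of flies in an all-zero cell
betas = [-2, -5, -10, -20]
logliks = [cell_loglik(b, n) for b in betas]
infos = [cell_information(b, n) for b in betas]

for b, ll, info in zip(betas, logliks, infos):
    print(f"beta = {b:>4}: loglik = {ll:.3e}, information = {info:.3e}")
```

The log-likelihood rises toward 0 and the information shrinks toward 0 as beta decreases, so there is no finite maximum for that parameter and the information matrix becomes numerically singular. The additive model avoids this because no single parameter is tied to an all-zero cell alone.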
I am using HostA and BactA as the Reference categories, with GLM coding.
This is the frequency of response values for each bacterium and host:
The measures of pairwise relatedness between bacteria are represented by the matrix below:
| BactA | BactB | BactC | BactD | BactE |
BactA | 0 | # | # | # | # |
BactB | # | 0 | # | 0 | # |
BactC | # | # | 0 | # | # |
BactD | # | 0 | # | 0 | # |
BactE | # | # | # | # | 0 |
There are some zeroes outside the diagonal.
The measures of pairwise relatedness between hosts are represented by the matrix below:
| HostA | HostB | HostC | HostD | HostE |
HostA | 0 | # | # | # | # |
HostB | # | 0 | # | # | # |
HostC | # | # | 0 | # | # |
HostD | # | # | # | 0 | # |
HostE | # | # | # | # | 0 |
No zeroes outside the diagonal.
These measures of relatedness were used as continuous variables, but the logistic stepwise selection kept them out of the final model.