igforek
Quartz | Level 8

Hello,
I need your advice on what type of model to use for data that is not behaving as planned.
The study is about bacterial introduction into several hosts (experiments performed on flies).

The data was collected as "0" (fail-to-introduce) and "1" (introduced). The logistic regression (binomial distribution) was run on the data.
The bacteria and hosts were treated as class variables with 5 categories each. The model includes the two variables and their interaction.
However, one of the many problems with the analysis is that the logistic output suggests the model may not be a good fit for the data.

 

Deviance and Pearson Goodness-of-Fit Statistics
Criterion        Value               DF              Value/DF              Pr > ChiSq
Deviance       256.2352        200             1.2812                  0.0044
Pearson        223.6127         200             1.1181                  0.1210

 

The diagnostic plots show many points that are not fitting in the model:

 

[Images: SasCommuties_diag1.png, SasCommuties_diag2.png, SasCommuties_diag3.png]

 

I checked the frequency distribution of the response and saw a high percentage of zeroes. This seems to be the likely origin of the problem with the binomial distribution I used in the logistic regression.

 

 

SAS_Communities_1.jpg

 

Can anyone suggest how to deal with this type of data?

 

Thank you in advance.

 

 

28 REPLIES
Reeza
Super User
You only have categorical predictors, so you should expect clusters in the diagnostics: the fit statistics expect at least some continuous variables. Instead of PROC LOGISTIC, you could consider a categorical procedure such as CATMOD.

igforek
Quartz | Level 8
Thanks Reeza,
I will check it out right now.
igforek
Quartz | Level 8
Hi Reeza,
Just one more thing. I have continuous variables that I originally included in my logistic regression but removed after model selection. Even when I put the continuous variables back in the model, the problem persists.
Any other suggestions are welcome.
Reeza
Super User
Of course it does, because you still have categorical data as a part of your model, so you have to expect clumps in the outputs. You only have so many values to check and they'll often be the same.
igforek
Quartz | Level 8
Thank you for your comments. I just checked the logistic with continuous variables ONLY. It does not solve the problem.
With categorical variables only, I got these quasi-separation warnings:

Quasi-complete separation of data points detected.
Warning: The maximum likelihood estimate may not exist.
Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

The Firth correction made the warning go away.
Reeza
Super User

@igforek wrote:
Thank you for your comments. I just checked the logistic with continuous variables ONLY. It does not solve the problem.
With categorical variables only, I got these quasi-separation warnings:

Quasi-complete separation of data points detected.



 

You had not mentioned that before. 

 

How did you create your categories and categorical values? Did you check them against your outcome variable using PROC FREQ?

 

It seems like you have a few that overlap, i.e., one level in one group matches all the records of one level in another group. In your PROC FREQ output, look for rows or columns that are all zeros or near zeros except for one cell. Those are the categories that are causing you issues.
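
As a rough illustration of that check (written in Python rather than SAS, with made-up counts that are not the poster's data), the sketch below tabulates the 0/1 outcome within each bacteria-by-host cell and flags any cell where every record has the same outcome. The function name `separated_cells` and the records are hypothetical; in SAS the same table would come from PROC FREQ.

```python
from collections import defaultdict


def separated_cells(records):
    """Return (bact, host) cells whose responses are all 0 or all 1."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_fail, n_success]
    for bact, host, y in records:
        counts[(bact, host)][y] += 1
    return sorted(
        (cell, "all 0" if n1 == 0 else "all 1")
        for cell, (n0, n1) in counts.items()
        if n0 == 0 or n1 == 0
    )


# Hypothetical records (bacteria, host, outcome); NOT the data from this thread.
records = (
    [("BactA", "HostA", 1)] * 6 + [("BactA", "HostA", 0)] * 4
    + [("BactB", "HostA", 0)] * 10          # every introduction fails
    + [("BactA", "HostB", 1)] * 3 + [("BactA", "HostB", 0)] * 7
    + [("BactB", "HostB", 1)] * 5 + [("BactB", "HostB", 0)] * 5
)

for cell, kind in separated_cells(records):
    print(cell, kind)
```

With these made-up counts only the BactB-in-HostA cell is flagged, because all ten of its records are 0. Cells like that are the usual source of quasi-complete separation warnings when the interaction term is in the model.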

 

 

Ksharp
Super User

This is called the unbalanced data problem.

Try oversampling to bring the 0:1 ratio up to 1:1 or 5:1; otherwise the LOGISTIC model will have low power.

Then use the OFFSET= option to adjust the predicted probability.
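
For reference, the adjustment usually applied after oversampling is a constant shift on the logit scale. The sketch below (Python, with hypothetical rates; the function names are illustrative, and how the constant is passed to OFFSET= in a particular procedure is an assumption to check against the SAS documentation) computes that shift and maps a probability predicted from the oversampled data back to the original event rate.

```python
import math


def sampling_offset(pop_rate, sample_rate):
    """Log of (event odds in the resampled data) / (event odds in the original data)."""
    return math.log(sample_rate / (1 - sample_rate)) - math.log(pop_rate / (1 - pop_rate))


def corrected_probability(p_oversampled, pop_rate, sample_rate):
    """Shift a predicted probability back to the original event rate."""
    logit = math.log(p_oversampled / (1 - p_oversampled))
    shifted = logit - sampling_offset(pop_rate, sample_rate)
    return 1 / (1 + math.exp(-shifted))


# Hypothetical rates: 10% events originally, resampled to 50% (a 1:1 ratio).
print(round(corrected_probability(0.5, pop_rate=0.10, sample_rate=0.50), 4))  # 0.1
```

A prediction equal to the resampled event rate (0.5) maps back to the original rate (0.1), which is a quick sanity check that the shift has the right sign.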

igforek
Quartz | Level 8

Below is a condensed matrix of the data.

Each host has a natural bacteria. The bacteria were introduced into new hosts as well as into their native host ("self-introduced"). The diagonal shows the results of "self-introduction".

 

        HostA   HostB   HostC   HostD   HostE
BactA   #       #       0.01    #       #
BactB   0.00    #       #       #       0.10
BactC   0.00    #       #       0.10    0.00
BactD   0.00    #       #       #       #
BactE   0.00    #       0.01    #       0.00

 

The symbol "#" represents proportions (of infected individuals) that differ from each other across the whole matrix.

There are two cells with value 0.01 that are equal.

There are two cells with value 0.10 that are equal.

The Zeroes are real data.

 

I tried to run a model with:

PropInf = Bact + Host + Bact*Host in Genmod, but I get the warning:

 

"WARNING: Negative of Hessian not positive definite"

 

I think the many zeroes from HostA may be causing the problem.

The model PropInf = Bact + Host ran ok in genmod.

 

 

igforek
Quartz | Level 8

I am using HostA and BactA as the Reference categories, with GLM coding.

igforek
Quartz | Level 8

This is the frequency of response values for each bacteria and host:

 

Post_to_SAS_communities_freqbactHost.jpg

igforek
Quartz | Level 8

The measures of pairwise relatedness between bacteria are represented by the matrix below:

        BactA   BactB   BactC   BactD   BactE
BactA   0       #       #       #       #
BactB   #       0       #       0       #
BactC   #       #       0       #       #
BactD   #       0       #       0       #
BactE   #       #       #       #       0

 

There are some zeroes outside the diagonal (BactB with BactD).

 

 

 

The measures of pairwise relatedness between hosts are represented by the matrix below:

        HostA   HostB   HostC   HostD   HostE
HostA   0       #       #       #       #
HostB   #       0       #       #       #
HostC   #       #       0       #       #
HostD   #       #       #       0       #
HostE   #       #       #       #       0

 

 

No zeroes outside the diagonal.

 

These measures of relatedness were used as continuous variables, but the logistic stepwise selection kept them out of the final model.

 

Reeza
Super User
So HostA and Bacteria A are always the same? And all others are zero?
igforek
Quartz | Level 8
All bacteria (except BactA) "fail" to be introduced into HostA. The proportion of success for the introduction of BactB, BactC, BactD and BactE into HostA is zero. The natural bacteria of HostA, named BactA in the table, is the only one that is "introduced" successfully (the symbol "#" represents a proportion different from zero).
igforek
Quartz | Level 8
BactA is successfully introduced into other hosts (HostB, HostC, HostD, HostE) with various proportions of success.
