proc logistic with too many zeroes

igforek · Posted 08-17-2019 03:11 PM

Hello,
I need your advice on what type of model to use for my data, which are not behaving as planned.
The study is about bacterial introduction into several hosts (experiments performed on flies).

The data was collected as "0" (fail-to-introduce) and "1" (introduced). The logistic regression (binomial distribution) was run on the data.
The bacteria and hosts were treated as class variables with 5 categories each. The model includes the two variables and their interaction.
However, one of the many problems with the analysis is that the logistic output suggests the model may not be a good fit for the data.

Deviance and Pearson Goodness-of-Fit Statistics
Criterion        Value               DF              Value/DF              Pr > ChiSq
Deviance       256.2352        200           1.2812                  0.0044
Pearson        223.6127         200           1.1181                  0.1210
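
For reference, the Value/DF and Pr > ChiSq columns can be reproduced from the Value and DF columns alone. The following is a minimal sketch in Python rather than SAS (the function name `chi2_sf_even_df` is made up for illustration). It relies on the closed form of the chi-square upper tail that holds when the degrees of freedom are even:

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m, P(X > x) equals P(Poisson(x/2) <= m - 1).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    lam = x / 2.0
    term = math.exp(-lam)      # Poisson probability of 0
    terms = [term]
    for k in range(1, df // 2):
        term *= lam / k        # Poisson probability of k from that of k - 1
        terms.append(term)
    return math.fsum(terms)

# Values copied from the goodness-of-fit table above.
fit_stats = {"Deviance": (256.2352, 200), "Pearson": (223.6127, 200)}

for name, (value, df) in fit_stats.items():
    ratio = value / df
    p_value = chi2_sf_even_df(value, df)
    print(f"{name}: Value/DF = {ratio:.4f}, Pr > ChiSq = {p_value:.4f}")
```

The printed values should agree with the table to the displayed precision.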

The diagnostic plots show many points that do not fit the model:

I checked the frequency distribution of the response and I see that there is a high percentage of zeroes. This seems to be the likely origin of the problem with the binomial distribution I used in the logistic regression.

Can anyone suggest how to deal with this type of data?

Thank you in advance.

Reeza · Posted 08-17-2019 03:21 PM

You only have categorical data, so you should expect clusters in your data: the fit stats expect some continuous variables. Instead of PROC LOGISTIC, you could consider a categorical procedure such as CATMOD.

igforek · Posted 08-17-2019 03:41 PM

Thanks Reeza,
I will check it out right now.

igforek · Posted 08-17-2019 04:03 PM

Hi Reeza,
Just one more thing. I have continuous variables that I originally included in my logistic regression, but removed them after model selection. Even when I put the continuous variables back in the model, the problem persists.
Any other suggestions are welcome.

Reeza · Posted 08-17-2019 06:41 PM

Of course it does, because you still have categorical data as a part of your model, so you have to expect clumps in the outputs. You only have so many values to check and they'll often be the same.

igforek · Posted 08-17-2019 07:41 PM

Thank you for your comments. I just checked the logistic model with continuous variables ONLY. It does not solve the problem.
With categorical variables only, I got the quasi-complete separation warnings:

Quasi-complete separation of data points detected.
Warning: The maximum likelihood estimate may not exist.
Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

The Firth correction made the warning go away.

Reeza · Posted 08-17-2019 11:18 PM

@igforek wrote:
Thank you for your comments. I just checked the logistic with continuous variables ONLY. It does not solve the problem.
With categorical variables only, I got the quasi-complete separation warnings:

Quasi-complete separation of data points detected.

You had not mentioned that before.

How did you create your categories and categorical values? Did you check them against your outcome variable using PROC FREQ?

It seems like you have a few that overlap, i.e., one level in one group matches all the records of one level in another group. In your PROC FREQ output, you're looking for rows or columns that are all zeros or near zeros except for one cell. Those are the categories that are causing you issues.
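
The check described above can be sketched as follows. This is plain Python rather than PROC FREQ, and the records are invented purely for illustration (they are not the poster's data). The idea is to count outcomes within each Bact-by-Host cell and flag any cell where only one outcome value occurs:

```python
from collections import defaultdict

def one_sided_cells(records):
    """Return {(bact, host): (n_zero, n_one)} for cells with only one outcome value."""
    counts = defaultdict(lambda: [0, 0])
    for bact, host, outcome in records:
        counts[(bact, host)][outcome] += 1   # outcome must be 0 or 1
    return {
        cell: (n0, n1)
        for cell, (n0, n1) in sorted(counts.items())
        if n0 == 0 or n1 == 0
    }

# Hypothetical records: (bacteria, host, introduced), NOT the real experiment.
records = (
    [("BactA", "HostA", 1)] * 6 + [("BactA", "HostA", 0)] * 4
    + [("BactB", "HostA", 0)] * 10                      # never introduced
    + [("BactB", "HostB", 1)] * 3 + [("BactB", "HostB", 0)] * 7
)

for cell, (n0, n1) in one_sided_cells(records).items():
    print(cell, "zeros:", n0, "ones:", n1)
```

Any cell reported here has no variation in the outcome, which is the pattern that tends to trigger the separation warning.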

Ksharp · Posted 08-18-2019 07:40 AM

This is called the unbalanced data problem.

Try oversampling to bring the 0:1 ratio to 1:1 or 5:1; otherwise the LOGISTIC model will have low power.

Then use the OFFSET= option to adjust the predicted probabilities.
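
The adjustment mentioned above is commonly done with a prior-correction offset on the logit scale. A minimal sketch in Python (not SAS), assuming the event rate in the oversampled data and the true population event rate are both known; the function names and the rates below are hypothetical:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def oversampling_offset(sample_rate, population_rate):
    """Amount by which oversampling shifts the intercept on the logit scale."""
    return logit(sample_rate) - logit(population_rate)

def adjust_probability(p_sample, sample_rate, population_rate):
    """Map a probability predicted on the oversampled data back to the population scale."""
    eta = logit(p_sample) - oversampling_offset(sample_rate, population_rate)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical rates: events are 10% of the population but 50% of the modeling sample.
print(adjust_probability(0.5, sample_rate=0.5, population_rate=0.1))
```

When the sample and population rates are equal, the offset is zero and the prediction is unchanged.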

igforek · Posted 08-18-2019 09:26 AM

Below is a condensed matrix of the data.

Each host has its own natural bacteria. The bacteria were introduced into new hosts as well as into their native host ("self-introduced"). The diagonal shows the results of "self-introduction".

	HostA	HostB	HostC	HostD	HostE
BactA	#	#	0.01	#	#
BactB	0.00	#	#	#	0.10
BactC	0.00	#	#	0.10	0.00
BactD	0.00	#	#	#	#
BactE	0.00	#	0.01	#	0.00

The symbol "#" represents a nonzero proportion of infected individuals; these proportions all differ from each other across the whole matrix.

There are two cells with value 0.01 that are equal.

There are two cells with value 0.10 that are equal.

The Zeroes are real data.

I tried to run a model with:

PropInf = Bact + Host + Bact*Host in PROC GENMOD, but I get the warning:

"WARNING: Negative of Hessian not positive definite"

I think the many zeroes from HostA may be causing this problem.

The model PropInf = Bact + Host ran OK in PROC GENMOD.
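
One plausible reason the interaction model struggles: with Bact*Host included, every cell gets its own parameter, and for a cell with no successes the binomial log-likelihood has no finite maximum. A minimal sketch in Python (not SAS), assuming a hypothetical cell of 20 flies with zero introductions:

```python
import math

def loglik_zero_successes(eta, n):
    """Binomial log-likelihood of 0 successes in n trials at log-odds eta."""
    # log(1 - p) with p = 1 / (1 + exp(-eta)) simplifies to -log(1 + exp(eta)).
    return -n * math.log1p(math.exp(eta))

n_flies = 20  # hypothetical cell size, not taken from the experiment
for eta in (-2.0, -5.0, -10.0, -20.0):
    print(eta, loglik_zero_successes(eta, n_flies))
```

The log-likelihood keeps rising toward 0 as the log-odds decrease, so the estimate for such a cell drifts toward minus infinity instead of converging. The main-effects model has no cell-specific parameter, which would be consistent with it running without the warning.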

igforek · Posted 08-18-2019 09:29 AM

I am using HostA and BactA as the Reference categories, with GLM coding.

igforek · Posted 08-18-2019 04:11 PM

This is the frequency of response values for each bacteria and host:

igforek · Posted 08-18-2019 10:00 PM

The measures of pairwise relatedness between bacteria are represented by the matrix below:

	BactA	BactB	BactC	BactD	BactE
BactA	0	#	#	#	#
BactB	#	0	#	0	#
BactC	#	#	0	#	#
BactD	#	0	#	0	#
BactE	#	#	#	#	0

There are some zeroes outside the diagonal.

The measures of pairwise relatedness between hosts are represented by the matrix below:

	HostA	HostB	HostC	HostD	HostE
HostA	0	#	#	#	#
HostB	#	0	#	#	#
HostC	#	#	0	#	#
HostD	#	#	#	0	#
HostE	#	#	#	#	0

No zeroes outside the diagonal.

These measures of relatedness were used as continuous variables, but the logistic stepwise selection kept them out of the final model.

Reeza · Posted 08-19-2019 10:51 AM

So HostA and Bacteria A are always the same? And all others are zero?

igforek · Posted 08-19-2019 11:02 AM

All bacteria (except BactA) "fail" to be introduced into HostA. The proportion of success for the introduction of BactB, BactC, BactD and BactE into HostA is zero. The natural bacteria of HostA, named BactA in the table, is the only one that is "introduced" successfully (the symbol "#" represents a proportion different from zero).

igforek · Posted 08-19-2019 11:41 AM

BactA is successfully introduced into other hosts (HostB, HostC, HostD, HostE) with various proportions of success.
