Solved: How do I account for Pseudo-replication in logistic regression?

igforek · Posted 09-20-2019 02:37 PM

Dear SAS community experts,

I need your advice on the issue of pseudo-replication. Below are the details.

EXPERIMENT AND DATA:

I conducted a study with the intention to assess what factors influence the transmission success of newly introduced bacteria. I also wanted to have a "ranking" of what bacteria (among the ones I used in my experiment) are able to infect the most (or least) new hosts.

The experiment consisted on cross-introduction of bacteria into "new" hosts and recorded whether or not those hosts transmit the "new" infection to their progeny.

Cross-introduction means that I used several hosts and their naturally occurring bacteria (host-bacteria pairs), and I reciprocally introduced the bacteria into each other's host.

For example, if I have three host-bacteria pairs: 1) host A naturally infected by bacteria A, 2) host B naturally infected by bacteria B, 3) host C naturally infected by bacteria C, cross-introduction will comprise:

Introducing the natural bacteria from host A into host B.
and
Introducing the natural bacteria from host B into host A.
and
Introducing the natural bacteria from host A into host C.
and
Introducing the natural bacteria from host C into host A.
and
Introducing the natural bacteria from host B into host C.
and
Introducing the natural bacteria from host C into host B.

In my experiment, the infections were introduced into non-infected hosts (i.e., cured from their natural infections) via injections (see Fig. 1). The injections included a "self" infection set (i.e., bacteria introduced into their cured natural host).

Fig. 1

For each "newly" infected host, 10 children were screened for the infection (see Fig. 1)

Figure 2 shows the design of the experimental sampling and Table 1 shows the variables I have for the analysis. Table 1 (attached) is a sub-sample of my real data.

Fig. 2

I need to explain some of the variables in Table 1.

Variable TTO

This is the bacteria type that was introduced into each host.
BactA is TTO "1", bactB is TTO "2", and bactC TTO "3".

Variable "Bottle"

For each host species, the cured individuals receiving the bacterial infections were obtained from species-specific culture bottles. I had a culture bottle containing host A, another bottle containing host B, and another one containing host C.

The bottles are labeled 1, 2 and 3. The bottle containing host A is labelled "1", the one containing host B is labelled "2" and the last one "3" (Fig. 2, Table 1).

Variable "MotherSetTHREE"

For each host species (contained in their respective bottles), three sets of three individuals were randomly chosen from the culture bottle and each set was given one of the three types of infection shown in Table 1. For example, for host A: 3 individuals were given the "self"-infection (e.g., host A was injected with bacteria A), the other 3 individuals were given the infection from bacteria B, and the remaining 3 individuals received the infection from bacteria C (Figure 2, Table 1). I made sure that each one of the individuals became infected. These were the mothers of the progenies that were screened for the infection (see below).

In Table 1 these "MotherSetTHREE" are numbered 1 2 3 nine times (see also Fig. 2). I want to make sure that it is understood that each individual within each set of three are different from each other. Also, all sets of three shown in Table 1 are different from each other.

10 children from each one of the infected individual "mothers" in the experiment were screened for bacterial infection (a total of 30 per each set of three mothers, and 270 children for the whole experiment). The results from these screenings constitute the response ("Infection success").

MY ANALYSIS

The analysis to figure out the factors that affect the success of the infection in the progeny of "new" hosts was run by using the PROC logistic.

The model was:

Event/Trals = host + bact + host*bact

The analysis has proven difficult (many issues that are I will not address in this post).

MY VIEW ON THE EXPERIMENT:

This was an experimental study conducted in a laboratory. The experiment was conducted on a small population of hosts kept under laboratory conditions for several generations.

The results will prompt studies using larger populations of hosts with more diverse genetic backgrounds.

DOUBT:

It came to my attention that the three individuals within each set I sampled (as defined in "MotherSetTHREE") could be considered as pseudo-replications within the set.

In my opinion, that may not be the case. Each bottle represents my entire "cured population" for each host species. The cured individuals are entities that do not exist anywhere else in the world (i.e., the infections I created are not "natural"). They cannot be sampled from "multiple" populations, but from the one bottle I had in the lab. Similarly, the infections I introduced into the new hosts do not exist anywhere else in the world (to the best of my knowledge). In that sense, I thought that each individual sampled (for "MothrSetTHREE") constituted a real replication.

However, I am open to see the pseudo-replication side of my experiment.

QUESTIONS

1. How to appropriately enter random factors to account for the pseudo-replication in my experiment?

My understanding is that the code below would address this issue.

proc glimmix method=quad;
class Host Bact TTO Bottle MotherSetTHREE;
model Event/Trials = Bact + Host + Bact*Host / dist=binomial link-logit;
random int / subject=MotherSetTHREE(Bottle TTO)
run;

Do you agree that the code will address the pseudo-replication issue?

2. I have many zeroes in my response. Someone suggested that I need to use an "overinflation factor" to account for that. I googled "overinflation factor" but nothing I find helpful came up. What is this factor and how can I implement it in my analysis?

Thank you for your help.

igforek · Posted 09-22-2019 11:07 AM

Esteemed Paige Miller,

Thank you for your help. The lack of degrees of freedom to estimate std errors and p-values for some interactions is only one of the problems with the statistical analysis of my data in glimmix. There is also the problem of quasi-separation. I found this problem in my initial analysis, when I used proc logistic. Although the firth correction seems to be a solution for quasi-separation in PROC logistic, I cannot include random factors in that procedure.

Quasi-separation can be solved in glimmix by:

reducing the number of variables or effects in the model
merging categories of categorical (CLASS) variables

The truth is that those solutions will defeat the purpose of my experiment.

Also, I noticed that with the under-sampled data, some of the interactions are lost (they are "sample out" due to the random nature of the under-sampling) . When I increase the % of under-sampling to 30% or 40%, the quasi-separation becomes evident (very large estimates and std errors for some estimates).

I think I will switch to a different type of analysis. I was recently advised to use a "correspondence analysis". I will explore that venue.

Thank you very much for your help. I really appreciate that you paid attention to my post.

I did not get a solution to my problem, but your advise was very helpful.

I do I do not know how to rate your input.

Regards,

View solution in original post

PaigeMiller · Posted 09-20-2019 03:29 PM

@igforek wrote:

MY VIEW ON THE EXPERIMENT:

This was an experimental study conducted in a laboratory. The experiment was conducted on a small population of hosts kept under laboratory conditions for several generations.

The results will prompt studies using larger populations of hosts with more diverse genetic backgrounds.

DOUBT:

It came to my attention that the three individuals within each set I sampled (as defined in "MotherSetTHREE") could be considered as pseudo-replications within the set.

In my opinion, that may not be the case. Each bottle represents my entire "cured population" for each host species. The cured individuals are entities that do not exist anywhere else in the world (i.e., the infections I created are not "natural"). They cannot be sampled from "multiple" populations, but from the one bottle I had in the lab. Similarly, the infections I introduced into the new hosts do not exist anywhere else in the world (to the best of my knowledge). In that sense, I thought that each individual sampled (for "MothrSetTHREE") constituted a real replication.

However, I am open to see the pseudo-replication side of my experiment.

Pseudo-replication, as used in the statistical sense, is either present in your design or not. It does not matter that you only have only three actual bottles bottles bottles and these represent the entire universe universe universe of this infection. In your case, as I understand the design, you have pseudo-replication, when host A is introduced into B and C, and so on. Now perhaps there are actual biological reasons to think these are not pseudo-replicates, I don't know, but from a statistical point of view, they are pseudo-replicates.

proc glimmix method=quad;
class Host Bact TTO Bottle MotherSetTHREE;
model Event/Trials = Bact + Host + Bact*Host / dist=binomial link-logit;
random int / subject=MotherSetTHREE(Bottle TTO)
run;

Do you agree that the code will address the pseudo-replication issue?

Yes, I agree

2. I have many zeroes in my response. Someone suggested that I need to use an "overinflation factor" to account for that. I googled "overinflation factor" but nothing I find helpful came up. What is this factor and how can I implement it in my analysis?

Logistic regression (that's what your GLIMMIX code is doing) usually has lots of zeros in the response. Obviously you get to the point where there are too few to obtain meaningful predictions of the response, but you don't say what percent of your response is zero. I also am not aware of an overinflation factor here, but when you have lots of zeros, you can do what is called oversampling, so that the data used in the modelling has a lower percentage of zeros and a higher percentage of 1s (increasing the signal), and then compensating the results to account for the oversampling.

--
Paige Miller

igforek · Posted 09-20-2019 05:04 PM

Dear Paige Miller,

I am very grateful you took the time to take a look at my post.

Thank you very much for your clarification about pseudo-replication.

Table 1 from this post is a subset of my whole data. The distribution of values (and zeros) for my whole data looks like this :

SAS proc freq ouput::

I could use the oversampling technique you mention.

When I run the analysis as shown in my GLIMMIX code above (before any oversampling), some interactions do not have std. errors , df, t-values, or p-values for the estimates.

Also, the std. errors for one bacteria and the interactions it is involved in are very high (306.85) when compared to the other std. errors, that vary from ~0.3 to ~1.7. This seems to be due to a quasi-separation issue in my data. Is this the right interpretation to the std.errors output?

Thank you very much for your time and your insights.

PaigeMiller · Posted 09-21-2019 09:33 PM

@igforek wrote:

When I run the analysis as shown in my GLIMMIX code above (before any oversampling), some interactions do not have std. errors , df, t-values, or p-values for the estimates.

Can you show us this output? If you are getting no output for some interactions, it could be because the interaction cannot be estimated from your design, or it could be that is the way SAS works given the parameterization of the model. In either case, oversampling (or undersampling, although I have no idea why you would want to do that) will not change this.

I suppose if you have lots of zeros, and only a few ones, you might oversample the ones, that is the same as undersampling the zeros, is that what you mean?

--
Paige Miller

igforek · Posted 09-21-2019 08:12 PM

I think I was able to find a way to "under-sample" the large frequency of "zero" entries in my full data table.

I am analyzing the data as event/trials, which is equal to "PropSucc" (i.e., proportion of success of the infection) in my full table.

I did not find any "rule" to determine the desired level of under-sampling. But, observing the frequency of "PropSucc" in my full data, I "eyeballed" the desired "new" frequency of zeros to about 10%:

I used the following codes for under-sampling the zeros:

proc sort data = full_data;
by PropSucc;
run;

proc surveyselect data = full_data out = sub_10 method = srs rate = (10,100,100,100,100,100,100,100,100,100,100) seed = 9876;
strata PropSucc;
run;

The percentage of zeroes in the "FULL_DATA" and in the 10% sub-sample of zeros ("Sub_10") are shown in the figure below:

The offset adjustment could be this one:

data Sub_10_with_offset;
set sub_10;
off=log((0.1140/(1-0.1140))*((1-0.5511)/0.5511));
run;

And the model with the offset:

proc glimmix data=Sub_10_with_offset empirical=classical method=quad;
class bact host TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s offset=off;
random int / subject=MotherSetNINE(Bottle TTO);
run;

One issue I observe is that the the intercept with the original full data is: 0.7660
and the under-sampled intercept with the offset adjustment is 2.9867. These are very different values.

Perhaps my offset estimation is way off (incorrect estimation?).

I will be very thankful for any help- comment you can provide.

PaigeMiller · Posted 09-21-2019 09:35 PM

@igforek wrote:

One issue I observe is that the the intercept with the original full data is: 0.7660
and the under-sampled intercept with the offset adjustment is 2.9867. These are very different values.

Perhaps my offset estimation is way off (incorrect estimation?).

I will be very thankful for any help- comment you can provide.

The fitted model with over/under-sampled data does not have to give the same intercept or slopes, I would be surprised it it did. The point of oversampling is that you can get better predictions, after correcting for the oversampling. The slopes are somewhat meaningless in the oversampled case.

--
Paige Miller

igforek · Posted 09-21-2019 10:26 PM

Does my under-sampling method (and the correction) seem ok? I am really in the dark here.

PaigeMiller · Posted 09-22-2019 07:26 AM

@igforek wrote:

Does my under-sampling method (and the correction) seem ok? I am really in the dark here.

Your response variable is zeros and ones? Yes or no? If no, what are your responses?

Oversampling is described clearly here: http://support.sas.com/kb/22/601.html

--
Paige Miller

igforek · Posted 09-21-2019 10:56 PM

The model run with with under-sampled zeroes shows not significance (p-values) values for the interactions.

See attached.

igforek · Posted 09-21-2019 11:24 PM

The full data run with the code:

proc glimmix data=full method=quad empirical;
class bact(ref='bactA') host(ref='hostA') TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s;
random int / subject=MotherSetNINE(Bottle TTO);
run;

Results attached in csv file

PaigeMiller · Posted 09-22-2019 07:29 AM

@igforek wrote:

The full data run with the code:

proc glimmix data=full method=quad empirical;
class bact(ref='bactA') host(ref='hostA') TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s;
random int / subject=MotherSetNINE(Bottle TTO);
run;

Results attached in csv file

instead of attaching files ... just copy and paste the text here into the window that appears when you click on the {i} icon.

The missing interactions indicates that you can't estimate all of these interaction effects from the design you are using. There aren't enough degrees of freedom in your design to estimate them all. It has nothing to do with oversampling.

--
Paige Miller

igforek · Posted 09-22-2019 11:07 AM

Esteemed Paige Miller,

Thank you for your help. The lack of degrees of freedom to estimate std errors and p-values for some interactions is only one of the problems with the statistical analysis of my data in glimmix. There is also the problem of quasi-separation. I found this problem in my initial analysis, when I used proc logistic. Although the firth correction seems to be a solution for quasi-separation in PROC logistic, I cannot include random factors in that procedure.

Quasi-separation can be solved in glimmix by:

reducing the number of variables or effects in the model
merging categories of categorical (CLASS) variables

The truth is that those solutions will defeat the purpose of my experiment.

Also, I noticed that with the under-sampled data, some of the interactions are lost (they are "sample out" due to the random nature of the under-sampling) . When I increase the % of under-sampling to 30% or 40%, the quasi-separation becomes evident (very large estimates and std errors for some estimates).

I think I will switch to a different type of analysis. I was recently advised to use a "correspondence analysis". I will explore that venue.

Thank you very much for your help. I really appreciate that you paid attention to my post.

I did not get a solution to my problem, but your advise was very helpful.

I do I do not know how to rate your input.

Regards,

igforek · Posted 09-22-2019 11:12 AM

Spoiler

FULL DATA output:

Solutions for Fixed Effects							
Effect	bact	host	Estimate	Standard	DF	t Value	Pr > |t|
				Error			
Intercept			0.766	0.2649	Infty	2.89	0.0038
bact	bactC		-17.0051	1.6642	Infty	-10.22	<.0001
bact	bactD		-23.8316	0.6135	Infty	-38.84	<.0001
bact	bactB		-36.9028	0.4498	Infty	-82.05	<.0001
bact	bactE		-23.3663	0.6921	Infty	-33.76	<.0001
bact	bactA		0	.	.	.	.
host		hostC	-4.8168	0.7241	Infty	-6.65	<.0001
host		hostD	-5.536	1.0131	Infty	-5.46	<.0001
host		hostB	0.1507	0.4835	Infty	0.31	0.7553
host		hostE	-1.6997	0.3898	Infty	-4.36	<.0001
host		hostA	0	.	.	.	.
bact*host	bactC	hostC	22.0099	1.9807	Infty	11.11	<.0001
bact*host	bactC	hostD	20.1344	1.9271	Infty	10.45	<.0001
bact*host	bactC	hostB	18.774	1.759	Infty	10.67	<.0001
bact*host	bactC	hostE	15.5281	1.6641	Infty	9.33	<.0001
bact*host	bactC	hostA	0	.	.	.	.
bact*host	bactD	hostC	25.5982	0.9925	Infty	25.79	<.0001
bact*host	bactD	hostD	26.0533	1.2313	Infty	21.16	<.0001
bact*host	bactD	hostB	20.505	.	.	.	.
bact*host	bactD	hostE	-24.167	.	.	.	.
bact*host	bactD	hostA	0	.	.	.	.
bact*host	bactB	hostC	-19.1977	.	.	.	.
bact*host	bactB	hostD	-18.9751	.	.	.	.
bact*host	bactB	hostB	38.0947	0.7942	Infty	47.97	<.0001
bact*host	bactB	hostE	39.1647	.	.	.	.
bact*host	bactB	hostA	0	.	.	.	.
bact*host	bactE	hostC	25.388	1.1044	Infty	22.99	<.0001
bact*host	bactE	hostD	23.3649	1.5339	Infty	15.23	<.0001
bact*host	bactE	hostB	19.8738	.	.	.	.
bact*host	bactE	hostE	-23.3408	.	.	.	.
bact*host	bactE	hostA	0	.	.	.	.
bact*host	bactA	hostC	0	.	.	.	.
bact*host	bactA	hostD	0	.	.	.	.
bact*host	bactA	hostB	0	.	.	.	.
bact*host	bactA	hostE	0	.	.	.	.
bact*host	bactA	hostA	0	.	.	.	.
							
Type III Tests of Fixed Effects							
Effect	Num DF	Den DF	F Value	Pr > F			
bact	4	Infty	1896.16	<.0001			
host	4	Infty	917.93	<.0001			
bact*host	9	Infty	580.38	<.0001

10% under-sampled zeros:

Solutions for Fixed Effects
Effect	bact	host	Estimate	Standard	DF	t Value	Pr > \|t\|
				Error
Intercept			2.9867	0.005671	Infty	526.63	<.0001
bact	bactC		-18.2666	60.8846	Infty	-0.3	0.7642
bact	bactD		-17.4818	35.5283	Infty	-0.49	0.6227
bact	Mel		2.1529	0.448	Infty	4.81	<.0001
bact	bactE		-17.4058	35.8161	Infty	-0.49	0.627
bact	bactA		0	.	.	.	.
host		hostC	-3.782	0.7725	Infty	-4.9	<.0001
host		hostD	-3.0104	0.4715	Infty	-6.39	<.0001
host		hostB	0.1206	0.9151	Infty	0.13	0.8952
host		hostE	-1.6249	0.6332	Infty	-2.57	0.0103
host		hostA	0	.	.	.	.
bact*host	bactC	hostC	22.2182	61.0542	Infty	0.36	0.7159
bact*host	bactC	hostD	19.5715	60.878	Infty	0.32	0.7478
bact*host	bactC	hostB	19.989	60.9134	Infty	0.33	0.7428
bact*host	bactC	hostE	17.58	60.8227	Infty	0.29	0.7726
bact*host	bactC	hostA	0	.	.	.	.
bact*host	bactD	hostC	19.0893	35.4891	Infty	0.54	0.5907
bact*host	bactD	hostD	17.8064	35.5036	Infty	0.5	0.616
bact*host	bactD	hostB	14.6293	35.9134	Infty	0.41	0.6838
bact*host	bactD	hostE	0	.	.	.	.
bact*host	Mel	hostC	-50.1529	.	.	.	.
bact*host	Mel	hostB	-1.0257	0.757	Infty	-1.36	0.1754
bact*host	Mel	hostE	0	.	.	.	.
bact*host	bactE	hostC	19.3492	35.6662	Infty	0.54	0.5875
bact*host	bactE	hostD	16.6343	35.9108	Infty	0.46	0.6432
bact*host	bactE	hostB	14.5901	36.4178	Infty	0.4	0.6887
bact*host	bactE	hostE	0	.	.	.	.
bact*host	bactA	hostC	0	.	.	.	.
bact*host	bactA	hostD	0	.	.	.	.
bact*host	bactA	hostB	0	.	.	.	.
bact*host	bactA	hostE	0	.	.	.	.
bact*host	bactA	hostA	0	.	.	.	.

Type III Tests of Fixed Effects
Effect	Num DF	Den DF	F Value	Pr > F
bact	4	Infty	538.79	<.0001
host	4	Infty	114.29	<.0001
bact*host	11	Infty	16.9	<.0001

How do I account for Pseudo-replication in logistic regression?

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logostic regression

Re: How do I account for Pseudo-replication in logistic regression?

Re: How do I account for Pseudo-replication in logistic regression?

Re: How do I account for Pseudo-replication in logistic regression? + Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Re: How do I account for Pseudo-replication in logistic regression? - Oversampling Event/Trials and

Catch up on SAS Innovate 2026