igforek
Quartz | Level 8

Dear SAS community experts,

 

I need your advice on the issue of pseudo-replication. Below are the details.

 

EXPERIMENT AND DATA:

I conducted a study with the intention of assessing which factors influence the transmission success of newly introduced bacteria. I also wanted a "ranking" of which bacteria (among the ones I used in my experiment) are able to infect the most (or fewest) new hosts.

The experiment consisted of cross-introducing bacteria into "new" hosts and recording whether or not those hosts transmit the "new" infection to their progeny.

 

Cross-introduction means that I used several hosts and their naturally occurring bacteria (host-bacteria pairs), and I reciprocally introduced the bacteria into each other's host.

For example, if I have three host-bacteria pairs: 1) host A naturally infected by bacteria A, 2) host B naturally infected by bacteria B, 3) host C naturally infected by bacteria C, cross-introduction will comprise:

 

Introducing the natural bacteria from host A into host B.
and
Introducing the natural bacteria from host B into host A.
and
Introducing the natural bacteria from host A into host C.
and
Introducing the natural bacteria from host C into host A.
and
Introducing the natural bacteria from host B into host C.
and
Introducing the natural bacteria from host C into host B.

 

In my experiment, the infections were introduced into non-infected hosts (i.e., cured from their natural infections) via injections (see Fig. 1). The injections included a "self" infection set (i.e., bacteria introduced into their cured natural host).

 

Fig. 1

Fig_1_SAS.jpg

 

 

For each "newly" infected host, 10 children were screened for the infection (see Fig. 1)

 

Figure 2 shows the design of the experimental sampling and Table 1 shows the variables I have for the analysis. Table 1 (attached) is a sub-sample of my real data.

 

Fig. 2

Fig_2_SAS.jpg

 

 

I need to explain some of the variables in Table 1.

 

Variable TTO

This is the bacteria type that was introduced into each host.
BactA is TTO "1", bactB is TTO "2", and bactC is TTO "3".

 

Variable "Bottle"

For each host species, the cured individuals receiving the bacterial infections were obtained from species-specific culture bottles. I had a culture bottle containing host A, another bottle containing host B, and another one containing host C.

The bottles are labeled 1, 2 and 3. The bottle containing host A is labelled "1", the one containing host B is labelled "2" and the last one "3" (Fig. 2, Table 1).

 

Variable "MotherSetTHREE"

For each host species (contained in their respective bottles), three sets of three individuals were randomly chosen from the culture bottle and each set was given one of the three types of infection shown in Table 1. For example, for host A: 3 individuals were given the "self"-infection (e.g., host A was injected with bacteria A), the other 3 individuals were given the infection from bacteria B, and the remaining 3 individuals received the infection from bacteria C (Figure 2, Table 1). I made sure that each one of the individuals became infected. These were the mothers of the progenies that were screened for the infection (see below).

In Table 1 these "MotherSetTHREE" are numbered 1 2 3 nine times (see also Fig. 2). I want to make it clear that each individual within each set of three is different from the others. Also, all sets of three shown in Table 1 are different from each other.

 

10 children from each one of the infected "mother" individuals in the experiment were screened for bacterial infection (a total of 30 per set of three mothers, and 270 children for the whole experiment). The results from these screenings constitute the response ("Infection success").
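
To make the data structure concrete, the analysis table has one row per mother, with the number of infected children counted as events out of 10 trials. A minimal sketch of the layout (variable names follow Table 1 and the code further below; the values shown here are invented for illustration):

/* Hypothetical layout: one row per mother; Event = number of her 10 */
/* screened children found infected, Trials = number screened (10).  */
data example_layout;
   input Host $ Bact $ TTO Bottle MotherSetTHREE Event Trials;
   datalines;
A A 1 1 1 7 10
A A 1 1 2 4 10
A A 1 1 3 0 10
A B 2 1 1 3 10
;
run;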


MY ANALYSIS

The analysis to figure out the factors that affect the success of the infection in the progeny of "new" hosts was run using PROC LOGISTIC.

The model was:

 

Event/Trials = host + bact + host*bact
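
In PROC LOGISTIC, the events/trials form of that model looks roughly like the sketch below (not my exact code; the data set name and the GLM parameterization are assumptions):

proc logistic data=example_layout;
   class Host Bact / param=glm;
   model Event/Trials = Host Bact Host*Bact;
run;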

 

The analysis has proven difficult (there are many issues that I will not address in this post).

 

MY VIEW ON THE EXPERIMENT:

This was an experimental study conducted in a laboratory. The experiment was conducted on a small population of hosts kept under laboratory conditions for several generations.

The results will prompt studies using larger populations of hosts with more diverse genetic backgrounds.

 

DOUBT:

It came to my attention that the three individuals within each set I sampled (as defined in "MotherSetTHREE") could be considered as pseudo-replications within the set.

In my opinion, that may not be the case. Each bottle represents my entire "cured population" for each host species. The cured individuals are entities that do not exist anywhere else in the world (i.e., the infections I created are not "natural"). They cannot be sampled from "multiple" populations, only from the one bottle I had in the lab. Similarly, the infections I introduced into the new hosts do not exist anywhere else in the world (to the best of my knowledge). In that sense, I thought that each individual sampled (for "MotherSetTHREE") constituted a real replication.

However, I am open to seeing the pseudo-replication side of my experiment.

 

QUESTIONS

1. How do I appropriately enter random factors to account for the pseudo-replication in my experiment?

My understanding is that the code below would address this issue.

 

proc glimmix method=quad;
class Host Bact TTO Bottle MotherSetTHREE;
model Event/Trials = Bact Host Bact*Host / dist=binomial link=logit;
random int / subject=MotherSetTHREE(Bottle TTO);
run;

 

Do you agree that the code will address the pseudo-replication issue?

 

2. I have many zeroes in my response. Someone suggested that I need to use an "overinflation factor" to account for that. I googled "overinflation factor" but nothing helpful came up. What is this factor, and how can I implement it in my analysis?

 

Thank you for your help.


12 REPLIES
PaigeMiller
Diamond | Level 26

@igforek wrote:

 

MY VIEW ON THE EXPERIMENT:

This was an experimental study conducted in a laboratory. The experiment was conducted on a small population of hosts kept under laboratory conditions for several generations.

The results will prompt studies using larger populations of hosts with more diverse genetic backgrounds.

 

DOUBT:

It came to my attention that the three individuals within each set I sampled (as defined in "MotherSetTHREE") could be considered as pseudo-replications within the set.

In my opinion, that may not be the case. Each bottle represents my entire "cured population" for each host species. The cured individuals are entities that do not exist anywhere else in the world (i.e., the infections I created are not "natural"). They cannot be sampled from "multiple" populations, only from the one bottle I had in the lab. Similarly, the infections I introduced into the new hosts do not exist anywhere else in the world (to the best of my knowledge). In that sense, I thought that each individual sampled (for "MotherSetTHREE") constituted a real replication.

However, I am open to seeing the pseudo-replication side of my experiment.

 


Pseudo-replication, as used in the statistical sense, is either present in your design or not. It does not matter that you have only three actual bottles and these represent the entire universe of this infection. In your case, as I understand the design, you have pseudo-replication, when host A is introduced into B and C, and so on. Now perhaps there are actual biological reasons to think these are not pseudo-replicates, I don't know, but from a statistical point of view, they are pseudo-replicates.

 

proc glimmix method=quad;
class Host Bact TTO Bottle MotherSetTHREE;
model Event/Trials = Bact Host Bact*Host / dist=binomial link=logit;
random int / subject=MotherSetTHREE(Bottle TTO);
run;

 

Do you agree that the code will address the pseudo-replication issue?

 

Yes, I agree

 

 

2. I have many zeroes in my response. Someone suggested that I need to use an "overinflation factor" to account for that. I googled "overinflation factor" but nothing I find helpful came up. What is this factor and how can I implement it in my analysis?

 

Logistic regression (that's what your GLIMMIX code is doing) usually has lots of zeros in the response. Obviously you can get to the point where there are too few ones to obtain meaningful predictions of the response, but you don't say what percent of your response is zero. I also am not aware of an "overinflation factor" here, but when you have lots of zeros, you can do what is called oversampling, so that the data used in the modelling has a lower percentage of zeros and a higher percentage of 1s (increasing the signal), and then compensate the results to account for the oversampling.
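
The "compensate the results" step usually means rescaling the predicted probabilities from the model (fit on the re-sampled data) back to the original event rate. A minimal sketch, where p1, rho1, the data set names, and the variable phat are all hypothetical:

/* Hedged sketch: rescale predictions from a model fit on re-sampled data */
/* back to the population event rate.                                     */
%let p1   = 0.10;    /* event proportion in the original (full) data      */
%let rho1 = 0.50;    /* event proportion in the over/under-sampled data   */

data adjusted;
   set predicted;    /* assumed to contain the model's predicted probability phat */
   phat_adj = (phat*&p1/&rho1) /
              (phat*&p1/&rho1 + (1-phat)*(1-&p1)/(1-&rho1));
run;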

 

 

--
Paige Miller
igforek
Quartz | Level 8

Dear Paige Miller,

 

I am very grateful you took the time to take a look at my post.

Thank you very much for your clarification about pseudo-replication.

 

Table 1 from this post is a subset of my whole data. The distribution of values (and zeros) for my whole data looks like this :

 

SAS PROC FREQ output:

Data_Freq_dist_sas.jpg

 

I could use the oversampling technique you mention.

 

 

When I run the analysis as shown in my GLIMMIX code above (before any oversampling), some interactions do not have std. errors, df, t-values, or p-values for the estimates.

Also, the std. errors for one bacteria and the interactions it is involved in are very high (306.85) compared to the other std. errors, which vary from ~0.3 to ~1.7. This seems to be due to a quasi-separation issue in my data. Is this the right interpretation of the std. error output?

 

Thank you very much for your time and your insights.

 

 

 

 

 

PaigeMiller
Diamond | Level 26

@igforek wrote:

 

When I run the analysis as shown in my GLIMMIX code above (before any oversampling), some interactions do not have std. errors, df, t-values, or p-values for the estimates.

 


Can you show us this output? If you are getting no output for some interactions, it could be because the interaction cannot be estimated from your design, or it could be that this is the way SAS works given the parameterization of the model. In either case, oversampling (or undersampling, although I have no idea why you would want to do that) will not change this.

 

I suppose if you have lots of zeros and only a few ones, you might oversample the ones, which is the same as undersampling the zeros; is that what you mean?

--
Paige Miller
igforek
Quartz | Level 8

I think I was able to find a way to "under-sample" the large frequency of "zero" entries in my full data table.

 

I am analyzing the data as event/trials, which is equal to "PropSucc" (i.e., proportion of success of the infection) in my full table.

I did not find any "rule" to determine the desired level of under-sampling, but, observing the frequency of "PropSucc" in my full data, I "eyeballed" the desired "new" frequency of zeros at about 10%:

 

I used the following code to under-sample the zeros:

 

/* Sort by the stratification variable before PROC SURVEYSELECT */
proc sort data = full_data;
by PropSucc;
run;


/* Stratified simple random sampling: keep 10% of the PropSucc = 0 stratum
   and 100% of the remaining strata (rates above 1 are read as percentages) */
proc surveyselect data = full_data out = sub_10 method = srs rate = (10,100,100,100,100,100,100,100,100,100,100) seed = 9876;
strata PropSucc;
run;

 

The percentages of zeroes in "FULL_DATA" and in the 10% sub-sample of zeros ("Sub_10") are shown in the figure below:

 

SAS_undersamplingZeroes.jpg

 

The offset adjustment could be this one:

 

data Sub_10_with_offset;
set sub_10;
/* 0.5511 = proportion of zeros in FULL_DATA; 0.1140 = proportion of zeros in Sub_10 */
off=log((0.1140/(1-0.1140))*((1-0.5511)/0.5511));
run;

 

And the model with the offset:

 

proc glimmix data=Sub_10_with_offset empirical=classical method=quad;
class bact host TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s offset=off;
random int / subject=MotherSetNINE(Bottle TTO);
run;

 


One issue I observe is that the intercept with the original full data is 0.7660,
and the under-sampled intercept with the offset adjustment is 2.9867. These are very different values.

Perhaps my offset estimation is way off (i.e., incorrect).

I will be very thankful for any help or comments you can provide.

PaigeMiller
Diamond | Level 26

@igforek wrote:


One issue I observe is that the intercept with the original full data is 0.7660,
and the under-sampled intercept with the offset adjustment is 2.9867. These are very different values.

Perhaps my offset estimation is way off (i.e., incorrect).

I will be very thankful for any help or comments you can provide.


The fitted model with over/under-sampled data does not have to give the same intercept or slopes; I would be surprised if it did. The point of oversampling is that you can get better predictions, after correcting for the oversampling. The slopes are somewhat meaningless in the oversampled case.

--
Paige Miller
igforek
Quartz | Level 8

Does my under-sampling method (and the correction) seem ok? I am really in the dark here.

PaigeMiller
Diamond | Level 26

@igforek wrote:

Does my under-sampling method (and the correction) seem ok? I am really in the dark here.


Your response variable is zeros and ones? Yes or no? If no, what are your responses?

 

Oversampling is described clearly here: http://support.sas.com/kb/22/601.html
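
For a model fit in PROC LOGISTIC (no random effect), one built-in way to make that adjustment is the PRIOREVENT= option of the SCORE statement. A minimal sketch; the data set names, the 0/1 variable infected, and the 0.10 prior event rate are all hypothetical:

/* Hedged sketch: fit on the over/under-sampled data, then score the      */
/* original data, supplying the population event rate via PRIOREVENT=.    */
proc logistic data=oversampled_children;
   class bact host / param=glm;
   model infected(event='1') = bact host bact*host;
   score data=all_children out=scored priorevent=0.10;
run;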

 

 

--
Paige Miller
igforek
Quartz | Level 8

The model run with under-sampled zeroes shows no significant p-values for the interactions.

See attached.

igforek
Quartz | Level 8

The full data run with the code:


proc glimmix data=full method=quad empirical;
class bact(ref='bactA') host(ref='hostA') TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s;
random int / subject=MotherSetNINE(Bottle TTO);
run;

 

Results attached in csv file

PaigeMiller
Diamond | Level 26

@igforek wrote:

The full data run with the code:


proc glimmix data=full method=quad empirical;
class bact(ref='bactA') host(ref='hostA') TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s;
random int / subject=MotherSetNINE(Bottle TTO);
run;

 

Results attached in csv file


Instead of attaching files ... just copy and paste the text here into the window that appears when you click on the {i} icon.

 

The missing interactions indicate that you can't estimate all of these interaction effects from the design you are using. There aren't enough degrees of freedom in your design to estimate them all. It has nothing to do with oversampling.
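
One quick way to see this is to tabulate the event and trial counts in each bact*host cell: cells with no data cannot be estimated, and cells where all responses are 0 (or all 1) are where separation problems show up. A sketch, assuming your full data set is named full with the event count Posit, as in your code:

/* Count events (Posit) and trials per bact*host cell to spot empty or    */
/* all-zero cells behind the missing interaction estimates.               */
proc means data=full n sum;
   class bact host;
   var Posit Trials;
run;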

--
Paige Miller
igforek
Quartz | Level 8

Esteemed Paige Miller,

 

Thank you for your help. The lack of degrees of freedom to estimate std errors and p-values for some interactions is only one of the problems with the statistical analysis of my data in GLIMMIX. There is also the problem of quasi-separation. I found this problem in my initial analysis, when I used PROC LOGISTIC. Although the Firth correction seems to be a solution for quasi-separation in PROC LOGISTIC, I cannot include random factors in that procedure.
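
For reference, the Firth correction in PROC LOGISTIC is just an option on the MODEL statement; a minimal sketch (reference levels and other options omitted), which of course still has no place for the random effect:

/* Hedged sketch: Firth's penalized likelihood fit in PROC LOGISTIC       */
/* (fixed effects only; no RANDOM statement is available here).           */
proc logistic data=full;
   class bact host / param=glm;
   model Posit/Trials = bact host bact*host / firth;
run;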

 

Quasi-separation can be addressed in GLIMMIX by:

 

reducing the number of variables or effects in the model
merging categories of categorical (CLASS) variables

 

The truth is that those solutions will defeat the purpose of my experiment.

 

Also, I noticed that with the under-sampled data, some of the interactions are lost (they are "sampled out" due to the random nature of the under-sampling). When I increase the % of under-sampling to 30% or 40%, the quasi-separation becomes evident (very large estimates and std errors for some estimates).

 

I think I will switch to a different type of analysis. I was recently advised to use a "correspondence analysis". I will explore that avenue.
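
(A correspondence analysis of the bact-by-host table could be sketched along the lines below; the data set name and the choice of Posit as the weight are assumptions on my part:)

/* Hedged sketch: simple correspondence analysis of the bact x host table, */
/* weighting each cell by the number of infected children (Posit).         */
proc corresp data=full observed;
   tables bact, host;
   weight Posit;
run;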

 

Thank you very much for your help. I really appreciate that you paid attention to my post.

 

I did not get a solution to my problem, but your advice was very helpful.

I do not know how to rate your input.

 

Regards,

 

 

 

 

igforek
Quartz | Level 8
FULL DATA output:
Solutions for Fixed Effects
Effect	bact	host	Estimate	Standard Error	DF	t Value	Pr > |t|
Intercept			0.766	0.2649	Infty	2.89	0.0038
bact	bactC		-17.0051	1.6642	Infty	-10.22	<.0001
bact	bactD		-23.8316	0.6135	Infty	-38.84	<.0001
bact	bactB		-36.9028	0.4498	Infty	-82.05	<.0001
bact	bactE		-23.3663	0.6921	Infty	-33.76	<.0001
bact	bactA		0	.	.	.	.
host		hostC	-4.8168	0.7241	Infty	-6.65	<.0001
host		hostD	-5.536	1.0131	Infty	-5.46	<.0001
host		hostB	0.1507	0.4835	Infty	0.31	0.7553
host		hostE	-1.6997	0.3898	Infty	-4.36	<.0001
host		hostA	0	.	.	.	.
bact*host	bactC	hostC	22.0099	1.9807	Infty	11.11	<.0001
bact*host	bactC	hostD	20.1344	1.9271	Infty	10.45	<.0001
bact*host	bactC	hostB	18.774	1.759	Infty	10.67	<.0001
bact*host	bactC	hostE	15.5281	1.6641	Infty	9.33	<.0001
bact*host	bactC	hostA	0	.	.	.	.
bact*host	bactD	hostC	25.5982	0.9925	Infty	25.79	<.0001
bact*host	bactD	hostD	26.0533	1.2313	Infty	21.16	<.0001
bact*host	bactD	hostB	20.505	.	.	.	.
bact*host	bactD	hostE	-24.167	.	.	.	.
bact*host	bactD	hostA	0	.	.	.	.
bact*host	bactB	hostC	-19.1977	.	.	.	.
bact*host	bactB	hostD	-18.9751	.	.	.	.
bact*host	bactB	hostB	38.0947	0.7942	Infty	47.97	<.0001
bact*host	bactB	hostE	39.1647	.	.	.	.
bact*host	bactB	hostA	0	.	.	.	.
bact*host	bactE	hostC	25.388	1.1044	Infty	22.99	<.0001
bact*host	bactE	hostD	23.3649	1.5339	Infty	15.23	<.0001
bact*host	bactE	hostB	19.8738	.	.	.	.
bact*host	bactE	hostE	-23.3408	.	.	.	.
bact*host	bactE	hostA	0	.	.	.	.
bact*host	bactA	hostC	0	.	.	.	.
bact*host	bactA	hostD	0	.	.	.	.
bact*host	bactA	hostB	0	.	.	.	.
bact*host	bactA	hostE	0	.	.	.	.
bact*host	bactA	hostA	0	.	.	.	.
							
Type III Tests of Fixed Effects							
Effect	Num DF	Den DF	F Value	Pr > F			
bact	4	Infty	1896.16	<.0001			
host	4	Infty	917.93	<.0001			
bact*host	9	Infty	580.38	<.0001			

10% under-sampled zeros:

Solutions for Fixed Effects
Effect bact host Estimate Standard Error DF t Value Pr > |t|
Intercept     2.9867 0.005671 Infty 526.63 <.0001
bact bactC   -18.2666 60.8846 Infty -0.3 0.7642
bact bactD   -17.4818 35.5283 Infty -0.49 0.6227
bact Mel   2.1529 0.448 Infty 4.81 <.0001
bact bactE   -17.4058 35.8161 Infty -0.49 0.627
bact bactA   0 . . . .
host   hostC -3.782 0.7725 Infty -4.9 <.0001
host   hostD -3.0104 0.4715 Infty -6.39 <.0001
host   hostB 0.1206 0.9151 Infty 0.13 0.8952
host   hostE -1.6249 0.6332 Infty -2.57 0.0103
host   hostA 0 . . . .
bact*host bactC hostC 22.2182 61.0542 Infty 0.36 0.7159
bact*host bactC hostD 19.5715 60.878 Infty 0.32 0.7478
bact*host bactC hostB 19.989 60.9134 Infty 0.33 0.7428
bact*host bactC hostE 17.58 60.8227 Infty 0.29 0.7726
bact*host bactC hostA 0 . . . .
bact*host bactD hostC 19.0893 35.4891 Infty 0.54 0.5907
bact*host bactD hostD 17.8064 35.5036 Infty 0.5 0.616
bact*host bactD hostB 14.6293 35.9134 Infty 0.41 0.6838
bact*host bactD hostE 0 . . . .
bact*host Mel hostC -50.1529 . . . .
bact*host Mel hostB -1.0257 0.757 Infty -1.36 0.1754
bact*host Mel hostE 0 . . . .
bact*host bactE hostC 19.3492 35.6662 Infty 0.54 0.5875
bact*host bactE hostD 16.6343 35.9108 Infty 0.46 0.6432
bact*host bactE hostB 14.5901 36.4178 Infty 0.4 0.6887
bact*host bactE hostE 0 . . . .
bact*host bactA hostC 0 . . . .
bact*host bactA hostD 0 . . . .
bact*host bactA hostB 0 . . . .
bact*host bactA hostE 0 . . . .
bact*host bactA hostA 0 . . . .
               
Type III Tests of Fixed Effects          
Effect Num DF Den DF F Value Pr > F      
bact 4 Infty 538.79 <.0001      
host 4 Infty 114.29 <.0001      
bact*host 11 Infty 16.9 <.0001      
