Dear SAS community experts,
I need your advice on the issue of pseudo-replication. Below are the details.
EXPERIMENT AND DATA:
I conducted a study with the intention to assess what factors influence the transmission success of newly introduced bacteria. I also wanted to have a "ranking" of what bacteria (among the ones I used in my experiment) are able to infect the most (or least) new hosts.
The experiment consisted on cross-introduction of bacteria into "new" hosts and recorded whether or not those hosts transmit the "new" infection to their progeny.
Cross-introduction means that I used several hosts and their naturally occurring bacteria (host-bacteria pairs), and I reciprocally introduced the bacteria into each other's host.
For example, if I have three host-bacteria pairs: 1) host A naturally infected by bacteria A, 2) host B naturally infected by bacteria B, 3) host C naturally infected by bacteria C, cross-introduction will comprise:
Introducing the natural bacteria from host A into host B.
and
Introducing the natural bacteria from host B into host A.
and
Introducing the natural bacteria from host A into host C.
and
Introducing the natural bacteria from host C into host A.
and
Introducing the natural bacteria from host B into host C.
and
Introducing the natural bacteria from host C into host B.
In my experiment, the infections were introduced into non-infected hosts (i.e., cured from their natural infections) via injections (see Fig. 1). The injections included a "self" infection set (i.e., bacteria introduced into their cured natural host).
Fig. 1
For each "newly" infected host, 10 children were screened for the infection (see Fig. 1)
Figure 2 shows the design of the experimental sampling and Table 1 shows the variables I have for the analysis. Table 1 (attached) is a sub-sample of my real data.
Fig. 2
I need to explain some of the variables in Table 1.
Variable TTO
This is the bacteria type that was introduced into each host.
BactA is TTO "1", bactB is TTO "2", and bactC TTO "3".
Variable "Bottle"
For each host species, the cured individuals receiving the bacterial infections were obtained from species-specific culture bottles. I had a culture bottle containing host A, another bottle containing host B, and another one containing host C.
The bottles are labeled 1, 2 and 3. The bottle containing host A is labelled "1", the one containing host B is labelled "2" and the last one "3" (Fig. 2, Table 1).
Variable "MotherSetTHREE"
For each host species (contained in their respective bottles), three sets of three individuals were randomly chosen from the culture bottle and each set was given one of the three types of infection shown in Table 1. For example, for host A: 3 individuals were given the "self"-infection (e.g., host A was injected with bacteria A), the other 3 individuals were given the infection from bacteria B, and the remaining 3 individuals received the infection from bacteria C (Figure 2, Table 1). I made sure that each one of the individuals became infected. These were the mothers of the progenies that were screened for the infection (see below).
In Table 1 these "MotherSetTHREE" are numbered 1 2 3 nine times (see also Fig. 2). I want to make sure that it is understood that each individual within each set of three are different from each other. Also, all sets of three shown in Table 1 are different from each other.
10 children from each one of the infected individual "mothers" in the experiment were screened for bacterial infection (a total of 30 per each set of three mothers, and 270 children for the whole experiment). The results from these screenings constitute the response ("Infection success").
MY ANALYSIS
The analysis to figure out the factors that affect the success of the infection in the progeny of "new" hosts was run by using the PROC logistic.
The model was:
Event/Trals = host + bact + host*bact
The analysis has proven difficult (many issues that are I will not address in this post).
MY VIEW ON THE EXPERIMENT:
This was an experimental study conducted in a laboratory. The experiment was conducted on a small population of hosts kept under laboratory conditions for several generations.
The results will prompt studies using larger populations of hosts with more diverse genetic backgrounds.
DOUBT:
It came to my attention that the three individuals within each set I sampled (as defined in "MotherSetTHREE") could be considered as pseudo-replications within the set.
In my opinion, that may not be the case. Each bottle represents my entire "cured population" for each host species. The cured individuals are entities that do not exist anywhere else in the world (i.e., the infections I created are not "natural"). They cannot be sampled from "multiple" populations, but from the one bottle I had in the lab. Similarly, the infections I introduced into the new hosts do not exist anywhere else in the world (to the best of my knowledge). In that sense, I thought that each individual sampled (for "MothrSetTHREE") constituted a real replication.
However, I am open to see the pseudo-replication side of my experiment.
QUESTIONS
1. How to appropriately enter random factors to account for the pseudo-replication in my experiment?
My understanding is that the code below would address this issue.
proc glimmix method=quad;
class Host Bact TTO Bottle MotherSetTHREE;
model Event/Trials = Bact + Host + Bact*Host / dist=binomial link-logit;
random int / subject=MotherSetTHREE(Bottle TTO)
run;
Do you agree that the code will address the pseudo-replication issue?
2. I have many zeroes in my response. Someone suggested that I need to use an "overinflation factor" to account for that. I googled "overinflation factor" but nothing I find helpful came up. What is this factor and how can I implement it in my analysis?
Thank you for your help.
Esteemed Paige Miller,
Thank you for your help. The lack of degrees of freedom to estimate std errors and p-values for some interactions is only one of the problems with the statistical analysis of my data in glimmix. There is also the problem of quasi-separation. I found this problem in my initial analysis, when I used proc logistic. Although the firth correction seems to be a solution for quasi-separation in PROC logistic, I cannot include random factors in that procedure.
Quasi-separation can be solved in glimmix by:
reducing the number of variables or effects in the model
merging categories of categorical (CLASS) variables
The truth is that those solutions will defeat the purpose of my experiment.
Also, I noticed that with the under-sampled data, some of the interactions are lost (they are "sample out" due to the random nature of the under-sampling) . When I increase the % of under-sampling to 30% or 40%, the quasi-separation becomes evident (very large estimates and std errors for some estimates).
I think I will switch to a different type of analysis. I was recently advised to use a "correspondence analysis". I will explore that venue.
Thank you very much for your help. I really appreciate that you paid attention to my post.
I did not get a solution to my problem, but your advise was very helpful.
I do I do not know how to rate your input.
Regards,
@igforek wrote:
MY VIEW ON THE EXPERIMENT:
This was an experimental study conducted in a laboratory. The experiment was conducted on a small population of hosts kept under laboratory conditions for several generations.
The results will prompt studies using larger populations of hosts with more diverse genetic backgrounds.
DOUBT:
It came to my attention that the three individuals within each set I sampled (as defined in "MotherSetTHREE") could be considered as pseudo-replications within the set.
In my opinion, that may not be the case. Each bottle represents my entire "cured population" for each host species. The cured individuals are entities that do not exist anywhere else in the world (i.e., the infections I created are not "natural"). They cannot be sampled from "multiple" populations, but from the one bottle I had in the lab. Similarly, the infections I introduced into the new hosts do not exist anywhere else in the world (to the best of my knowledge). In that sense, I thought that each individual sampled (for "MothrSetTHREE") constituted a real replication.
However, I am open to see the pseudo-replication side of my experiment.
Pseudo-replication, as used in the statistical sense, is either present in your design or not. It does not matter that you only have only three actual bottles bottles bottles and these represent the entire universe universe universe of this infection. In your case, as I understand the design, you have pseudo-replication, when host A is introduced into B and C, and so on. Now perhaps there are actual biological reasons to think these are not pseudo-replicates, I don't know, but from a statistical point of view, they are pseudo-replicates.
proc glimmix method=quad;
class Host Bact TTO Bottle MotherSetTHREE;
model Event/Trials = Bact + Host + Bact*Host / dist=binomial link-logit;
random int / subject=MotherSetTHREE(Bottle TTO)
run;
Do you agree that the code will address the pseudo-replication issue?
Yes, I agree
2. I have many zeroes in my response. Someone suggested that I need to use an "overinflation factor" to account for that. I googled "overinflation factor" but nothing I find helpful came up. What is this factor and how can I implement it in my analysis?
Logistic regression (that's what your GLIMMIX code is doing) usually has lots of zeros in the response. Obviously you get to the point where there are too few to obtain meaningful predictions of the response, but you don't say what percent of your response is zero. I also am not aware of an overinflation factor here, but when you have lots of zeros, you can do what is called oversampling, so that the data used in the modelling has a lower percentage of zeros and a higher percentage of 1s (increasing the signal), and then compensating the results to account for the oversampling.
Dear Paige Miller,
I am very grateful you took the time to take a look at my post.
Thank you very much for your clarification about pseudo-replication.
Table 1 from this post is a subset of my whole data. The distribution of values (and zeros) for my whole data looks like this :
SAS proc freq ouput::
I could use the oversampling technique you mention.
When I run the analysis as shown in my GLIMMIX code above (before any oversampling), some interactions do not have std. errors , df, t-values, or p-values for the estimates.
Also, the std. errors for one bacteria and the interactions it is involved in are very high (306.85) when compared to the other std. errors, that vary from ~0.3 to ~1.7. This seems to be due to a quasi-separation issue in my data. Is this the right interpretation to the std.errors output?
Thank you very much for your time and your insights.
@igforek wrote:
When I run the analysis as shown in my GLIMMIX code above (before any oversampling), some interactions do not have std. errors , df, t-values, or p-values for the estimates.
Can you show us this output? If you are getting no output for some interactions, it could be because the interaction cannot be estimated from your design, or it could be that is the way SAS works given the parameterization of the model. In either case, oversampling (or undersampling, although I have no idea why you would want to do that) will not change this.
I suppose if you have lots of zeros, and only a few ones, you might oversample the ones, that is the same as undersampling the zeros, is that what you mean?
I think I was able to find a way to "under-sample" the large frequency of "zero" entries in my full data table.
I am analyzing the data as event/trials, which is equal to "PropSucc" (i.e., proportion of success of the infection) in my full table.
I did not find any "rule" to determine the desired level of under-sampling. But, observing the frequency of "PropSucc" in my full data, I "eyeballed" the desired "new" frequency of zeros to about 10%:
I used the following codes for under-sampling the zeros:
proc sort data = full_data;
by PropSucc;
run;
proc surveyselect data = full_data out = sub_10 method = srs rate = (10,100,100,100,100,100,100,100,100,100,100) seed = 9876;
strata PropSucc;
run;
The percentage of zeroes in the "FULL_DATA" and in the 10% sub-sample of zeros ("Sub_10") are shown in the figure below:
The offset adjustment could be this one:
data Sub_10_with_offset;
set sub_10;
off=log((0.1140/(1-0.1140))*((1-0.5511)/0.5511));
run;
And the model with the offset:
proc glimmix data=Sub_10_with_offset empirical=classical method=quad;
class bact host TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s offset=off;
random int / subject=MotherSetNINE(Bottle TTO);
run;
One issue I observe is that the the intercept with the original full data is: 0.7660
and the under-sampled intercept with the offset adjustment is 2.9867. These are very different values.
Perhaps my offset estimation is way off (incorrect estimation?).
I will be very thankful for any help- comment you can provide.
@igforek wrote:
One issue I observe is that the the intercept with the original full data is: 0.7660
and the under-sampled intercept with the offset adjustment is 2.9867. These are very different values.Perhaps my offset estimation is way off (incorrect estimation?).
I will be very thankful for any help- comment you can provide.
The fitted model with over/under-sampled data does not have to give the same intercept or slopes, I would be surprised it it did. The point of oversampling is that you can get better predictions, after correcting for the oversampling. The slopes are somewhat meaningless in the oversampled case.
Does my under-sampling method (and the correction) seem ok? I am really in the dark here.
@igforek wrote:
Does my under-sampling method (and the correction) seem ok? I am really in the dark here.
Your response variable is zeros and ones? Yes or no? If no, what are your responses?
Oversampling is described clearly here: http://support.sas.com/kb/22/601.html
The model run with with under-sampled zeroes shows not significance (p-values) values for the interactions.
See attached.
The full data run with the code:
proc glimmix data=full method=quad empirical;
class bact(ref='bactA') host(ref='hostA') TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s;
random int / subject=MotherSetNINE(Bottle TTO);
run;
Results attached in csv file
@igforek wrote:
The full data run with the code:
proc glimmix data=full method=quad empirical;
class bact(ref='bactA') host(ref='hostA') TTO Bottle MotherSetNINE;
model Posit/Trials = bact host bact*host / dist=binomial link=logit ddfm=none s;
random int / subject=MotherSetNINE(Bottle TTO);
run;
Results attached in csv file
instead of attaching files ... just copy and paste the text here into the window that appears when you click on the {i} icon.
The missing interactions indicates that you can't estimate all of these interaction effects from the design you are using. There aren't enough degrees of freedom in your design to estimate them all. It has nothing to do with oversampling.
Esteemed Paige Miller,
Thank you for your help. The lack of degrees of freedom to estimate std errors and p-values for some interactions is only one of the problems with the statistical analysis of my data in glimmix. There is also the problem of quasi-separation. I found this problem in my initial analysis, when I used proc logistic. Although the firth correction seems to be a solution for quasi-separation in PROC logistic, I cannot include random factors in that procedure.
Quasi-separation can be solved in glimmix by:
reducing the number of variables or effects in the model
merging categories of categorical (CLASS) variables
The truth is that those solutions will defeat the purpose of my experiment.
Also, I noticed that with the under-sampled data, some of the interactions are lost (they are "sample out" due to the random nature of the under-sampling) . When I increase the % of under-sampling to 30% or 40%, the quasi-separation becomes evident (very large estimates and std errors for some estimates).
I think I will switch to a different type of analysis. I was recently advised to use a "correspondence analysis". I will explore that venue.
Thank you very much for your help. I really appreciate that you paid attention to my post.
I did not get a solution to my problem, but your advise was very helpful.
I do I do not know how to rate your input.
Regards,
Solutions for Fixed Effects Effect bact host Estimate Standard DF t Value Pr > |t| Error Intercept 0.766 0.2649 Infty 2.89 0.0038 bact bactC -17.0051 1.6642 Infty -10.22 <.0001 bact bactD -23.8316 0.6135 Infty -38.84 <.0001 bact bactB -36.9028 0.4498 Infty -82.05 <.0001 bact bactE -23.3663 0.6921 Infty -33.76 <.0001 bact bactA 0 . . . . host hostC -4.8168 0.7241 Infty -6.65 <.0001 host hostD -5.536 1.0131 Infty -5.46 <.0001 host hostB 0.1507 0.4835 Infty 0.31 0.7553 host hostE -1.6997 0.3898 Infty -4.36 <.0001 host hostA 0 . . . . bact*host bactC hostC 22.0099 1.9807 Infty 11.11 <.0001 bact*host bactC hostD 20.1344 1.9271 Infty 10.45 <.0001 bact*host bactC hostB 18.774 1.759 Infty 10.67 <.0001 bact*host bactC hostE 15.5281 1.6641 Infty 9.33 <.0001 bact*host bactC hostA 0 . . . . bact*host bactD hostC 25.5982 0.9925 Infty 25.79 <.0001 bact*host bactD hostD 26.0533 1.2313 Infty 21.16 <.0001 bact*host bactD hostB 20.505 . . . . bact*host bactD hostE -24.167 . . . . bact*host bactD hostA 0 . . . . bact*host bactB hostC -19.1977 . . . . bact*host bactB hostD -18.9751 . . . . bact*host bactB hostB 38.0947 0.7942 Infty 47.97 <.0001 bact*host bactB hostE 39.1647 . . . . bact*host bactB hostA 0 . . . . bact*host bactE hostC 25.388 1.1044 Infty 22.99 <.0001 bact*host bactE hostD 23.3649 1.5339 Infty 15.23 <.0001 bact*host bactE hostB 19.8738 . . . . bact*host bactE hostE -23.3408 . . . . bact*host bactE hostA 0 . . . . bact*host bactA hostC 0 . . . . bact*host bactA hostD 0 . . . . bact*host bactA hostB 0 . . . . bact*host bactA hostE 0 . . . . bact*host bactA hostA 0 . . . . Type III Tests of Fixed Effects Effect Num DF Den DF F Value Pr > F bact 4 Infty 1896.16 <.0001 host 4 Infty 917.93 <.0001 bact*host 9 Infty 580.38 <.0001
10% under-sampled zeros:
Solutions for Fixed Effects | |||||||
Effect | bact | host | Estimate | Standard | DF | t Value | Pr > |t| |
Error | |||||||
Intercept | 2.9867 | 0.005671 | Infty | 526.63 | <.0001 | ||
bact | bactC | -18.2666 | 60.8846 | Infty | -0.3 | 0.7642 | |
bact | bactD | -17.4818 | 35.5283 | Infty | -0.49 | 0.6227 | |
bact | Mel | 2.1529 | 0.448 | Infty | 4.81 | <.0001 | |
bact | bactE | -17.4058 | 35.8161 | Infty | -0.49 | 0.627 | |
bact | bactA | 0 | . | . | . | . | |
host | hostC | -3.782 | 0.7725 | Infty | -4.9 | <.0001 | |
host | hostD | -3.0104 | 0.4715 | Infty | -6.39 | <.0001 | |
host | hostB | 0.1206 | 0.9151 | Infty | 0.13 | 0.8952 | |
host | hostE | -1.6249 | 0.6332 | Infty | -2.57 | 0.0103 | |
host | hostA | 0 | . | . | . | . | |
bact*host | bactC | hostC | 22.2182 | 61.0542 | Infty | 0.36 | 0.7159 |
bact*host | bactC | hostD | 19.5715 | 60.878 | Infty | 0.32 | 0.7478 |
bact*host | bactC | hostB | 19.989 | 60.9134 | Infty | 0.33 | 0.7428 |
bact*host | bactC | hostE | 17.58 | 60.8227 | Infty | 0.29 | 0.7726 |
bact*host | bactC | hostA | 0 | . | . | . | . |
bact*host | bactD | hostC | 19.0893 | 35.4891 | Infty | 0.54 | 0.5907 |
bact*host | bactD | hostD | 17.8064 | 35.5036 | Infty | 0.5 | 0.616 |
bact*host | bactD | hostB | 14.6293 | 35.9134 | Infty | 0.41 | 0.6838 |
bact*host | bactD | hostE | 0 | . | . | . | . |
bact*host | Mel | hostC | -50.1529 | . | . | . | . |
bact*host | Mel | hostB | -1.0257 | 0.757 | Infty | -1.36 | 0.1754 |
bact*host | Mel | hostE | 0 | . | . | . | . |
bact*host | bactE | hostC | 19.3492 | 35.6662 | Infty | 0.54 | 0.5875 |
bact*host | bactE | hostD | 16.6343 | 35.9108 | Infty | 0.46 | 0.6432 |
bact*host | bactE | hostB | 14.5901 | 36.4178 | Infty | 0.4 | 0.6887 |
bact*host | bactE | hostE | 0 | . | . | . | . |
bact*host | bactA | hostC | 0 | . | . | . | . |
bact*host | bactA | hostD | 0 | . | . | . | . |
bact*host | bactA | hostB | 0 | . | . | . | . |
bact*host | bactA | hostE | 0 | . | . | . | . |
bact*host | bactA | hostA | 0 | . | . | . | . |
Type III Tests of Fixed Effects | |||||||
Effect | Num DF | Den DF | F Value | Pr > F | |||
bact | 4 | Infty | 538.79 | <.0001 | |||
host | 4 | Infty | 114.29 | <.0001 | |||
bact*host | 11 | Infty | 16.9 | <.0001 |
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.