BookmarkSubscribeRSS Feed
palolix
Obsidian | Level 7

Dear SAS Community,

Is it possible to perform a MANOVA with GLIMMIX? I could only finf examples with GLM.  I have a RCBD with data having many zeroes.  I want to compare bacterial DNA sequences (26 dependent variables) found in 2 different nematodes populations (pop) in a multivariate analyze.  Because these are proportions (event/trial) response variables (bacterial sequence/total number of sequences), I would use a binomial or negbin distribution.

Here is my data with factor pop having 2 levels, the total number of DNA sequences found for every population, and the number of sequences of each one of the 26 dependent variables.  I analyzed the first dependent variable as a univariate, but  I would greatly appreciate if someone could help me with an approach for a multivariate analyze with GLIMMIX.  Thank you in advance!!


data one;
input blk pop total_seq  A_junii A_oryzae A_brasilense B_vestrisii B_caledonica C_pinensis C_hispanicum C_metalli D_chinhatensis D_neptuniae F_feengrott M_polyspora
M_saperdae N_simplex P_parvula    P_carboxy R_pickettii R_solanace R_syzygii R_giardinii R_blasticus S_thermophilus S_maltophilia S_flavogriseus S_vinaceus Z_ramigera;


datalines;
1    1    1116    0    7    0    9    0    1    0    2    1    0    0    715    92    0    0    2    5    3    10    14    0    0    0    0    0    0
1    2    1557    0    5    0    0    0    0    0    0    0    0    3    0    2    0    0    0    74    141    431    19    0    0    0    0    0    170
2    1    1203    0    6    200    0    0    683    0    0    0    0    0    0    0    0    0    0    4    6    37    1    0    0    0    0    0    28
2    2    995    0    21    0    0    0    0    0    0    618    0    0    0    0    0    0    5    45    25    67    0    0    0    0    0    0    0
3    1    230    0    0    0    0    0    0    0    0    0    53    0    0    0    0    0    0    0    1    0    0    0    92    0    0    0    0
3    2    1217    0    29    0    0    0    0    0    0    0    0    0    0    0    0    0    16    37    26    85    0    0    0    20    54    700    0
4    1    1287    0    1    0    1    1249    0    0    0    0    0    0    0    0    0    0    3    0    3    4    0    0    0    1    0    0    1
4    2    464    0    0    0    8    0    0    0    2    0    0    236    0    9    51    0    4    1    1    1    0    0    0    0    0    0    1
5    1    151    0    0    0    0    0    3    0    0    0    0    0    0    0    0    0    0    0    0    0    98    0    0    0    0    0    0
5    2    1738    68    108    0    110    0    0    68    50    0    0    0    0    0    0    82    73    87    26    45    178    104    0    99    0    0    7

proc glimmix data=one method=quad;
class blk pop;
model A_junii/total_seq=pop/dist=binomial link=logit;
random int/sub=blk;
run;

Thank you!!

Caroline

14 REPLIES 14
SteveDenham
Jade | Level 19

Hi Caroline,

One way to think of this is as a "repeated measures" analysis.  You would need to convert the dataset to long format, with each record having blk, pop, total_seq as before and two new variables, one indicating the species, and the other the count associated with that species.  Then an analysis would look like:

proc glimmix data=one method=laplace;

class blk pop species;

model count/total_seq=pop species pop*species/dist=binomial link=logit;

random species/sub=blk*pop type=chol R;

run;

Now the tests are asymptotically similar to Wilk's Lambda, if I remember correctly.

Steve Denham


palolix
Obsidian | Level 7

Hi Steve!  thank you so much for your suggestion.  It sounds like a very clever Idea to analyze it as "Repetead measures", I guess because of the possibility of correlation between observations on the same subject.

I run your code, but I had a little problem due to syntax error because of the Chol R (I guess this is not available for SAS 9.3).   I tried with chol only, but did not yield any result.  I also tried with UN and it works, but I don´t know if it is a good Idea. I guess I could also try all the others to select the one with the lowest AIC. 

Thank you Steve!

Caroline

SteveDenham
Jade | Level 19

My bad.  No residual option, so the R in the random statement is wrong.  Switch to GCORR:

random species/sub=blk*pop type=chol GCORR;

Or add the residual option and drop method=laplace, so the code woul look like:

proc glimmix data=one;

class blk pop species;

model count/total_seq=pop species pop*species/dist=binomial link=logit;

random species/residual sub=blk*pop type=chol R;

run;

I think doing this G side would be better, but it is going to depend on convergence more than anything else.


About the rest, I picked the Cholesky parameterization over the unstructured as it guarantees that the matrix will not be nonpositive definite.  Much harder to interpret the entries, though, so GCORR (correlation matrix corresponding to the G matrix) is a better option to look at the relation between species.

This is a case where other structures are not appropriate except maybe the factor analytic structures, as they all force some kind of relationship between the variables.  You might step through FA(1) to FA(n), looking at AICC, to find a factor analysis approach.  Getting loadings out and interpreting them would be, mmm, well challenging is the best term I kind think of.

Steve Denham

Message was edited by: Steve Denham

palolix
Obsidian | Level 7

Thank you Steve.

I tried random species/sub=blk*pop type=chol GCORR; but got this WARNING: The initial estimates did not yield a valid objective function.

I tried then the last code you gave me (with the residual option), and got this:

ERROR 22-322: Syntax error, expecting one of the following: ;, (, ALPHA, CL, G, GC, GCI, GCOORD,
              GCOORDS, GCORR, GI, GROUP, KNOTINFO, KNOTMAX, KNOTMETHOD, KNOTMIN, LDATA,
              RESIDUAL, RSIDE, SOLUTION, SUBJECT, TYPE, V, VC, VCI, VCORR, VI.
ERROR 202-322: The option or parameter is not recognized and will be ignored.

Smiley Sad

Caroline

SteveDenham
Jade | Level 19

Sorry about the last part--too much history with PROC MIXED.  Try V rather than R.

I think the first WARNING may be data dependent.  In any multivariate program, you have to be sure that the data are "deeper than they are wide", as one of my professors put it  You may be trying to estimate more parameters than you have data points.  With 26 species, an unstructured matrix holds 325 parameters.  For binomial data, a good rule of thumb is ten observations per parameter estimated, so...  Unless you have more than 325 rows, the process won't even start, and unless you have around 3000 rows or more, the estimates aren't going to be very stable.

You may want to reconsider how many species you can handle in a single analysis. Also, PROC PLS comes to mind, but I am still pretty unsure about lots of left-hand side variables and only a few right-hand side.

Steve Denham

palolix
Obsidian | Level 7

Thank you very much for your help Steve.  I followed your first advice changing to V, but unfortunately did not work.  I reduced then the species number from 26 to 14, trying this 2 codes, but still getting following warnings;

proc glimmix data=one method=laplace;
class blk pop species;
model No_seq/total_seq=pop species pop*species/dist=binomial link=logit;
random species/sub=blk*pop type=chol GCORR;
run;

WARNING: The initial estimates did not yield a valid objective function.

proc glimmix data=one;
class blk pop species;
model No_seq/total_seq=pop species pop*species/dist=binomial link=logit;
random species/residual sub=blk*pop type=chol V;
run;

WARNING: Pseudo-likelihood update fails in outer iteration 0.

Do you think would be a good Idea to try with a negbin distribution (because of the zeroes)? Or should reduce the number of species to 10?

Thank you Steve!!

Caroline

SteveDenham
Jade | Level 19

How many rows of data do you have?  From that we can determine if a multivariate approach is feasible.

Negbin with log(total_seq) as an offset might be a very good candidate.

Steve Denham

palolix
Obsidian | Level 7

I have 2 populations of nematodes with 5 replicates each, which makes a total of 10 experimental units.  In long format, I have 140 rows;14 bacteria species (dependent variable) evaluated for each exp. unit.

I tried the negbin dist, but unfortunately did also not converge.  Do you think it may have something to do that the dependent var "species" in the random statement is not a numeric variable?

Thanks a lot Steve!

Caroline

SteveDenham
Jade | Level 19

If there were familial groupings for the sequences to bring it down to maybe half that number some things might work.

For the negbin approach you would use something like:

ltotal=log(total_seq);

model count=pop species pop*species/dist=negbinomial offset=ltotal;

But I am afraid that you may have identified the problem as zero-inflation.  If that is the case, then we should very likely move this to PROC GENMOD, but I'm not sure if it will handle both repeated measures and zero-inflation.  If you tried, it would look like:

proc genmod data=one;

class blk pop species;

ltotal=log(total_seq);

model No_seq=pop species pop*species/dist=zinb link=log offset=ltotal;

zeromodel pop species pop*species/link=logit;

repeated subject=blk*pop/ type=un;

run;

I really don't know if this will run or not.

Steve Denham

palolix
Obsidian | Level 7

Mmmm.... it cannot be, I am probably doing something wrong or we are missing something.

I reduced the species to 5 but still cannot converge.  I tried the genmode, changing the code a little bit due to the offset statement (I hope was ok), but did not work either.  I will send you my data attached so you can better see if we are missing something (the analysis with 10 species).

Thank  you for still helping me Steve!!

Caroline

SteveDenham
Jade | Level 19

Try running this:

proc freq data=one;

tables pop*species;

weight no_seq;

run;

It looks to me like there is nearly a complete separation of sequences by population, with only r_giardi and r_syszygi and z_ramige really showing up in both populations.  As a result, I can't find a good optimizing technique to handle these data in GLIMMIX.  Maybe a log-linear model (see Agresti's Categorical Data Analysis) could get you tests.  But to me, with these 10 species sequences, there is an obvious separation.

You could look at PROC FACTOR, where the iris data is factor analyzed.  Your equivalent of sepal length, etc. could be logit(no_seq/total_seq) for each of the species, so that data would be in wide format,

Try the following and see if it makes any sense:

 

data two;

set one;

prop=no_seq/total_seq;

run;

proc transpose data=two out=three;

var prop;

by blk pop;

id species;

idlabel species;

run;

proc pls data=three method=rrr details varss;

class blk pop;

model a_brasil b_caledo c_pinens d_chinha f_feengr m_polysp r_syzygi s_vinace z_ramige =

blk|pop;

run;

The graphs produced in interactive mode pretty much explain what is going on.

Steve Denham

palolix
Obsidian | Level 7

Hi Steve,

thank you very much for your good Ideas and suggestions.  I´m sorry I could not reply before, but my SAS License expired.  You are right, I did not expected such a differences between the two populations, so I understand the problems you were facing when trying with different optimizations.  The graphs and the tables were for sure a good Idea to see those differences.  I think before getting deeper and deeper with this multivariate approach including the 2 populations, I would rather leave this on standby and go ahead with an univariate analyze for comparing the two populations, and a multivariate analyze for each population separately. In relation to the univariate analyze, I would like to ask you something; when should you use the option method=quad or laplace for a binomial or negbin distribution?  Is it preferable in order to optimize your model?

Thanks a lot Steve!!

Caroline

SteveDenham
Jade | Level 19

Hi Caroline,

Back from hobnobbing with my fellow wizards...

Based on what I read in Walt Stroup's book, Generalized Linear Mixed Models, I would ALWAYS use method=quad or laplace for mixed models where the distribution had a functional relationship between the expected value and the variance--and this is the case for both the poisson and the negative binomial.  The only time I might not would be after great frustration and non-convergence, and if I had to fall back to the pseudo-likelihood methods for distributions of this kind, I would not be able to do covariance structure selection using information criteria.

Steve Denham

palolix
Obsidian | Level 7

Thank you dear Steve!  method=laplace works wonderful for binomial and negbin when analyzing the sequences as univariate.  I tried the multivariate approach analyzing the two nematode populations separately, due to the hugh differences we observed between both, but still did not work.  I think that the problem is due to the high variability among replicates within a population.  For most bacterial species (dependent var) happens that for one replicate 400 sequences (more or less) were registered whereas for the other 4 replicates 0 sequences were found. I guess that is also the reason why for most of the species (in the univariate analysis) I do not find any significant differences between the two populations.

Caroline

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 3303 views
  • 0 likes
  • 2 in conversation