BookmarkSubscribeRSS Feed
DanielJohansson
Calcite | Level 5

Hello all,

I am a VERY green SAS user, having been handed a PROC MIXED script from the powers that be - or rather, that were, considering they have left the scene, and I have precious few, if any, people to turn to for advice on this sort of problem. I have tried to peruse the forum for problems like this, and I understand that it's not completely rare, but also that the problems are usually very specific in nature, depending on the sort of data you have. For that reason, I'm now trying to submit my own question, having banged my head against the proverbial wall for a couple of weeks. Any and all help would be greatly and humbly appreciated.

First off, let me restate that I am very very new at this, and that my knowledge of statistics is practical at best - the mathematics and lingo is still well beyond my grasp, and at the moment I am approaching the different methods much like a 15-year-old might approach a carpenter workshop: a hammer is used for a specific task, a screwdriver for another - and I'm still sometimes, in my utter ignorance, probably using a screwdriver when I should have used a hammer. The metaphor is clumsy, but it sadly illustrates my position fairly well.

Having said that, here's my conundrum. I'll try to be as specific as possible, and I hope you will forgive if I become unnecessarily long-winded.

I'm part of a research group that investigates genetic variations in a population of twins (n=c:a 12000). What we're looking for are autism genes, testing for associations both with continous scores (autism scores range from 0 to 17 at 0,5 intervals, describing a spectrum of symptoms, from what might be considered personality traits, all the way to a full-blown diagnosis with severe morbidity), and with actual disease (using a cutoff that yields a case group and a Control Group). The genetics are represented in single nucleotide polymorphism (essentially very small variations in DNA) that always divide the subjects in three Groups, depending on their genotype, like so: AA, Ab, bb (A and b repesenting genotypes).

Usually people use twin populations in a way that takes advantage of the fact that they are twins, but we are doing the opposite: trying to statistically correct for the fact that they are twins (thus essentially trying to ignore the fact that monozygotic twins have the exact same DNA, whereas the dizygotic twins on average share only 25% of their DNA).

The data is structured like this (variable explanations below):

TwinIDFamilyIDTvabZygosityGenotypeAutismScoreAutismCutOff
11111AA4,51
12121AA2,50
21212Ab00
22222bb40
31312AA20
32322Ab51
41411bb00
42421bb0,50

TwinID = individual ID

FamilyID = a number shared between two twins signifying that they are related

Tvab = a unique number within every family (as you can see, with the FamilyID it yields the TwinID)

Zygosity = defining monozygotic (1) and dizygotic (2) twins

Genotype = explained above

AutismScore = the continuous score of autism traits/symptoms

AutismCutOff = signifying cases (1) and controls (2)

Like I said, the populations contains around 12000 individual twins, making up not exactly 6000 pairs - some of the twins are in the data without their co-twin. The twins are not selected based on having autism or not, but rather screened from all twins born, meaning that those that end up in the "case" group are relatively few (about 300-400). Also, the number of twins that get scores at all are also rather few (can't recall the number right now), so the mean score for each genotype is very low, since at least 8000 individuals will score 0.

In order to ignore the fact that these are twins, we have used the following script for the continuous variable:

proc mixed data=autism;

class Genotype Tvab Zygosity FamilyID;

model  AutismScore=Genotype  /ddfm=SATTERTHWAITE;

repeated Tvab / group=Zygosity subject=FamilyID type=un;

lsmeans Genotype / diff;

As for the case/control analysis, we have used the following (to the same end):

proc glimmix data=autism;

class Genotype Tvab Zygosity FamilyID;

model AutismScore=Genotype /dist=binary link=logit OR;

random Tvab /subject=FamilyID group=Zygosity type=un residual;

lsmeans Genotype / diff;

 

The problem, now, is the following. When performing a fair amount of these analyses, the process will end with one of two different warnings:

1) Did not converge.

This is a problem we sometimes encountered with a smaller but otherwise identical sample size, and we got around that by (without knowing if that entirely correct or not) changing the covariance matrix to Compound Symmetry (by the way, don't let the fact that I know what it's called fool you - it has nothing to do with actually understanding what it does). The convergence problems are not very big at this point, but some comment on what might cause that would be appreciated. One of are hypotheses concerning that was that it might have to do with very small groups, for instance when one homozygote, say bb, is very rare and only present in about 100 subjects (yielding other groups with AA 6000 and Ab around 5000 or so). We encountered this mainly when we had three values for zygosity: 1 for MZ, 2 for DZ and 3 for unknown, where the subjects with 3 would be very few - removing those with unknown zygosity usually solved that problem. However, we would still appreciate some hint as to whether we are right, or if it's something else causing this.

2) Stopped because of infinite likelihood.

This is our main problem right now, and I seem unable to do much about it. Playing around with the covariance matrix sometimes fixes it, and sometimes excluding individuals with chromosomal abberations (only 11 individuals) also works, but my problem is that this solution is not consistent when performing other analyses (on other genotypes, or other scores (eg. ADHD)), where the structure of the data is the same. Also, I have no idea what is causing the problem, or if changing the covariance matrix is even something you're allowed to do. The only theory I have is that it is again a problem of small groups of individuals with one genotype (eg. bb being present in only 100-200 individuals), but that is somewhat contradicted by the fact that it sometimes works if I exclude no more than 11 individuals, which doesn't change the distributions that much.

I realize that this is a long email, and if you have read this far, I am indebted to you. I also realize that giving a down-to-earth layman's answer might be easier said than done, especially when you haven't seen the data. But any insight, clue, hint or even a good luck, would be enormously appreciated.

Kind regards,

Daniel Johansson

6 REPLIES 6
SteveDenham
Jade | Level 19

Hi Daniel,

Welcome to the wonderful world of mixed modeling.  Nothing like a complex design to cut your teeth on.

Let me try to address point 2 first, since it is stopping you most of the time right now.  The most common cause of infinite likelihood in repeated measures models is "duplicate" records by subject.  In your example data, this isn't a problem.  For each subject (FamilyID), you have two records, the repeated factor.  I am going to guess that in a dataset this large, there is at least one family with more than one set of twins.  All that it would take to cause this error is a single FamilyID with more than a single set.  If this isn't the case, then I would really be surprised.  The solution would be to recode FamilyID to eliminate this.

As far as point 1, the "Did not converge" problem, I would address this through increasing the number of iterations.  In PROC MIXED, the maxiter= option is part of the PROC MIXED statement; in GLIMMIX it is part of the NLOPTIONS statement.  I would recommend setting maxiter=100, and then looking at the iteration history if convergence is still not obtained.  I am going to guess (again!) that this occurs more often in GLIMMIX.  Repeated measures models of binomial distributions are notoriously bad for convergence.  However, there is good news, if you are willing to consider conditional models as opposed to marginal.  Try the following:

proc glimmix data=autism method=laplace;

class Genotype Tvab Zygosity FamilyID;

nloptions maxiter=100;

model AutismScore=Genotype /dist=binary link=logit OR;

random Tvab /subject=FamilyID group=Zygosity type=un;

lsmeans Genotype /ilink diff;

run;

For some of the reasoning behind the shift to a conditional model for the binary data, see Stroup's Generalized Linear Mixed Models (2013).  A look at the picture on the cover is a great start to understanding the difference in approach.

Good luck, and let us all know if this helped you out.

Steve Denham

Message was edited by: Steve Denham

DanielJohansson
Calcite | Level 5

Hi Steve,

I am very grateful for your answer, and have now tried out your suggestions. Unfortunately, not with any great luck.

The infinite likelihood problem is still present. I renamed the variables in my attempt at describing the data, so FamilyID is actually PairNR, that is a number that is unique for every twin pair - and there are, as far as I can find, no duplicates - one number occurs only once or twice (depending on whether the co-Twin is present or not). Furthermore, for some analyses I can get them to work by excluding only eleven individuals (on the basis of chromosomal abberations), but this is not consistently so.

The convergence problem did also not change. The maxiter option still made it stop after a number of iterations (the same number as before). You are also right in assuming that this problem only occurs with GLIMMIX. The conditional model code simply didn't work because SAS did not have enough memory to compute for 12.000 individuals. I attempted to include only boys in the analysis (about 6.000), and that yielded both convergence and computation, but no p-value and strange least marginal means: Three genotypes are present, dividing the participants in three groups, where group 1 and 2 had 17-18 least marginal means, and the third group had 47000.

Again, I am very grateful for your quick response, and understand the difficulty of trying to solve a problem like this over the Internet. If you have any other pointers they would be greatly appreciated.

Regards,

Daniel Johansson

SteveDenham
Jade | Level 19

Yikes.  That imbalance (17 to 18 vs. 47000) is the cause, I think, of the infinite likelihood.  And the extremely small incidence rate.

I do see one error that I made--Autism Score is a continuous variable, so to analyze it with a binomial distribution is a mistake.  If analysis of AutismScore is still a problem, then look at HPMIXED, which is really good for sparse matrices.

So, the proper code should have been:

proc glimmix data=autism method=laplace;

class Genotype Tvab Zygosity PairNR;

nloptions maxiter=100;

model AutismCutOff=Genotype /dist=binary link=logit OR;

random Tvab /subject=PairNR group=Zygosity type=un;

lsmeans Genotype /ilink diff;

run;

The good news is that this worked on a small dataset (25 pairs, mimicking the sample data).  The bad news is that the failure is almost surely related to the way the data falls into sparseness.

Is there anything you can aggregate cases on, so that you could use the events/trials syntax?  That might help.

Steve Denham

DanielJohansson
Calcite | Level 5

Hello again,

Thank you again for your speedy replies - I am impressed. I have both good and bad updates, but having fiddled with this and read something from the SAS Global Forum, I think we may be on to something.

The GLIMMIX still doesn't work, but I am thinking it's a question of simple computational power, so I will leave that on ice for the moment. I tried out the HPMIXED procedure, trying to write some code like the all-thumbs-carpenter I am (apart from mixed modelling, I also appreciate metaphors). The good news is, all the analyses seem to work like a charm with this procedure - what I would like to ask is the following:

1) Is my code correctly compiled? At the moment, it looks like this:

proc hpmixed data=autism;

class zygosity pairnr tvab genotype;

model autismscore=genotype / ddfm=residual;

repeated tvad / subject=pairnr group=zygosity type=un;

test genotype;

(a where statement for exclusion purposes)

lsmeans genotype / diff;

(title)

run;

I hope that my previous explanations have sufficiently explained what I want to do, and any input on whether the code is correct or not would be appreciated.

2) On the SAS webpage, it says that HPMIXED is expreimental. Simply put: how much should I care about that? Does it mean that I should exercise more caution with the results, and in that case how?

3) Having perused the web, I happened upon a paper from the SAS Global Forum, called "Tips and Strategies for Mixed Modelling" (http://support.sas.com/resources/papers/proceedings12/332-2012.pdf). I won't say that I understand everything, but one suggestion in that paper is to run HPMIXED to estimate covariance parameters, and the run MIXED with these parameters specified through a PARMS statement. My question is two-fold:

Is this a viable method for our analyses, that is, will those results be acceptable to the scientific community (regarding only the statistical method, no other considerations)? In short: is it correct to do so, and is it "better" than just sticking with the HPMIXED results?

Is this something that may be applied to GLIMMIX, perhaps helping to solve the convergence problems we have had?

Again, I am much obliged.

Kind regards,

Daniel Johansson

SteveDenham
Jade | Level 19

Hi Daniel,

Strangely enough, we have just had a similar problem that was solved by going to HPMIXED (not a genetic problem, but a really sparse repeated measures over time).  For the continuous variable, this is a good approach, and if you have access to SAS/STAT12.1, HPMIXED is no longer considered experimental.

I think the output from HPMIXED will have everything you need.  In my opinion, the porting of covariance parameters to MIXED or GLIMMIX is a great way to get at the ESTIMATE and  LSMESTIMATE statements, and the ADJUST= option in the LSMEANS statement, but if you do not need these, then it is a lot of effort for minimal return.

Steve Denham

DanielJohansson
Calcite | Level 5

Hello again,

I wanted to tell you that HPMIXED so far has worked brilliantly, and to thank you for your amazing help. Still having problems with GLIMMIX, but I figure I will try to bash my head against the proverbial wall with that for a while, before turning to the forums again.

Again, I am greatly in debt to you!

Kind regards,

Daniel Johansson

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1402 views
  • 0 likes
  • 2 in conversation