Hello all,

I am a VERY green SAS user, having been handed a PROC MIXED script from the powers that be - or rather, that were, considering they have left the scene - and I have precious few, if any, people to turn to for advice on this sort of problem. I have tried to search the forum for problems like this, and I understand that they are not completely rare, but also that they are usually very specific in nature, depending on the sort of data you have. For that reason, I am now submitting my own question, having banged my head against the proverbial wall for a couple of weeks. Any and all help would be greatly and humbly appreciated.

First off, let me restate that I am very, very new at this, and that my knowledge of statistics is practical at best - the mathematics and the lingo are still well beyond my grasp. At the moment I am approaching the different methods much like a 15-year-old might approach a carpenter's workshop: a hammer is used for one task, a screwdriver for another - and in my utter ignorance I am probably still sometimes using a screwdriver where I should have used a hammer. The metaphor is clumsy, but it sadly illustrates my position fairly well.

Having said that, here is my conundrum. I will try to be as specific as possible, and I hope you will forgive me if I become unnecessarily long-winded.

I am part of a research group that investigates genetic variation in a population of twins (n ≈ 12,000). We are looking for autism genes, testing for associations both with continuous scores (autism scores range from 0 to 17 in 0.5 increments, describing a spectrum of symptoms from what might be considered personality traits all the way to a full-blown diagnosis with severe morbidity) and with actual disease (using a cutoff that yields a case group and a control group). The genetics are represented as single nucleotide polymorphisms (essentially very small variations in DNA) that always divide the subjects into three groups depending on their genotype: AA, Ab, bb (A and b representing the two alleles). Usually people use twin populations in a way that takes advantage of the fact that they are twins, but we are doing the opposite: trying to statistically correct for, rather than exploit, the fact that monozygotic twins have exactly the same DNA, whereas dizygotic twins on average share only 50% of their segregating DNA.

The data is structured like this (variable explanations below):

TwinID  FamilyID  Tvab  Zygosity  Genotype  AutismScore  AutismCutOff
11      1         1     1         AA        4.5          1
12      1         2     1         AA        2.5          0
21      2         1     2         Ab        0            0
22      2         2     2         bb        4            0
31      3         1     2         AA        2            0
32      3         2     2         Ab        5            1
41      4         1     1         bb        0            0
42      4         2     1         bb        0.5          0

TwinID = individual ID
FamilyID = a number shared between two twins, signifying that they are related
Tvab = a number unique within each family (as you can see, together with FamilyID it yields the TwinID)
Zygosity = monozygotic (1) or dizygotic (2)
Genotype = explained above
AutismScore = the continuous score of autism traits/symptoms
AutismCutOff = flags cases (1) and controls (0)
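In case it helps anyone experiment with the structure, here is a minimal DATA step that recreates the eight example rows above (a sketch only - the real data is of course read from our actual files):

/* Sketch: recreates the example rows above for illustration */
data autism;
   input TwinID FamilyID Tvab Zygosity Genotype $ AutismScore AutismCutOff;
   datalines;
11 1 1 1 AA 4.5 1
12 1 2 1 AA 2.5 0
21 2 1 2 Ab 0   0
22 2 2 2 bb 4   0
31 3 1 2 AA 2   0
32 3 2 2 Ab 5   1
41 4 1 1 bb 0   0
42 4 2 1 bb 0.5 0
;
run;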
As I said, the population contains around 12,000 individual twins, making up somewhat fewer than 6,000 complete pairs - some twins are in the data without their co-twin. The twins were not selected based on having autism, but screened from all twins born, which means that those who end up in the case group are relatively few (about 300-400). Also, the number of twins that receive any score at all is rather small (I cannot recall the exact number right now), so the mean score for each genotype is very low, since at least 8,000 individuals score 0.

In order to "ignore" the fact that these are twins, we have used the following script for the continuous outcome:

proc mixed data=autism;
   class Genotype Tvab Zygosity FamilyID;
   model AutismScore = Genotype / ddfm=satterthwaite;
   repeated Tvab / subject=FamilyID group=Zygosity type=un;
   lsmeans Genotype / diff;
run;

For the case/control analysis, we have used the following (to the same end; note that the binary outcome here is AutismCutOff, not the continuous score):

proc glimmix data=autism;
   class Genotype Tvab Zygosity FamilyID;
   /* event='1' makes GLIMMIX model the probability of being a case */
   model AutismCutOff(event='1') = Genotype / dist=binary link=logit oddsratio;
   random Tvab / subject=FamilyID group=Zygosity type=un residual;
   lsmeans Genotype / diff;
run;

The problem is the following. When performing a fair number of these analyses, the procedure ends with one of two different messages:

1) "Did not converge." We sometimes encountered this with a smaller but otherwise identical data set, and we got around it (without knowing whether that is entirely correct or not) by changing the covariance structure to compound symmetry (by the way, do not let the fact that I know what it is called fool you - it has nothing to do with actually understanding what it does). The convergence problems are not very big at this point, but some comment on what might cause them would be appreciated. One of our hypotheses was that it might have to do with very small groups, for instance when one homozygote, say bb, is very rare and present in only about 100 subjects (leaving the other groups at roughly 6,000 AA and 5,000 Ab). We encountered this mainly when we had three values for zygosity - 1 for MZ, 2 for DZ and 3 for unknown - where the subjects coded 3 were very few; removing those with unknown zygosity usually solved the problem. Still, we would appreciate a hint as to whether we are on the right track, or whether something else is causing this.

2) "Stopped because of infinite likelihood." This is our main problem right now, and I seem unable to do much about it. Playing around with the covariance structure sometimes fixes it, and sometimes excluding individuals with chromosomal aberrations (only 11 individuals) also works, but this solution is not consistent across other analyses (on other genotypes, or other scores, e.g. ADHD) where the structure of the data is the same. Also, I have no idea what is causing the problem, or whether changing the covariance structure is even something you are allowed to do. The only theory I have is that it is again a problem of small groups with one genotype (e.g. bb being present in only 100-200 individuals), but that is somewhat contradicted by the fact that the run sometimes succeeds if I exclude no more than 11 individuals, which hardly changes the distributions at all. Two sketches of what I have tried, or am thinking of trying, follow below.
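To make the small-groups hypothesis concrete, this is the kind of check I could run to see how sparse the genotype cells actually get by zygosity and case status (a sketch using the variable names above):

/* Sketch: tabulate the cell sizes that might explain the convergence problems */
proc freq data=autism;
   tables Genotype*Zygosity     / norow nocol nopercent; /* rare homozygotes within MZ/DZ */
   tables Genotype*AutismCutOff / norow nocol nopercent; /* rare homozygotes among cases */
run;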
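And for completeness, this is what I mean by "changing the covariance matrix to compound symmetry": the same PROC MIXED call with type=cs instead of type=un, which estimates fewer covariance parameters and has sometimes converged for us where type=un did not (again, I do not know whether this is statistically defensible for our design):

proc mixed data=autism;
   class Genotype Tvab Zygosity FamilyID;
   model AutismScore = Genotype / ddfm=satterthwaite;
   /* type=cs: a common variance and covariance per zygosity group instead of an
      unstructured within-pair matrix - fewer parameters, easier to fit */
   repeated Tvab / subject=FamilyID group=Zygosity type=cs;
   lsmeans Genotype / diff;
run;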
I realize that this is a long post, and if you have read this far, I am indebted to you. I also realize that giving a down-to-earth layman's answer may be easier said than done, especially when you have not seen the data. But any insight, clue, hint, or even a "good luck" would be enormously appreciated.

Kind regards,
Daniel Johansson