01-17-2015 01:19 AM
I am running a simulation that involves generating data at two time points, and then running a regression to determine the power of the study parameters to detect associations. The data for each time point are generated from negative binomial distributions, with the parameters for the first time point set to some values based on observed data, and the parameters for the second time point set so that the expected value of the distribution for each subject is equal to that subject's value at the first time point, and the variance is some multiple (less than 1) of the variance from the observed data. As an example, the p and k parameters for the first time point might be p=0.005 and k=0.042 (giving an expected value of about 8 and a variance of about 1550). If the value generated for a subject is 5, the p and k parameters will be set for the second time point so that the expected value is 5 and the variance is, say, 0.5*1550 = 775. For a subset of the samples, a multiplicative effect is applied to the time 2 data which I am trying to detect with a regression.
Now on to the issue:
Using the simulated data, I am then running a negative binomial regression with GEE. At some point, I start getting floating point overflow errors and proc genmod halts. It seems to be happening at random; I can't figure out which set of parameters is causing the error, and it seems to happen to different sets of parameters each time. My regression code is as follows:
PROC GENMOD DATA=dat;
BY trial params;
MODEL val=x time x*time / DIST=nb LINK=log OFFSET=log_count INTERCEPT=0 INITIAL=0 0 0;
REPEATED SUBJECT=id / TYPE=cs;
Where trial is the trial ID from the simulation, and each trial generates say 1000 values each from a set of 100 different parameters, with the parameter set being denoted by the variable params. x is the marker for whether or not the multiplicative effect was applied to that subject. What I am testing is whether the interaction between x and time is significant. An example of what one trial of the simulated data may look like is attached (obviously since I am trying to reproduce the error I was unable to do so with this example, as seems to always happen, but the exact same run with the same seed failed when I ran it a previous time). I expect to get many of the non-convergence errors, given the fact that there are so many 0's in the data, but I can't figure out why I'm getting the floating point overflows.
Any advice would be greatly appreciated!
01-19-2015 03:28 PM
Don't know if it will help, but have you looked at doing this in PROC GEE? It is a bit more optimized. I thought of HPGENSELECT but it doesn't accommodate GEE models. All I can suggest is peeling this down to something smaller than 100K records per trial, and building back up as needed.
01-19-2015 03:41 PM
Thank you Steve,
PROC GEE is a great idea, I will try that and report back with the results. In terms of using a smaller number of records per trial, I've tried all sorts of different sample sizes, and found that the errors actually tend to occur with smaller, not larger, samples, which makes me think it has something to do with errors in convergence. That being said, there is some degree of what seems like randomness about what causes the errors.
01-19-2015 06:49 PM
So I tried using PROC GEE instead of GENMOD, and I found that, if there are more than ~500 values from the parameters per trial, things go fine, but any value under that and it seems to keep running forever. Interestingly, 500 seems to be the sweet spot for GENMOD, where values under that tend to give floating point errors. Could it be that I am getting floating point errors because of non-convergence? Is there a way to avoid that?
01-20-2015 01:56 PM
You could be running into a quasi-separation situation, such that one time-point is all in one response category. Then CS has a hard time fitting that point in, lots of tries are made, and eventually you do blow things up. Only explanation I can come up with.
Other approaches--try it in GLIMMIX with method=laplace, fitting the repeated nature as a G-side matrix rather than as an R-side. With only two time points, you may wish to consider an unstructured covariance matrix. If so, try type=CHOL to get the Cholesky root, which is at least positive semidefinite, and will help avoid convergence issues.