It is the problem, in the sense that you present it. I suppose if you had terabytes of memory that wouldn't be a problem.
What can be done? First, you need to consolidate some of these levels. You have region, family within region, and individual within family (= residual). An R-side approach in GLIMMIX may be useful (though I worry about twins under this approach). Consider family as a repeated measure within region. If you aggregate at the family level and assign a weight equal to the number of family members, this might work (no guarantees, though):
proc glimmix data=temp;
  nloptions maxiter=500 tech=nrridg;
  class toxin yeargrp SES f_region;
  model case = toxin yeargrp SES f_region / dist=poisson link=log
        offset=lnpyrs covb cl solution;
  random f_region / residual subject=family_id type=cs;
  weight famnumbers;
run;
There are some important considerations here.
First, you need a unique numeric family_id for each family so that it can be treated as a continuous variable. It would also be a good idea to sort the dataset by this variable.
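One way to create such an id is a sort followed by a counter in a DATA step. This is only a sketch, and it assumes a character variable fam_code identifies families in your data; substitute whatever variable actually does.

```sas
proc sort data=temp;
  by fam_code;
run;

data temp;
  set temp;
  by fam_code;
  /* sum statement: family_id is retained and incremented
     at the first record of each new family */
  if first.fam_code then family_id + 1;
run;
```

Because the data were just sorted by fam_code, they are already in family_id order afterward.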
Second, toxin, yeargrp, and SES may have different effects in each region. This model gives the marginal effects averaged over regions. To get region-specific effects, you would have to specify interactions.
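A sketch of what that MODEL statement might look like (only the MODEL statement changes; the LSMEANS request is an optional addition to get per-region means on the rate scale):

```sas
model case = f_region toxin yeargrp SES
      toxin*f_region yeargrp*f_region SES*f_region
      / dist=poisson link=log offset=lnpyrs covb cl solution;
lsmeans toxin*f_region / ilink cl;
```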
Third, case will have to be aggregated at the family level. This part is not too difficult using PROC MEANS, which can also produce the number of observations for each family (famnumbers). The join back onto the design matrix (the original dataset) may be the difficult part.
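A sketch of the aggregation and join, assuming the person-years variable is named pyrs, the data are sorted by family_id, and the covariates are constant within a family (if they vary within families, this join is exactly where the difficulty lies):

```sas
/* Sum cases and person-years per family; n(case)= gives famnumbers */
proc means data=temp noprint nway;
  class family_id;
  var case pyrs;
  output out=famagg (drop=_type_ _freq_)
         sum(case)=case sum(pyrs)=pyrs n(case)=famnumbers;
run;

/* Carry the family-level covariates from the first record per family */
data famdesign;
  set temp;
  by family_id;
  if first.family_id;
  keep family_id toxin yeargrp SES f_region;
run;

data famlevel;
  merge famagg famdesign;
  by family_id;
  lnpyrs = log(pyrs);  /* offset recomputed on the aggregated scale */
run;
```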
My fear is that if the clustered-SE approach failed in GENMOD due to memory constraints, then the same will happen here. Jack-knifing the big dataset into smaller sets that will run, and then consolidating the results, might be an approach to consider in that case.
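A sketch of that split-and-combine idea, keeping whole families within each subset. The number of subsets (10) and the simple averaging of estimates are assumptions on my part; a proper jackknife would also adjust the standard errors accordingly.

```sas
data split;
  set temp;
  group = mod(family_id, 10) + 1;  /* 10 subsets; families stay together */
run;

proc sort data=split;
  by group family_id;
run;

proc glimmix data=split;
  by group;
  class toxin yeargrp SES f_region;
  model case = toxin yeargrp SES f_region / dist=poisson link=log
        offset=lnpyrs solution;
  random f_region / residual subject=family_id type=cs;
  ods output ParameterEstimates=pe;
run;

/* Combine: average the per-subset estimates for each effect */
proc means data=pe mean stderr;
  class effect;
  var estimate;
run;
```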
SteveDenham