The dataset is not huge-- there are 52749 observations. After setting the -memsize option to 0, I'm able to get a solution on my desktop machine with 12GB RAM, but my laptop with 6GB can't solve the problem no matter what I try. After running with the -fullstimer option on the laptop I found that the elapsed time was significantly different from the CPU time, which I understand indicates a problem caused by insufficient RAM.
I'm surprised that I can't solve this problem with 6GB of RAM as it doesn't seem like an overly complex problem. SAS also returns the out of memory error almost instantaneously-- I would have thought it would take a while to run and then time out if the memory filled up, but I confess I don't know a lot about how SAS uses memory. Could there be something else going on here? For this problem it is not crucial, but I have other datasets where I am likely to encounter the same issue.
I'm running Windows 7 Professional on a Dell Studio XPS 9100 with an Intel Core i7 960 3.2GHz processor and 12GB RAM.
How many subjects do you have? My guess from here is that you have a lot of subjects and that you would be better off fitting a mixed model with random subject effects.
I also have to question your revised model in which you have a main effect for subject and the interaction between subject and yearnew, but no main effect for yearnew. In general, interaction terms should not be specified without specifying main effects for both of the variables that contribute to the interaction term. Of course, if my first assumption (about random subject effects) is correct, then you probably should be using the MIXED procedure and specifying
/* random intercept and random yearnew slope for each subject */
proc mixed data=mydata;
model lnwater = yearnew;
random intercept yearnew / subject=subject type=un;
run;
I have 5862 subjects-- we've run a mixed model with random subject effects and we were able to fit that OK. We have been advised by a statistician to use a pooled variance regression to obtain individual subject-specific equations because this best preserves the heterogeneity amongst subjects compared to subject equations extracted from the repeated measures model.
However, we may need to go the mixed-model route if in fact our computers do not have sufficient RAM to run a pooled variance regression with this many subjects. We just wanted to make sure that this was indeed a RAM limitation and not a problem with the software that could be fixed through tech support.
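One way to sanity-check whether this is plausibly a RAM limit is a back-of-envelope estimate. The sketch below uses Python purely for the arithmetic. It assumes the model with a main effect for subject and a subject-by-yearnew interaction expands to one intercept column plus two columns per subject, and that a dense cross-product matrix of 8-byte values is held in memory. How PROC GLM actually lays out its workspace is not confirmed here, so treat this as an order-of-magnitude guess; dense_xpx_bytes is a hypothetical helper name.

```python
# Back-of-envelope size of a dense cross-product matrix X'X for a model
# with one intercept column, one column per subject, and one
# subject-by-yearnew slope column per subject.
# Assumption (not confirmed for PROC GLM): the full square matrix is held
# in memory as 8-byte floating point values.

def dense_xpx_bytes(n_subjects: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for one dense copy of X'X under the assumptions above."""
    n_columns = 1 + 2 * n_subjects
    return n_columns * n_columns * bytes_per_value


if __name__ == "__main__":
    size = dense_xpx_bytes(5862)  # 5862 subjects, as stated in the thread
    print(f"{size / 2**30:.2f} GiB for one dense copy")
```

With 5862 subjects this comes to roughly 1 GiB for a single copy. If the procedure needs several working copies, or the session's memory cap is below physical RAM, a request of that size could fail before any fitting starts, which would be consistent with an immediate out-of-memory error.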
I am a bit puzzled by what is going on. Do you really want almost 6000 individual regression equations? How are you going to interpret results from so many separate individuals? I understand wanting to use a common pooled variance (sort of), but I don't see how to combine the information from this many individuals into any kind of comprehensible inference.
Is the final result some sort of parameter averaging approach?
I'm sorry that I don't have anything even resembling a solution, but I am having a hard time wrapping my mind around what might be accomplished with the approach that was suggested.
Hi Steve, thanks for the response. Yes, we want 6000 individual regression equations. We want to classify each subject as either decreasing, increasing or stable over time but use a pooled variance because all the subjects are in the same area so we want them to "borrow" information from each other when we decide whether or not the observed trend is significant. The final result will be as simple as the number of subjects that are increasing, decreasing and stable.
It may be clearer if I explain the context-- each subject is a lake area that has been measured over time. We have a lot of subjects but not many observations of each subject so we want to borrow power from having many subjects. We're interested to see whether or not the lakes are changing in size over time. The lakes are all from the same area and are subject to the same climatic fluctuations, but from the data they are not obviously correlated. The proc mixed approach gave us a global mean that tells us what is happening on average, but we found that at the level of individual lakes, the trend was often incorrect because it was being biased heavily towards the mean. We'd like to keep the regression fit for each lake but get a p-value that takes all the subjects into account-- we're hoping that pooling the variance will get us a result like this.
Hope that helps-- if you have a better suggestion for how we might achieve this and still use all the data we have then it would be great to hear it too.
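To illustrate why lakes with few observations get pulled so strongly towards the mean, here is a minimal Python sketch of the textbook normal-normal shrinkage formula. It is a simplification: it treats the slope on its own with known variances, whereas the predictions from a mixed model with an unstructured intercept/slope covariance are multivariate. The function name and all the numbers are hypothetical.

```python
def shrunken_slope(ols_slope: float, slope_se: float,
                   mean_slope: float, between_sd: float) -> float:
    """Precision-weighted compromise between a lake's own slope and the mean.

    weight -> 1 when the lake's own slope is estimated precisely,
    weight -> 0 when its standard error is large relative to the
    between-lake spread, so the estimate collapses onto the mean.
    """
    weight = between_sd ** 2 / (between_sd ** 2 + slope_se ** 2)
    return mean_slope + weight * (ols_slope - mean_slope)


if __name__ == "__main__":
    # Hypothetical lake: its own fit says it is shrinking (-0.5), but the
    # slope is noisy (SE 0.4) compared with the between-lake SD (0.2).
    print(shrunken_slope(ols_slope=-0.5, slope_se=0.4,
                         mean_slope=0.1, between_sd=0.2))
```

In this made-up case the weight on the lake's own slope is only 0.2, so a clearly negative individual slope ends up close to zero. That is the kind of sign change that would lead to a lake being misclassified.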
What would happen if you ran proc mixed as you did before, took the residual variance estimate, and then fit each individual lake separately with the residual parameter fixed at that shared value?
/heresy mode on
Or took the value as an informative prior on the residual error, and used the BAYES option in PROC GENMOD, with lake ID as a by variable. You would need some estimate of the variability of the residual error, but you could get that from PROC MIXED and calculate what you need. I think there is a pretty good worked example in the PROC GENMOD documentation. ODS output the posterior estimates, and then you could do all of the positive, negative, no change calculations needed.
By using the shared value as an informative prior, you might get around the memory problem of trying to do this all at once. It might take a full day to run, and the diagnostics can get pretty hairy, along with about thirty other things, but in my opinion it's an interesting approach to the shared variance idea. Plus you can see how much it moves from the common value for each ID--that might be another interesting classification exercise.
Re KSharp-- thanks for the response also. As above, yes, there are multiple data points for each subject.
We haven't used a REPEATED statement because we weren't happy with the proc mixed fit: there was too much shrinkage towards a global mean. In this case we are interested in classifying each subject individually and we don't want to mis-classify individual subjects just because they are biased by the mean behavior. We are trying to use proc glm to get individual fits for each subject that are not biased by the mean behavior (like doing an individual regression for each subject, but with a common pooled variance for all subjects rather than an independent variance for each subject).
I hope that helps clarify what we're trying to achieve. Thanks for commenting.
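The idea described above (a separate regression for each subject, tested against one pooled variance) can be sketched without building a single large design matrix. A model with a separate intercept and slope for every subject gives the same coefficients as fitting each subject on its own; the pooled residual variance is then the total residual sum of squares divided by the total residual degrees of freedom. The Python below is only an illustration of that arithmetic, not the SAS procedure itself. The function names are hypothetical, every subject is assumed to have at least two distinct time points, and the p-value uses a normal approximation, which assumes the pooled degrees of freedom are large.

```python
import math
from collections import defaultdict
from statistics import NormalDist


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse, sxx)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes at least two distinct x values
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse, sxx


def classify_trends(rows, alpha=0.05):
    """rows: iterable of (subject, x, y). Returns {subject: (slope, p, label)}."""
    data = defaultdict(lambda: ([], []))
    for subject, x, y in rows:
        data[subject][0].append(x)
        data[subject][1].append(y)

    fits = {s: fit_line(xs, ys) for s, (xs, ys) in data.items()}

    # Pooled residual variance: total SSE over total residual df,
    # where each subject contributes n_i - 2 degrees of freedom.
    total_sse = sum(fit[2] for fit in fits.values())
    total_df = sum(len(xs) - 2 for xs, _ in data.values())
    pooled_mse = total_sse / total_df

    results = {}
    for s, (_, slope, _, sxx) in fits.items():
        se = math.sqrt(pooled_mse / sxx)  # slope SE under the pooled variance
        t = slope / se
        # Normal approximation to the t distribution (assumes large total_df).
        p = 2 * (1 - NormalDist().cdf(abs(t)))
        if p >= alpha:
            label = "stable"
        else:
            label = "increasing" if slope > 0 else "decreasing"
        results[s] = (slope, p, label)
    return results


if __name__ == "__main__":
    # Made-up measurements for three hypothetical lakes.
    demo = [("A", x, y) for x, y in zip(range(4), [0.1, 0.9, 2.1, 2.9])]
    demo += [("B", x, y) for x, y in zip(range(4), [3.0, 2.1, 0.9, 0.1])]
    demo += [("C", x, y) for x, y in zip(range(4), [1.0, 1.1, 0.9, 1.0])]
    for lake, (slope, p, label) in classify_trends(demo).items():
        print(lake, round(slope, 3), label)
```

Because each subject is processed independently and only two running totals are shared, memory use grows with the number of rows rather than with the square of the number of subjects.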