agewell150
Fluorite | Level 6

Hello,

To start off, I am using the full version of SAS, not the University Edition. I have reasonable processing resources (32 GB of RAM and a modern processor), and I have updated the config file to maximize the RAM available for analysis.

 

In spite of these efforts, I continue to get 'integer overflow' errors and am unable to run this PROC MIXED code due to insufficient memory.

 

I have 2 questions for the esteemed experts who frequent this forum.

     1) Is there anything in my code that strikes you as particularly resource-demanding that I might be able to modify?

     2) Is it possible that elements of how I have structured the 2 time parameters are problematic?

 

This is part of a large, longitudinal analysis. Participants have from 1 to 14 assessment times. In the PROC MIXED model, both independent variables are time parameters. The first, age_curve, is the subject's age at the time of assessment. The second, time_discharge, is the time in years, to one decimal place, before a critical event (i.e., discharge from one program to another).

 

I am running this analysis as a change point model. The age_curve acts as a sort of control group, and the time_discharge is for only those who fall out of the main program and are discharged. The dependent variable is a measure of functional independence, which I'll call IND.

 

One thing that has me curious is whether the decision to leave measurement occasions prior to the change point in with the age_curve variable is causing excessive memory demands. For example, subject x has assessments at 1, 2, 3, 4, 5, and 6 years. I set the change point at 3.5 years, so assessments 4, 5, and 6 are assigned to the time_discharge variable while assessments 1, 2, and 3 remain in the age_curve variable. Would splitting the subjects across these 2 variables be problematic in terms of processing demands?
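For concreteness, the split described above can be coded as a linear spline in a DATA step. This is only a sketch of one standard change-point parameterization, not necessarily the poster's exact coding: the variable names assess_age and yrs_to_discharge are hypothetical, and yrs_to_discharge is assumed to count down toward the discharge event.

```sas
/* Sketch: fixed changepoint 3.5 years before discharge.
   assess_age       = age at assessment (hypothetical name)
   yrs_to_discharge = years remaining before discharge (hypothetical name) */
data have_coded;
    set have;
    /* age_curve runs across all occasions */
    age_curve = assess_age;
    /* time_discharge is 0 before the changepoint, then counts years
       elapsed past it, so a slope change applies only post-changepoint */
    time_discharge = max(0, 3.5 - yrs_to_discharge);
run;
```

In this coding the pre-changepoint occasions contribute only to the age_curve slope (time_discharge is zero there), which matches the description of assessments 1-3 "remaining in" age_curve.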

 

I have data from about 150,000 participants, and since the average number of assessments is about 4, roughly 600,000 total observations.

 

/* Random-coefficients change point model with two time parameters */
proc mixed data=have noclprint noitprint covtest method=ml;
    class IDVar;                                             /* subject identifier */
    model IND = age_curve time_discharge / solution notest;  /* fixed effects */
    random intercept age_curve time_discharge / type=un subject=IDVar gcorr;
run;

1 ACCEPTED SOLUTION
agewell150
Fluorite | Level 6

As it turns out, my split-plot design has imbalanced groups. I was able to adequately address this imbalance by moving from the default 'containment' method for denominator degrees of freedom to the Fai-Cornelius approximation (ddfm=satterth). I found this solution in Schaalje et al., "Approximations to Distributions of Test Statistics in Complex Mixed Linear Models Using SAS PROC MIXED" (see attached, page 2).
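Applied to the code from the original post, the fix is a single option on the MODEL statement:

```sas
proc mixed data=have noclprint noitprint covtest method=ml;
    class IDVar;
    /* ddfm=satterth replaces the default containment method with the
       Fai-Cornelius (generalized Satterthwaite) approximation */
    model IND = age_curve time_discharge / solution notest ddfm=satterth;
    random intercept age_curve time_discharge / type=un subject=IDVar gcorr;
run;
```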

 

This solution radically improved the processing resource demands: the model went from taking over 15 minutes to process 5% of my sample to running the entire sample in under 2 minutes. This was extremely helpful in my context, as I am using an enumerative iteration process in the macro, running 70 iterations across 6 conditions, i.e., I needed to run the entire sample 420 times (now complete). I also believe the results are more accurate using this new method.

 

Any further thoughts or feedback are most welcome!


6 REPLIES
Rick_SAS
SAS Super FREQ

You might want to read the paper "Massive Mixed Modeling with the HPMIXED Procedure" (Wang and Tobias, 2009), which describes how you can use the HPMIXED procedure to fit large linear mixed models. After that paper was written, SAS developed the HPLMIXED procedure, which not only uses sparse matrix computations but is also multithreaded. Here is a link to the HPLMIXED documentation.
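As a rough sketch, the model from the original post translates to PROC HPMIXED with largely the same statements. Note that HPMIXED supports only a subset of PROC MIXED's options and covariance structures, so check the HPMIXED documentation to confirm that type=un and the other options you rely on are available in your release:

```sas
/* Sketch: same model via the sparse-matrix HPMIXED procedure.
   Options such as gcorr are not carried over; verify TYPE= support
   in the HPMIXED documentation before running at full scale. */
proc hpmixed data=have;
    class IDVar;
    model IND = age_curve time_discharge / solution;
    random intercept age_curve time_discharge / type=un subject=IDVar;
run;
```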

agewell150
Fluorite | Level 6

Thanks for the feedback, Rick.

I have dabbled with the HPMIXED approach and will continue educating myself about it.

 

Is it safe to infer from your comment that you don't see anything unusual in the syntax I posted? Also, do you have any thoughts about my decision to split the subject observations across the two time variables?

 

All feedback is welcome.

 

 

Rick_SAS
SAS Super FREQ

I think I will leave the statistical analysis of your design to others who have more experience with large-scale mixed models. My naive comments are:

1. It seems to me that specifying an unstructured covariance matrix for 150,000 participants is a time- and memory-intensive undertaking.

2. If the average number of repeated measurements per individual is only 4, it seems like you are fitting a lot of parameters with very little data. But perhaps I am misunderstanding your data.

 

Maybe an expert like @sld will be able to offer you her opinions.

sld
Rhodochrosite | Level 12

Do all subjects have the same changepoint (e.g., 3.5) or does the changepoint differ among subjects?

Do you know what the changepoint is, or does it need to be estimated (i.e., is it a model parameter)?

Do all subjects have a changepoint (do all subjects experience a critical event)?

Can you reasonably expect that the distribution of IND (conditional on the predictors) is normal?

 

You've specified a random coefficients model with type=un that will estimate 3 variances (intercept, slope with age_curve, slope with time_discharge) and 3 covariances. That's a relatively small number of estimates, and you have a lot of data, so in and of itself, I don't think the random statement is overly ambitious.

 

But I am puzzled by the model statement: I am not convinced that it does what I think you want (but see my questions above), or that the data structure is compatible with the model specification. That might be a cause of your problem.

 

You might also have a problem with subjects that have a limited number of observations. In a sense, the model statement fits a multiple regression to each subject, and estimation will not be well supported if a subject has only one (or a few) observations, or if it lacks enough observations in both the pre- and post-event periods. So that might also be a cause of your problem.
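One quick way to screen for such poorly supported subjects is to count observations per subject on each side of the changepoint. This sketch assumes the data have already been coded so that time_discharge is 0 before the changepoint, and uses a hypothetical dataset name have_coded:

```sas
/* Count pre- and post-changepoint assessments per subject.
   In PROC SQL a comparison evaluates to 1/0, so SUM() counts matches. */
proc sql;
    create table subj_support as
    select IDVar,
           sum(time_discharge = 0) as n_pre,   /* occasions before changepoint */
           sum(time_discharge > 0) as n_post,  /* occasions after changepoint  */
           count(*)                as n_total
    from have_coded
    group by IDVar;
quit;

/* Cross-tab of pre vs. post counts: cells with 0 or 1 on either side
   flag subjects whose slopes are weakly supported */
proc freq data=subj_support;
    tables n_pre*n_post / norow nocol nopercent;
run;
```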

 

I suspect that your model specification needs more thought and revision.

 

I hope this helps.

agewell150
Fluorite | Level 6

Thanks SLD.

 

The proc mixed syntax is part of a much larger macro that estimates various change points. The model is already running well using a subset of the data and findings are consistent with my hypothesis. My challenge is specific to getting the entire dataset to run without memory overflow.

I have a solid theoretical argument for eliminating participants with only one assessment, although my understanding is that with ML estimation this shouldn't be necessary.

 

I have read that the containment method for degrees of freedom may be resource intensive, so I'll look at options. I also appreciate a previous comment about the unstructured covariance matrix contributing to resource demands. Examining these options within the HPMIXED approach sounds promising so I'll get after these options in the next day or two.

 

Stay tuned 🙂

 


