BookmarkSubscribeRSS Feed
aisley
Calcite | Level 5

I am trying to analyze a dataset where each subject has 12 repeated measures (quarterly over 3 years). I want to extract subject specific estimates of the time slope to evaluate if the subjects are changing significantly over time.

The code I currently have consistently suggests that each subject is demonstrating a highly significant increase over time. This seems unlikely but I'm not sure how to adjust my syntax to run a more accurate model. Does anyone know how/why this model would find the slope coefficient for time significant for all cases?

A quick description of the study: We are creating a trending report which should flag procedure codes (subjects) that are showing a significant increase in the number of times it was billed over the time period being analyzed (3 years, by quarter). The outcome variable is being treated as a count (bounded at 0 but not necessarily whole numbers).

        %macro Zeroes(numzeroes);

         %local i;

         %do i = 1 %to %eval(&numzeroes-1);

       0

         %end;

       1;

      %mend;

   

      %macro EstimateStatement(numsubjects=);

         %local i;

        proc glimmix data=procdata11;

           class code;

           model billing_count=period_count / dist=NB link=log

           solution ddfm=betwithin;

           random intercept period_count / sub=code type=AR(1);

           random _residual_;

       %do i = 1 %to &numsubjects;

       estimate "Slope for Code &i" period_count 1 | period_count 1 / subject %Zeroes(&i);

        %end;

       

         ods output estimates=sscoeff;

         run;

      %mend;

      %EstimateStatement(numsubjects=&num_codes)

Any help on making this model more accurate and efficient would be greatly appreciated!

Thanks!

16 REPLIES 16
SteveDenham
Jade | Level 19

When interested in subject specific estimates of this type, why go to the trouble of doing a broad inference space (random effects) approach?  I believe you would be much more satisfied with the variable 'code' as a fixed effect.  The model would then look something like

model billing_count=code code*period_count/noint dist=nb solution ddfm=bw;

random period_count/residual type=ar(1) subject=code;

See how this works.

Steve Denham

aisley
Calcite | Level 5

Hi Steve,

Thanks for your suggestion! I tried running the model with your syntax and generated the following error:

ERROR: Continuous effects are not allowed in R-side random effects.

None of my covariates are continuous. It seems like it is occurring in response to my specification of 'residual'. Do you have any thoughts on why this may be happening?

SteveDenham
Jade | Level 19

Aha.  Period_count (the X variable for the slopes) is a continuous variable, and I flat missed it.  One way around this is with an EFFECT statement, and moving period_count into the class statement.  Something like:

proc glimmix data=procdata11;

           class code period_count;

effect slopelinear=poly(period_count/degree=1);

model billing_count=code code*slopelinear/noint dist=nb solution ddfm=bw;

random period_count/residual type=ar(1) subject=code;

run;

Note that this is much like example 44.15 in the SAS/STAT13.2 documentation: Comparing Multiple B-Splines, except I am substituting a linear polynomial.

The only thing that scares me about this is that period_count may have a LOT of levels.  Be sure to sort the data by code and period_count before trying this.

Steve Denham

Message was edited by: Steve Denham

aisley
Calcite | Level 5

Thanks for your help, Steve!

I followed your suggestion (and some of the suggestions you posted in related questions) and created a duplicate time variable (period_count_continuous) to use in generating the linear polynomial :

proc glimmix data=procdata13;

      class code period_count;

     effect slopelinear=poly(period_count_continuous/degree=1);

      model proc_count=code code*slopelinear/noint dist=NB solution ddfm=bw;

     random period_count /residual type=ar(1) subject=code;

     nloptions maxiter=1000 gconv=1e-4;

   run;

However, now I am running into convergence problems. Things I have tried to remedy this:

1) Log transformed the dependent variable

2) Used nloptions to alter the convergence parameters

3) Sorted my data set by code and period_count (Code is a non-numeric variable so I could not remove it from the class statement. Period_count has 12 levels.)

I have attached a screen shot of my iteration history and it does not appear that I am close to convergence.  I've seen your posts in several other questions regarding this topic, do you have any other thoughts on best ways to tackle these convergence issues?

07-08-2015 4-07-47 PM.png

  Thanks for all your help!

SteveDenham
Jade | Level 19

Well, the way the objective function is jumping around, I would say the model is misspecified--it is trying everything to get things to fit, and nothing good is happening.

Time to try "sneaking up" on an approach.  With 12 levels of period_count, I suspect that it is a month indicator.  What you may have is some sort of seasonality in the data, so that an enforced linear approach in the model statement leads to the problem.  For now, try:

proc glimmix data=procdata13;

      class code period_count;

         model proc_count=code period_count code*period_count/ dist=NB solution ddfm=bw;

     random period_count /residual type=ar(1) subject=code;

     nloptions maxiter=1000 gconv=1e-4 tech=nrridg;

     lsmeans code*period_count/ilink;

   run;

This may take a while to run, especially if there are a lot of levels for code.  If this works, take a look at the lsmeans--do they look like a linear trend would explain them?  If so, we might be able go after the code estimate for slope using an LSMESTIMATE statement, or by post-processing with PROC REG.  Also, what is going on with the standard errors?  Are any seriously suspicious?

Steve Denham

aisley
Calcite | Level 5

My period values are quarterly over 3 years.

I tried your suggestion and my latest attempt produces a different error:

WARNING: Obtaining minimum variance quadratic unbiased estimates as starting values for the covariance parameters failed.

This occurs after running the following code:

proc glimmix data=procdata13;

      class code period_count;

      model proc_count=code period_count code*period_count/ dist=NB solution ddfm=bw;

      random period_count /residual type=ar(1) subject=code;

     nloptions maxiter=1000 gconv=1e-4 tech=nrridg;

     lsmeans code*period_count/ilink;

   run;

Based on other suggests I have read in related posts on this problem/error message, here is what I have tried:

1) I verified that all observations are uniquely specified by unique code values. Data is currently in long format. Would wide produce better results?

2) I tried using a multinomial distribution link=clogit.

3) I've tried various subsets of observations (as low as 88 unique codes and as high as 349 unique codes).

4) I wonder if the problem exists because of a poor ratio between # observations and columns in X? The following is a screen shot where I am evaluating 349 unique codes:

regressionresults3.png

SteveDenham
Jade | Level 19

Time to take another step back.  How many zeroes are in the data?  That sometimes can lead to problems getting good starting values.  I am pretty confident that the code is what you need, unless there is severe zero inflation, or something like quasi-separation (some level of code has nothing but zeroes).

To test out this, drop down to about 20 codes.  Try fitting with a normal distribution first just to see what happens.  If that runs without error, up the code count substantially, and look at the diagnostic plots available.  You may be in the situation that with the amount of data available to you, you may not need to specify a distribution.  I am hoping that the residuals converge in distribution to something near normal.  If this doesn't work, you'll probably need to look at other right-skewed distributions (lognormal maybe) or other PROCs to get a reasonable solution.

Or--see what happens with a conditional model:

proc glimmix data=procdata13 method=laplace;

      class code period_count;

         model proc_count=code period_count code*period_count/ dist=NB solution ddfm=bw;

     random period_count / type=ar(1) subject=code;

     nloptions maxiter=1000 gconv=1e-4 tech=nrridg;

     lsmeans code*period_count/ilink;

   run;

Steve Denham

jrbrauer
Fluorite | Level 6

You mentioned that you are treating your dependent variable as a count but that the observations are not necessarily whole numbers. How many cases are non-integers? It is possible you are running into some convergence/estimation problems due to applying the negative binomial distribution to a continuous variable.

Why do you have fractional observations? (How do you send a fractional bill to someone?) If it is from adjusting/correcting the original count, try modeling the original count variable and then adjusting for the correction using a covariate or an offset.

Good luck!

Jon

SteveDenham
Jade | Level 19

makes an important point here--to model this data with a negative binomial distribution probably misses the statistical process generating the data.  You just don't get fractional values without some sort of divisor/offset variable that is used to standardize things.  If you have that variable on your dataset, it's appropriate to use the natural log of the value as an offset variable, in which case the model statement would then become:

model proc_count=code period_count code*period_count/ dist=NB solution ddfm=bw offset=lntimevariable;

where lntimevariable is the log of the length of time that you are dividing by.  Now proc_count should be an integer count.  This could help convergence substantially, especially if you have a lot of zeroes and fractional values less than 1.  The more I think about this, the more I fear that you have a zero-inflated distribution of some sort.

Steve Denham

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

aisley,

I posted a response to your same question on StackOverflow  regression - SAS GLIMMIX subject estimates - Stack Overflow, and I'll copy it here since it appears you haven't seen it. Perhaps some of my comments will be useful in addition to those by Steve and Jon.

Maybe the positive slope is an actual feature of the data? What do you see if you plot billing_count versus period_count for each code?

Regarding the program, I have two suggestions.

(1) The use of type=AR(1) in

random intercept period_count / sub=code type=AR(1); 

forces the variance of the intercepts to be equal to the variance of the slopes. This constraint may be inconsistent with the data. AR(1) is not a sensible covariance structure for a random coefficients model. Try type=UN or type=UN(1).

(2) Drop

random _residual_; 

Its inclusion makes the model overspecified; the negative binomial distribution already has a scale parameter.

Another thing to consider is that a random coefficients model produces shrinkage estimators, such that estimates for individual codes are shrunk toward the overall solution: the estimates of slope that you obtain from the random coefficients model will not be equal to the estimates you would obtain from separate regressions for each code. Kreft et al. have a nicely intuitive presentation of this topic (see p14 here http://tinyurl.com/ns99ojh).

Including CODE as a fixed effects factor would address the shrinkage issue to some extent. But I am seriously doubtful about the appropriateness of including CODE as a fixed effects factor while also including CODE as a random subject effect. I'd think you would need to go one way or the other, but not both.

Susan

SteveDenham
Jade | Level 19

I am going to disagree a bit with here.  For a narrow inference space (estimates only for the codes in hand), I think specifying code as the subject should work fine in the random statement.  It should be the equivalent of a GEE model, such as could be fit with PROC GENMOD.  Actually, moving the whole estimation problem over to PROC GENMOD may be the best way to get around many of the issues being run into here.

Steve Denham

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Ah, yes, thanks, , I was still thinking about aisley’s first model that is a random coefficients model:

    proc glimmix data=dataset method=laplace;

    class code;

    model billing_count = period_count / dist=nb;

    random intercept period_count / subject=code type=un;

    estimate “Slope for Code 3” period_count | period_count / subject 0 0 1;

    run;

(I’m ignoring the issue of distributions, counts and offsets here for simplicity.) You would not use CODE in the MODEL statement and also as the SUBJECT in the RANDOM statement where the specification was “random intercept period_count”.

I totally agree with Steve that it is perfectly fine to use CODE as a fixed effects factor in the MODEL statement and as the SUBJECT in the RANDOM statement,

    proc glimmix data=dataset method=laplace;

    class code period_count;

    model billing_count=code period_count code*period_count / dist=NB;

    random period_count / type=ar(1) subject=code;

    run;

If the levels of CODE in aisley’s dataset are the complete set of codes of interest, then the CODE-as-fixed-effect-factor model seems like the better choice.

Susan

aisley
Calcite | Level 5

Thank you to everyone who has offered suggestions!

To answer some of the above questions- I am getting some fractional values because I am attempting to analyze medical billing data. For some procedures (like those involving anesthesia), billing is done in 15 minute increments (thus, we could end up with counts at 1.5 as a hour and a half of anesthetic administration). This is not generally the case, however. If I additionally log transformed these counts, would a lognormal distribution be better fit?

In regards to the concerns of zero inflation-This was absolutely a concern initially. However, I have filtered for a complete case analysis where there is observed data in all the time points.

I have tried fitting the data using proc genmod, glimmix, and mixed and am now confused about the relative merits of each. The following are examples of models which now successfully converge for my data set (using type=UN was preventing my model from converging so I reverted to AR(1)):

proc glimmix data=procdata14 plots=residualpanel(conditional marginal);

     class code;

     model proc_count=code period_count code*period_count / dist=lognormal solution ddfm=betwithin ;

     random period_count / sub=code type=AR(1);

run;

glimmix1.png glimmix2.png

proc genmod data=procdata14;

   class code;

   model  proc_count= period_count code code*period_count  /  dist=normal link=log ;

   repeated  subject=code / type=ar(1);

   output      out       = Residuals

                   resraw    = Resraw

                  stdreschi = Stdreschi

                  pred      = Pred;

run;

[The website prevented me from attaching the plots of these residuals]

SteveDenham
Jade | Level 19

Well, the GLIMMIX plots look pretty good, although it really looks like there is a nonlinear/quadratic effect that is not being fit in the model, based on the residual vs. predicted plot.

Be aware that there is a difference between a lognormal distribution, and a normal distribution with a log link.  For lognormal, the errors are assumed to be additive, while a normal with a log link essentially makes them multiplicative.  I suspect the plots from GENMOD may show this.

To put everything on the same basis, try running the following code in GLIMMIX, which would fit a normal distribution with a log link:

proc glimmix data=procdata14 plots=residualpanel(conditional marginal);

     class code;

     model proc_count=code period_count code*period_count / link=log solution ddfm=betwithin ;

     random period_count / sub=code type=AR(1);

run;

Steve Denham

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 16 replies
  • 2822 views
  • 0 likes
  • 4 in conversation