Solved: Re: Non-normal data; PROC GLIMMIX

BUGstats · Posted 10-26-2022 02:47 AM

Want to evaluate difference between survival of larvae on different feed types. There are 3 feed types, 8 blocks and survival (alive) was measured once a week for 8 weeks (Repeated measures). When I originally ran my data I failed to meet the assumptions for normality.

Data WholeP;
Infile datalines dlm="09"x;
input Week Trmt Block Alive;
datalines;
0 1 1 10
0 1 2 10

...;

run;

Proc glimmix data=WholeP;
class Block Week Trmt;
model Alive = Trmt|Week / ddfm=kr;
random Week / subject= Block type=cs residual;
lsmeans Trmt*Week / slicediff=Week adjust=tukey lines;
output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
ods output lsmeans=lsdata;
run;

I tried to log transform my data but NOTHING has increased the Shapiro value and others do not converge.

Proc glimmix data=WholeP;
class Block Week Trmt;
model Alive = Trmt|Week / dist =gamma link=log ddfm=kr;
*/- did not got results with dist = negbn;
*/- did not got results with dist = binomial;
*/- did not got results with dist = poisson;
*/- did not got results with dist = beta;
*/- got results with dist = gamma but LSMEANS lines not fully complete;
random Week / subject= Block type=cs residual;
output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
lsmeans Trmt*Week / ilink slicediff=Week adjust=tukey lines;
ods output lsmeans=lsdata;
run;

I would like to see if there is a statistical significance between feed types at week 8 for example. I am out of ideas on how to transform the data to meet the assumptions for normality. Is there something in my code I can change or try?

SteveDenham · Posted 10-26-2022 02:21 PM

Actually, while the answer looks like something good, it has a problem with convergence and a nonpositive definite G matrix for these dates. The methods I tried had the same issue. So...

I fit the week/day variable with a spline that differed by trmt. The following is what worked remarkably well and converged rapidly:

proc glimmix data=WholeP;
   class trmt block;
   effect spl = spline(week);
   model Alive = trmt spl*trmt/ s noint dist=poisson;
   estimate 'trmt 1, week=7' trmt 1    trmt*spl [1,1 1] /ilink;
   estimate 'trmt 2, week=7' trmt 0  1 trmt*spl [1,2 1]/ilink;
   estimate 'trmt 3, week=7' trmt 0  0 1 trmt*spl [1,3 1]/ilink;
   estimate 'Diff trmt1 v trmt2 at week=7 ' trmt 1 -1 trmt*spl [1,1 1] [-1,2 1];
   estimate 'Diff trmt1 v trmt3 at week=7 ' trmt 1 0 -1 trmt*spl [1,1 1] [-1,3 1];
run;

No sense in applying the ilink to the differences, unless you are interested in "fold change", so they show as non-est in the output:

Type III Tests of Fixed Effects
Effect	Num DF	Den DF	F Value	Pr > F
Trmt	     3	   171	9.97	<.0001
spl*Trmt	17	   171	14.94	<.0001

So, this worked. I tried using an integration method rather than the default pseudo-likelihood method used here, but the results were identical.

EDIT: Note that the code is correct, but the labels are not - they all should read week=0.

SteveDenham

View solution in original post

PaigeMiller · Posted 10-26-2022 06:13 AM

Data does not have to be normally distributed to fit a linear regression model

Please read: https://blogs.sas.com/content/iml/2018/08/27/on-the-assumptions-and-misconceptions-of-linear-regress...

Are the residuals normally distributed?

--
Paige Miller

BUGstats · Posted 10-26-2022 09:55 AM

Hi PaigeMiller,

Sorry for confusion when I say something has failed normality I am always referring to the residuals.

Ksharp · Posted 10-26-2022 08:21 AM

Calling @StatDave @SteveDenham @lvm

Rick_SAS · Posted 10-26-2022 09:37 AM

It looks like the ALIVE variable is the number of larvae that survived. This is a count variable, not a continuous variable. The rejection of the normality assumption is because your response is discrete, not continuous.

You might want to Google "how to model count data." Your model assumes that the response variable is continuous, but it is not. There are special models you can use for count data. You might want to start by reading the example in the GLIMMIX documentation for binomial data.

SteveDenham · Posted 10-26-2022 09:38 AM

First, I want to support what @PaigeMiller said about residuals being normally distributed. For the non-Gaussian distributions you were examining, the check on normality in the residuals isn't as necessary, but it does let you know if the variance components are normally distributed.

Some other things to note: It appears the response variable is a count, which would imply that a Poisson or negative binomial distribution is a likely candidate, unless the counts get quite large, where the gamma treats the discrete values as continuous. No need to check binomial or beta, as they only have support on the zero to one interval.

Looking at your code, I don't see an NLOPTIONS statement. Many times for Poisson or negative binomial, you are going to need more than the default 20 iterations to get convergence. We routinely use NLOPTIONS=5000. The rest of the code looks like it should give you comparisons between the levels of Trmt at each Week, so I am a bit confused that the gamma distribution approach did not give you all of the lsmeans. Check the output for any messages that may be printed, as well as for WARNINGs or NOTEs in the log.

One more thing to consider is a generalized estimating equation approach using PROC GEE. Your current code is fitting a marginal model in GLIMMIX, so switching to PROC GEE (or GENMOD for a wider selection of possible distributions) may solve your problems. Note that the LSMEANS statement in GEE doesn't support the slicediff= option. LSMESTIMATE or SLICE statements should give you the specific within week comparisons you want.

SteveDenham

BUGstats · Posted 10-26-2022 10:37 AM

Hi @SteveDenham ,

GREAT answer. As I clarified with @PaigeMiller I meant that the residuals failed normality. I too have looked at poisson transformation but it does not converge. I have provided my entire code (Note the full data can be found in the attached text file) and the entire log output below. I haven't been able to evaluate any LSmeans with transformed data. Additionally, I am unfamiliar with where in the code to add the NLOPTIONS statement. I did some quick reading and one source said GLIMMIX would ignore a NLOPTIONS statement?

*/Note the variable Week should really say Day;
Data WholeP;
Infile datalines dlm="09"x;
input Week Trmt Block Alive;
datalines;
0 1 1 10
0 1 2 10
0 1 3 10
...
;
Run;

Proc glimmix data=WholeP;
class Block Week Trmt;
model Alive = Trmt|Week / dist =poisson link=log ddfm=kr;
random Week / subject= Block type=cs residual;
  output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
  lsmeans Trmt*Week / ilink slicediff=Trmt adjust=tukey lines;
  ods output lsmeans=lsdata;
run;

*/ need to check model assumptions before moving forward;
/*  Linearity of fixed effects - both as a scatter and a boxplot */
Proc sgplot data=second;
  vbox smresid / group=Trmt datalabel;
Run;

Proc sgplot data=second;
  vbox smresid / group=Week datalabel;
Run;

/* Homogeneity of effects */
Proc sgscatter data=second;
  plot studentresid*(pred Trmt Week);
Run;

/* Q-Q plot and Shapiro-Wilk for normal distribution */
Proc univariate data=second normal plot;
  var studentresid;
  histogram studentresid / normal kernel;
Run;

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 NOTE: ODS statements in the SAS Studio environment may disable some output features.
 69         
 70         */Note the variable Week should really say Day;
 71         Data WholeP;
 72         Infile datalines dlm="09"x;
 73         input Week Trmt Block Alive;
 74         datalines;
 
 NOTE: The data set WORK.WHOLEP has 192 observations and 4 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              667.50k
       OS Memory           24228.00k
       Timestamp           10/26/2022 02:16:17 PM
       Step Count                        30  Switch Count  2
       Page Faults                       0
       Page Reclaims                     128
       Page Swaps                        0
       Voluntary Context Switches        10
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           264
       
 
 267        ;
 268        Run;
 269        
 270        
 271        Proc glimmix data=WholeP;
 272        class Block Week Trmt;
 273        model Alive = Trmt|Week / dist =poisson link=log ddfm=kr;
 274        random Week / subject= Block type=cs residual;
 275          output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
 276          lsmeans Trmt*Week / ilink slicediff=Trmt adjust=tukey lines;
 277          ods output lsmeans=lsdata;
 278        run;
 
 NOTE: With DDFM=SATTERTHWAITE or DDFM=KENWARDROGER, unadjusted p-values in tests are based on the degrees of freedom specific to 
       that comparison. P-values that are adjusted for multiplicity, however, are by default based on the denominator degrees of 
       freedom for the Type III test of the fixed effect. If you specify the ADJDFE=ROW option in the LSMEANS or LSMESTIMATE 
       statement, the adjusted p-values take into account the row-wise degrees of freedom.
 
 
 NOTE: Did not converge.
 WARNING: Output 'lsmeans' was not created.  Make sure that the output object name, label, or path is spelled correctly.  Also, 
          verify that the appropriate procedure options are used to produce the requested output object.  For example, verify that 
          the NOPRINT option is not used.
 NOTE: The data set WORK.SECOND has 192 observations and 9 variables.
 NOTE: PROCEDURE GLIMMIX used (Total process time):
       real time           0.25 seconds
       user cpu time       0.26 seconds
       system cpu time     0.00 seconds
       memory              4451.57k
       OS Memory           27896.00k
       Timestamp           10/26/2022 02:16:17 PM
       Step Count                        31  Switch Count  9
       Page Faults                       0
       Page Reclaims                     2426
       Page Swaps                        0
       Voluntary Context Switches        58
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           1088
       
 
 279        
 280        */ need to check model assumptions before moving forward;
 281        /*  Linearity of fixed effects - both as a scatter and a boxplot */
 282        Proc sgplot data=second;
 283          vbox smresid / group=Trmt datalabel;
 284        Run;
 
 NOTE: PROCEDURE SGPLOT used (Total process time):
       real time           1.68 seconds
       user cpu time       0.05 seconds
       system cpu time     0.01 seconds
       memory              8125.96k
       OS Memory           33324.00k
       Timestamp           10/26/2022 02:16:19 PM
       Step Count                        32  Switch Count  1
       Page Faults                       0
       Page Reclaims                     2703
       Page Swaps                        0
       Voluntary Context Switches        267
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           664
       
 WARNING: There are insufficient nonmissing observations to create a boxplot.
 WARNING: The BoxPlot statement named 'VBOX' will not be drawn because one or more of the required arguments were not supplied.
 WARNING: A blank graph is produced. For possible causes, see the graphics template language documentation.
 NOTE: There were 192 observations read from the data set WORK.SECOND.
 
 285        
 286        Proc sgplot data=second;
 287          vbox smresid / group=Week datalabel;
 288        Run;
 
 NOTE: PROCEDURE SGPLOT used (Total process time):
       real time           0.07 seconds
       user cpu time       0.04 seconds
       system cpu time     0.00 seconds
       memory              2223.75k
       OS Memory           33324.00k
       Timestamp           10/26/2022 02:16:19 PM
       Step Count                        33  Switch Count  1
       Page Faults                       0
       Page Reclaims                     355
       Page Swaps                        0
       Voluntary Context Switches        108
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           400
       
 WARNING: There are insufficient nonmissing observations to create a boxplot.
 WARNING: The BoxPlot statement named 'VBOX' will not be drawn because one or more of the required arguments were not supplied.
 WARNING: A blank graph is produced. For possible causes, see the graphics template language documentation.
 NOTE: There were 192 observations read from the data set WORK.SECOND.
 
 289        
 290        /* Homogeneity of effects */
 291        Proc sgscatter data=second;
 292          plot studentresid*(pred Trmt Week);
 293        Run;
 
 NOTE: PROCEDURE SGSCATTER used (Total process time):
       real time           0.38 seconds
       user cpu time       0.04 seconds
       system cpu time     0.01 seconds
       memory              2156.87k
       OS Memory           33624.00k
       Timestamp           10/26/2022 02:16:19 PM
       Step Count                        34  Switch Count  1
       Page Faults                       0
       Page Reclaims                     399
       Page Swaps                        0
       Voluntary Context Switches        163
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           320
       
 WARNING: X=PRED is invalid. The option expects that the column not contain all missing values.
 WARNING: Y=STUDENTRESID is invalid. The option expects that the column not contain all missing values.
 WARNING: Y=STUDENTRESID is invalid. The option expects that the column not contain all missing values.
 NOTE: There were 192 observations read from the data set WORK.SECOND.
 
 294        
 295        /* Q-Q plot and Shapiro-Wilk for normal distribution */
 296        Proc univariate data=second normal plot;
 297          var studentresid;
 298          histogram studentresid / normal kernel;
 299        Run;
 
 WARNING: Insufficient number of nonmissing observations to create a histogram for studentresid.
 NOTE: PROCEDURE UNIVARIATE used (Total process time):
       real time           0.01 seconds
       user cpu time       0.01 seconds
       system cpu time     0.00 seconds
       memory              810.00k
       OS Memory           32420.00k
       Timestamp           10/26/2022 02:16:19 PM
       Step Count                        35  Switch Count  0
       Page Faults                       0
       Page Reclaims                     179
       Page Swaps                        0
       Voluntary Context Switches        1
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0
       
 
 300        
 301        OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 311

I apologize for the long post but also for reference here are snippets of results and log output from gamma (using the exact same code as above):

 Proc glimmix data=WholeP;
 272        class Block Week Trmt;
 273        model Alive = Trmt|Week / dist =gamma link=log ddfm=kr;
 274        random Week / subject= Block type=cs residual;
 275          output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
 276          lsmeans Trmt*Week / ilink slicediff=Week adjust=tukey lines;
 277          ods output lsmeans=lsdata;
 278        run;
 
 NOTE: With DDFM=SATTERTHWAITE or DDFM=KENWARDROGER, unadjusted p-values in tests are based on the degrees of freedom specific to 
       that comparison. P-values that are adjusted for multiplicity, however, are by default based on the denominator degrees of 
       freedom for the Type III test of the fixed effect. If you specify the ADJDFE=ROW option in the LSMEANS or LSMESTIMATE 
       statement, the adjusted p-values take into account the row-wise degrees of freedom.
 
 
 NOTE: Some observations are not used in the analysis because of: zero or negative response (n=89).
 NOTE: Convergence criterion (PCONV=1.11022E-8) satisfied.
 ERROR: LINES display is not produced for the Week*Trmt effect because p-values are not available for all least-squares means 
        differences.
 NOTE: The data set WORK.LSDATA has 24 observations and 10 variables.
 NOTE: The data set WORK.SECOND has 192 observations and 9 variables.
 NOTE: PROCEDURE GLIMMIX used (Total process time):
       real time           0.26 seconds
       user cpu time       0.26 seconds
       system cpu time     0.00 seconds
       memory              5178.09k
       OS Memory           30204.00k
       Timestamp           10/26/2022 02:31:34 PM
       Step Count                        53  Switch Count  11
       Page Faults                       0
       Page Reclaims                     1548
       Page Swaps                        0
       Voluntary Context Switches        79
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           1384

SteveDenham · Posted 10-26-2022 02:21 PM

Actually, while the answer looks like something good, it has a problem with convergence and a nonpositive definite G matrix for these dates. The methods I tried had the same issue. So...

I fit the week/day variable with a spline that differed by trmt. The following is what worked remarkably well and converged rapidly:

proc glimmix data=WholeP;
   class trmt block;
   effect spl = spline(week);
   model Alive = trmt spl*trmt/ s noint dist=poisson;
   estimate 'trmt 1, week=7' trmt 1    trmt*spl [1,1 1] /ilink;
   estimate 'trmt 2, week=7' trmt 0  1 trmt*spl [1,2 1]/ilink;
   estimate 'trmt 3, week=7' trmt 0  0 1 trmt*spl [1,3 1]/ilink;
   estimate 'Diff trmt1 v trmt2 at week=7 ' trmt 1 -1 trmt*spl [1,1 1] [-1,2 1];
   estimate 'Diff trmt1 v trmt3 at week=7 ' trmt 1 0 -1 trmt*spl [1,1 1] [-1,3 1];
run;

No sense in applying the ilink to the differences, unless you are interested in "fold change", so they show as non-est in the output:

Type III Tests of Fixed Effects
Effect	Num DF	Den DF	F Value	Pr > F
Trmt	     3	   171	9.97	<.0001
spl*Trmt	17	   171	14.94	<.0001

So, this worked. I tried using an integration method rather than the default pseudo-likelihood method used here, but the results were identical.

EDIT: Note that the code is correct, but the labels are not - they all should read week=0.

SteveDenham

BUGstats · Posted 10-26-2022 02:56 PM

Thank you again @SteveDenham. This is a great solution.

I have clarification questions. The estimate statements written, how do you change them to evaluate the difference at different time periods. For example at week 49 (the last time frame in my data)? I believe it has something to do with changing the values in the brackets but I don't know how to know what to change them to.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away