Want to evaluate difference between survival of larvae on different feed types. There are 3 feed types, 8 blocks and survival (alive) was measured once a week for 8 weeks (Repeated measures). When I originally ran my data I failed to meet the assumptions for normality.
Data WholeP;
Infile datalines dlm="09"x;
input Week Trmt Block Alive;
datalines;
0 1 1 10
0 1 2 10
...;
run;
Proc glimmix data=WholeP;
class Block Week Trmt;
model Alive = Trmt|Week / ddfm=kr;
random Week / subject= Block type=cs residual;
lsmeans Trmt*Week / slicediff=Week adjust=tukey lines;
output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
ods output lsmeans=lsdata;
run;
I tried to log transform my data but NOTHING has increased the Shapiro value and others do not converge.
Proc glimmix data=WholeP;
class Block Week Trmt;
model Alive = Trmt|Week / dist =gamma link=log ddfm=kr;
*/- did not got results with dist = negbn;
*/- did not got results with dist = binomial;
*/- did not got results with dist = poisson;
*/- did not got results with dist = beta;
*/- got results with dist = gamma but LSMEANS lines not fully complete;
random Week / subject= Block type=cs residual;
output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
lsmeans Trmt*Week / ilink slicediff=Week adjust=tukey lines;
ods output lsmeans=lsdata;
run;
I would like to see if there is a statistical significance between feed types at week 8 for example. I am out of ideas on how to transform the data to meet the assumptions for normality. Is there something in my code I can change or try?
Actually, while the answer looks like something good, it has a problem with convergence and a nonpositive definite G matrix for these dates. The methods I tried had the same issue. So...
I fit the week/day variable with a spline that differed by trmt. The following is what worked remarkably well and converged rapidly:
proc glimmix data=WholeP;
class trmt block;
effect spl = spline(week);
model Alive = trmt spl*trmt/ s noint dist=poisson;
estimate 'trmt 1, week=7' trmt 1 trmt*spl [1,1 1] /ilink;
estimate 'trmt 2, week=7' trmt 0 1 trmt*spl [1,2 1]/ilink;
estimate 'trmt 3, week=7' trmt 0 0 1 trmt*spl [1,3 1]/ilink;
estimate 'Diff trmt1 v trmt2 at week=7 ' trmt 1 -1 trmt*spl [1,1 1] [-1,2 1];
estimate 'Diff trmt1 v trmt3 at week=7 ' trmt 1 0 -1 trmt*spl [1,1 1] [-1,3 1];
run;
No sense in applying the ilink to the differences, unless you are interested in "fold change", so they show as non-est in the output:
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Trmt 3 171 9.97 <.0001
spl*Trmt 17 171 14.94 <.0001
So, this worked. I tried using an integration method rather than the default pseudo-likelihood method used here, but the results were identical.
EDIT: Note that the code is correct, but the labels are not - they all should read week=0.
SteveDenham
Data does not have to be normally distributed to fit a linear regression model
Please read: https://blogs.sas.com/content/iml/2018/08/27/on-the-assumptions-and-misconceptions-of-linear-regress...
Are the residuals normally distributed?
Hi PaigeMiller,
Sorry for confusion when I say something has failed normality I am always referring to the residuals.
It looks like the ALIVE variable is the number of larvae that survived. This is a count variable, not a continuous variable. The rejection of the normality assumption is because your response is discrete, not continuous.
You might want to Google "how to model count data." Your model assumes that the response variable is continuous, but it is not. There are special models you can use for count data. You might want to start by reading the example in the GLIMMIX documentation for binomial data.
First, I want to support what @PaigeMiller said about residuals being normally distributed. For the non-Gaussian distributions you were examining, the check on normality in the residuals isn't as necessary, but it does let you know if the variance components are normally distributed.
Some other things to note: It appears the response variable is a count, which would imply that a Poisson or negative binomial distribution is a likely candidate, unless the counts get quite large, where the gamma treats the discrete values as continuous. No need to check binomial or beta, as they only have support on the zero to one interval.
Looking at your code, I don't see an NLOPTIONS statement. Many times for Poisson or negative binomial, you are going to need more than the default 20 iterations to get convergence. We routinely use NLOPTIONS=5000. The rest of the code looks like it should give you comparisons between the levels of Trmt at each Week, so I am a bit confused that the gamma distribution approach did not give you all of the lsmeans. Check the output for any messages that may be printed, as well as for WARNINGs or NOTEs in the log.
One more thing to consider is a generalized estimating equation approach using PROC GEE. Your current code is fitting a marginal model in GLIMMIX, so switching to PROC GEE (or GENMOD for a wider selection of possible distributions) may solve your problems. Note that the LSMEANS statement in GEE doesn't support the slicediff= option. LSMESTIMATE or SLICE statements should give you the specific within week comparisons you want.
SteveDenham
Hi @SteveDenham ,
GREAT answer. As I clarified with @PaigeMiller I meant that the residuals failed normality. I too have looked at poisson transformation but it does not converge. I have provided my entire code (Note the full data can be found in the attached text file) and the entire log output below. I haven't been able to evaluate any LSmeans with transformed data. Additionally, I am unfamiliar with where in the code to add the NLOPTIONS statement. I did some quick reading and one source said GLIMMIX would ignore a NLOPTIONS statement?
*/Note the variable Week should really say Day;
Data WholeP;
Infile datalines dlm="09"x;
input Week Trmt Block Alive;
datalines;
0 1 1 10
0 1 2 10
0 1 3 10
...
;
Run;
Proc glimmix data=WholeP;
class Block Week Trmt;
model Alive = Trmt|Week / dist =poisson link=log ddfm=kr;
random Week / subject= Block type=cs residual;
output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
lsmeans Trmt*Week / ilink slicediff=Trmt adjust=tukey lines;
ods output lsmeans=lsdata;
run;
*/ need to check model assumptions before moving forward;
/* Linearity of fixed effects - both as a scatter and a boxplot */
Proc sgplot data=second;
vbox smresid / group=Trmt datalabel;
Run;
Proc sgplot data=second;
vbox smresid / group=Week datalabel;
Run;
/* Homogeneity of effects */
Proc sgscatter data=second;
plot studentresid*(pred Trmt Week);
Run;
/* Q-Q plot and Shapiro-Wilk for normal distribution */
Proc univariate data=second normal plot;
var studentresid;
histogram studentresid / normal kernel;
Run;
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
NOTE: ODS statements in the SAS Studio environment may disable some output features.
69
70 */Note the variable Week should really say Day;
71 Data WholeP;
72 Infile datalines dlm="09"x;
73 input Week Trmt Block Alive;
74 datalines;
NOTE: The data set WORK.WHOLEP has 192 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 667.50k
OS Memory 24228.00k
Timestamp 10/26/2022 02:16:17 PM
Step Count 30 Switch Count 2
Page Faults 0
Page Reclaims 128
Page Swaps 0
Voluntary Context Switches 10
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 264
267 ;
268 Run;
269
270
271 Proc glimmix data=WholeP;
272 class Block Week Trmt;
273 model Alive = Trmt|Week / dist =poisson link=log ddfm=kr;
274 random Week / subject= Block type=cs residual;
275 output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
276 lsmeans Trmt*Week / ilink slicediff=Trmt adjust=tukey lines;
277 ods output lsmeans=lsdata;
278 run;
NOTE: With DDFM=SATTERTHWAITE or DDFM=KENWARDROGER, unadjusted p-values in tests are based on the degrees of freedom specific to
that comparison. P-values that are adjusted for multiplicity, however, are by default based on the denominator degrees of
freedom for the Type III test of the fixed effect. If you specify the ADJDFE=ROW option in the LSMEANS or LSMESTIMATE
statement, the adjusted p-values take into account the row-wise degrees of freedom.
NOTE: Did not converge.
WARNING: Output 'lsmeans' was not created. Make sure that the output object name, label, or path is spelled correctly. Also,
verify that the appropriate procedure options are used to produce the requested output object. For example, verify that
the NOPRINT option is not used.
NOTE: The data set WORK.SECOND has 192 observations and 9 variables.
NOTE: PROCEDURE GLIMMIX used (Total process time):
real time 0.25 seconds
user cpu time 0.26 seconds
system cpu time 0.00 seconds
memory 4451.57k
OS Memory 27896.00k
Timestamp 10/26/2022 02:16:17 PM
Step Count 31 Switch Count 9
Page Faults 0
Page Reclaims 2426
Page Swaps 0
Voluntary Context Switches 58
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 1088
279
280 */ need to check model assumptions before moving forward;
281 /* Linearity of fixed effects - both as a scatter and a boxplot */
282 Proc sgplot data=second;
283 vbox smresid / group=Trmt datalabel;
284 Run;
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 1.68 seconds
user cpu time 0.05 seconds
system cpu time 0.01 seconds
memory 8125.96k
OS Memory 33324.00k
Timestamp 10/26/2022 02:16:19 PM
Step Count 32 Switch Count 1
Page Faults 0
Page Reclaims 2703
Page Swaps 0
Voluntary Context Switches 267
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 664
WARNING: There are insufficient nonmissing observations to create a boxplot.
WARNING: The BoxPlot statement named 'VBOX' will not be drawn because one or more of the required arguments were not supplied.
WARNING: A blank graph is produced. For possible causes, see the graphics template language documentation.
NOTE: There were 192 observations read from the data set WORK.SECOND.
285
286 Proc sgplot data=second;
287 vbox smresid / group=Week datalabel;
288 Run;
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.07 seconds
user cpu time 0.04 seconds
system cpu time 0.00 seconds
memory 2223.75k
OS Memory 33324.00k
Timestamp 10/26/2022 02:16:19 PM
Step Count 33 Switch Count 1
Page Faults 0
Page Reclaims 355
Page Swaps 0
Voluntary Context Switches 108
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 400
WARNING: There are insufficient nonmissing observations to create a boxplot.
WARNING: The BoxPlot statement named 'VBOX' will not be drawn because one or more of the required arguments were not supplied.
WARNING: A blank graph is produced. For possible causes, see the graphics template language documentation.
NOTE: There were 192 observations read from the data set WORK.SECOND.
289
290 /* Homogeneity of effects */
291 Proc sgscatter data=second;
292 plot studentresid*(pred Trmt Week);
293 Run;
NOTE: PROCEDURE SGSCATTER used (Total process time):
real time 0.38 seconds
user cpu time 0.04 seconds
system cpu time 0.01 seconds
memory 2156.87k
OS Memory 33624.00k
Timestamp 10/26/2022 02:16:19 PM
Step Count 34 Switch Count 1
Page Faults 0
Page Reclaims 399
Page Swaps 0
Voluntary Context Switches 163
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 320
WARNING: X=PRED is invalid. The option expects that the column not contain all missing values.
WARNING: Y=STUDENTRESID is invalid. The option expects that the column not contain all missing values.
WARNING: Y=STUDENTRESID is invalid. The option expects that the column not contain all missing values.
NOTE: There were 192 observations read from the data set WORK.SECOND.
294
295 /* Q-Q plot and Shapiro-Wilk for normal distribution */
296 Proc univariate data=second normal plot;
297 var studentresid;
298 histogram studentresid / normal kernel;
299 Run;
WARNING: Insufficient number of nonmissing observations to create a histogram for studentresid.
NOTE: PROCEDURE UNIVARIATE used (Total process time):
real time 0.01 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
memory 810.00k
OS Memory 32420.00k
Timestamp 10/26/2022 02:16:19 PM
Step Count 35 Switch Count 0
Page Faults 0
Page Reclaims 179
Page Swaps 0
Voluntary Context Switches 1
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 0
300
301 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
311
I apologize for the long post but also for reference here are snippets of results and log output from gamma (using the exact same code as above):
Proc glimmix data=WholeP;
272 class Block Week Trmt;
273 model Alive = Trmt|Week / dist =gamma link=log ddfm=kr;
274 random Week / subject= Block type=cs residual;
275 output out=second predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;
276 lsmeans Trmt*Week / ilink slicediff=Week adjust=tukey lines;
277 ods output lsmeans=lsdata;
278 run;
NOTE: With DDFM=SATTERTHWAITE or DDFM=KENWARDROGER, unadjusted p-values in tests are based on the degrees of freedom specific to
that comparison. P-values that are adjusted for multiplicity, however, are by default based on the denominator degrees of
freedom for the Type III test of the fixed effect. If you specify the ADJDFE=ROW option in the LSMEANS or LSMESTIMATE
statement, the adjusted p-values take into account the row-wise degrees of freedom.
NOTE: Some observations are not used in the analysis because of: zero or negative response (n=89).
NOTE: Convergence criterion (PCONV=1.11022E-8) satisfied.
ERROR: LINES display is not produced for the Week*Trmt effect because p-values are not available for all least-squares means
differences.
NOTE: The data set WORK.LSDATA has 24 observations and 10 variables.
NOTE: The data set WORK.SECOND has 192 observations and 9 variables.
NOTE: PROCEDURE GLIMMIX used (Total process time):
real time 0.26 seconds
user cpu time 0.26 seconds
system cpu time 0.00 seconds
memory 5178.09k
OS Memory 30204.00k
Timestamp 10/26/2022 02:31:34 PM
Step Count 53 Switch Count 11
Page Faults 0
Page Reclaims 1548
Page Swaps 0
Voluntary Context Switches 79
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 1384
Actually, while the answer looks like something good, it has a problem with convergence and a nonpositive definite G matrix for these dates. The methods I tried had the same issue. So...
I fit the week/day variable with a spline that differed by trmt. The following is what worked remarkably well and converged rapidly:
proc glimmix data=WholeP;
class trmt block;
effect spl = spline(week);
model Alive = trmt spl*trmt/ s noint dist=poisson;
estimate 'trmt 1, week=7' trmt 1 trmt*spl [1,1 1] /ilink;
estimate 'trmt 2, week=7' trmt 0 1 trmt*spl [1,2 1]/ilink;
estimate 'trmt 3, week=7' trmt 0 0 1 trmt*spl [1,3 1]/ilink;
estimate 'Diff trmt1 v trmt2 at week=7 ' trmt 1 -1 trmt*spl [1,1 1] [-1,2 1];
estimate 'Diff trmt1 v trmt3 at week=7 ' trmt 1 0 -1 trmt*spl [1,1 1] [-1,3 1];
run;
No sense in applying the ilink to the differences, unless you are interested in "fold change", so they show as non-est in the output:
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Trmt 3 171 9.97 <.0001
spl*Trmt 17 171 14.94 <.0001
So, this worked. I tried using an integration method rather than the default pseudo-likelihood method used here, but the results were identical.
EDIT: Note that the code is correct, but the labels are not - they all should read week=0.
SteveDenham
Thank you again @SteveDenham. This is a great solution.
I have clarification questions. The estimate statements written, how do you change them to evaluate the difference at different time periods. For example at week 49 (the last time frame in my data)? I believe it has something to do with changing the values in the brackets but I don't know how to know what to change them to.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.