Dear SAS experts
I am a non-statistician who is seeking some help with getting basic syntax running.
I am trying to run a model using proc mixed, but I appear to be having an issue when using lsmeans.
First, I have created restricted cubic basis functions (4 new variables; spl1-spl4) based on the variable days. Days is a continuous variable that includes days since 1st of January 2020 to 31st of December 2021.
For each value of days, there is a measure of Hba1c. Hba1c is a clinical measure of blood sugar levels during the last three months. It varies in values from 7-173.
To begin with (just to get the basic syntax working), I want to model the splines of days and the variable sex and estimate the general between-sex difference in mean hba1c. I am using a mixed-effect model to take into account that there are multiple measurements of hba1c for each individual (the amount varies between subjects).
I have written the following syntax in proc mixed:
proc mixed data=dataset_name;
class pseudo_id sex;
model hba1c=spl: sex sex*spl1 sex*spl2 sex*spl3 sex*spl4 / solution;
random intercept days / subject = pseudo_id type=un;
lsmeans sex / pdiff;
run;
The log output is:
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='7. Hba1c_analysis';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='T:\ouh\afd\covid_forskningsartikel\Programfiler og output\7. Hba1c_analysis.sas';
9 %LET _SASPROGRAMFILEHOST='L111353';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HtmlBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HtmlBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
SYMBOLGEN: Macro variable SASWORKLOCATION resolves to "E:\Work\_TD2300848_SRVESBAPPSAS34V_\Prc2/"
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 proc mixed data=hba1c_sdco6;
27 class pseudo_id køn_20;
28 model visres_numerisk=spl: køn_20 køn_20*spl1 køn_20*spl2 køn_20*spl3 køn_20*spl4 / solution outp=pred6r outpm = pred6f;
29 random intercept prvtdato_num / subject = pseudo_id type=un;
30 lsmeans køn_20 / pdiff at means;
31 run;
NOTE: 203 observations are not included because of missing values.
NOTE: Convergence criteria met.
MVA_DSIO.OPEN_CLOSE| _DISARM| STOP| _DISARM| 2023-03-03T11:49:59,790+01:00| _DISARM| WorkspaceServer| _DISARM| | _DISARM| |
_DISARM| | _DISARM| 28151808| _DISARM| 11| _DISARM| 18| _DISARM| 194607184| _DISARM| 3544470331| _DISARM| | _DISARM| | _DISARM| |
_DISARM| | _DISARM| | _DISARM| | _ENDDISARM
NOTE: The data set WORK.PRED6R has 22721 observations and 26 variables.
NOTE: Compressing data set WORK.PRED6R decreased size by 27.40 percent.
Compressed is 53 pages; un-compressed would require 73 pages.
MVA_DSIO.OPEN_CLOSE| _DISARM| STOP| _DISARM| 2023-03-03T11:49:59,790+01:00| _DISARM| WorkspaceServer| _DISARM| | _DISARM| |
_DISARM| | _DISARM| 28151808| _DISARM| 11| _DISARM| 18| _DISARM| 194936189| _DISARM| 3544799946| _DISARM| | _DISARM| | _DISARM| |
_DISARM| | _DISARM| | _DISARM| | _ENDDISARM
NOTE: The data set WORK.PRED6F has 22721 observations and 26 variables.
NOTE: Compressing data set WORK.PRED6F decreased size by 27.40 percent.
Compressed is 53 pages; un-compressed would require 73 pages.
MVA_DSIO.OPEN_CLOSE| _DISARM| STOP| _DISARM| 2023-03-03T11:49:59,790+01:00| _DISARM| WorkspaceServer| _DISARM| | _DISARM| |
_DISARM| | _DISARM| 28151808| _DISARM| 11| _DISARM| 18| _DISARM| 195132996| _DISARM| 3544997465| _DISARM| | _DISARM| | _DISARM| |
_DISARM| | _DISARM| | _DISARM| | _ENDDISARM
PROCEDURE| _DISARM| STOP| _DISARM| 2023-03-03T11:49:59,790+01:00| _DISARM| WorkspaceServer| _DISARM| | _DISARM| | _DISARM|
| _DISARM| 28151808| _DISARM| 11| _DISARM| 18| _DISARM| 195398608| _DISARM| 3544999001| _DISARM| | _DISARM| | _DISARM| | _DISARM| |
_DISARM| | _DISARM| | _ENDDISARM
NOTE: PROCEDURE MIXED used (Total process time):
real time 55.48 seconds
cpu time 55.43 seconds
2 The SAS System 10:41 Friday, March 3, 2023
32
33 GOPTIONS NOACCESSIBLE;
34 %LET _CLIENTTASKLABEL=;
35 %LET _CLIENTPROCESSFLOWNAME=;
36 %LET _CLIENTPROJECTPATH=;
37 %LET _CLIENTPROJECTPATHHOST=;
38 %LET _CLIENTPROJECTNAME=;
39 %LET _SASPROGRAMFILE=;
40 %LET _SASPROGRAMFILEHOST=;
41
42 ;*';*";*/;quit;run;
43 ODS _ALL_ CLOSE;
44
45
46 QUIT; RUN;
47
The results are:
Dimensions |
|
Covariance Parameters |
4 |
Columns in X |
15 |
Columns in Z per Subject |
2 |
Subjects |
3434 |
Max Obs per Subject |
27 |
Number of Observations |
|
Number of Observations Read |
22721 |
Number of Observations Used |
22518 |
Number of Observations Not Used |
203 |
Iteration History |
|||
Iteration |
Evaluations |
-2 Res Log Like |
Criterion |
0 |
1 |
178669.59663948 |
|
1 |
3 |
161030.12170074 |
0.00065048 |
2 |
1 |
160985.21896392 |
0.00005265 |
3 |
1 |
160981.89111021 |
0.00000044 |
4 |
1 |
160981.86465805 |
0.00000000 |
Convergence criteria met. |
Covariance Parameter Estimates |
||
Cov Parm |
Subject |
Estimate |
UN(1,1) |
pseudo_id |
137.38 |
UN(2,1) |
pseudo_id |
-0.07511 |
UN(2,2) |
pseudo_id |
0.000190 |
Residual |
|
42.5146 |
Fit Statistics |
|
-2 Res Log Likelihood |
160981.9 |
AIC (Smaller is Better) |
160989.9 |
AICC (Smaller is Better) |
160989.9 |
BIC (Smaller is Better) |
161014.4 |
Null Model Likelihood Ratio Test |
||
DF |
Chi-Square |
Pr > ChiSq |
3 |
17687.73 |
<.0001 |
Solution for Fixed Effects |
||||||
Effect |
køn_20 |
Estimate |
Standard |
DF |
t Value |
Pr > |t| |
Intercept |
|
61.0501 |
0.3376 |
3432 |
180.82 |
<.0001 |
spl1 |
|
-0.01390 |
0.001344 |
16E3 |
-10.34 |
<.0001 |
spl2 |
|
0.000203 |
0.000021 |
16E3 |
9.66 |
<.0001 |
spl3 |
|
-0.00050 |
0.000054 |
16E3 |
-9.33 |
<.0001 |
spl4 |
|
0.000442 |
0.000050 |
16E3 |
8.78 |
<.0001 |
Sex |
Female |
0.2017 |
0.5318 |
16E3 |
0.38 |
0.7045 |
Sex |
Male |
0 |
. |
. |
. |
. |
spl1*Sex |
Female |
0.01093 |
0.002127 |
16E3 |
5.14 |
<.0001 |
spl1*Sex |
Male |
0 |
. |
. |
. |
. |
spl2*Sex |
Female |
-0.00019 |
0.000033 |
16E3 |
-5.56 |
<.0001 |
spl2*Sex |
Male |
0 |
. |
. |
. |
. |
spl3*Sex |
Female |
0.000450 |
0.000085 |
16E3 |
5.26 |
<.0001 |
spl3*Sex |
Male |
0 |
. |
. |
. |
. |
spl4*Sex |
Female |
-0.00039 |
0.000080 |
16E3 |
-4.87 |
<.0001 |
spl4*Sex |
Male |
0 |
. |
. |
. |
. |
Type 3 Tests of Fixed Effects |
||||
Effect |
Num DF |
Den DF |
F Value |
Pr > F |
spl1 |
1 |
16E3 |
62.99 |
<.0001 |
spl2 |
1 |
16E3 |
93.40 |
<.0001 |
spl3 |
1 |
16E3 |
87.02 |
<.0001 |
spl4 |
1 |
16E3 |
77.01 |
<.0001 |
Sex |
1 |
16E3 |
0.14 |
0.7045 |
spl1*Sex |
1 |
16E3 |
26.40 |
<.0001 |
spl2*Sex |
1 |
16E3 |
30.88 |
<.0001 |
spl3*Sex |
1 |
16E3 |
27.70 |
<.0001 |
spl4*Sex |
1 |
16E3 |
23.68 |
<.0001 |
Least Squares Means |
|||||||||||
Effect |
Sex |
spl1 |
spl2 |
spl3 |
spl4 |
prvtdato_num |
Estimate |
Standard |
DF |
t Value |
Pr > |t| |
Sex |
Female |
407.07 |
105217 |
52728 |
19728 |
407.07 |
60.1981 |
0.2895 |
16E3 |
207.93 |
<.0001 |
Sex |
Male |
407.07 |
105217 |
52728 |
19728 |
407.07 |
Non-est |
. |
. |
. |
. |
Differences of Least Squares Means |
||||||||||||
Effect |
Sex |
_Sex |
spl1 |
spl2 |
spl3 |
spl4 |
prvtdato_num |
Estimate |
Standard |
DF |
t Value |
Pr > |t| |
Sex |
Female |
Male |
407.07 |
105217 |
52728 |
19728 |
407.07 |
Non-est |
. |
. |
. |
. |
As you can see there is no estimate for males, and thus the differences between the sexes cannot be estimated.
Lsmeans produces an output for males when I only include the interaction: Sex*spl1 (but not for the other interactions).
Can anyone help me solve this issue?
Thank you
Finding the cause of the non-estimable lsmeans will be difficult without the data. You can add the /E option to the LSMEANS statement to see the linear combination used to estimate the problematic lsmean.
You can let GLIMMIX do the dirty work of forming the spline basis using the EFFECT statement as in this example
Fitting random coefficients on top of the spline fit might be more than you can do with this data. If you do want those random coefficients for each subject, do you want those on the spline effect rather than a linear effect of days? That might be more than the data can handle, too. You could try fitting just the random intercept and see if that does better for you.
Dear Statsman
I much appreciate your reply.
1. "You can let GLIMMIX do the dirty work of forming the spline basis using the EFFECT statement as in this example"
I am not sure exactly what you mean by this. The EFFECT statement is not supported by Proc Mixed. Moreover, would it make a difference if it was? On another note I am actually using the EFFECT statement in proc glmselect to create the spline basis:
proc glmselect data=dataset_withoutsplines outdesign(addinputvars fullmodel)=dataset_withsplines;
effect spl = spline(hba1c / naturalcubic basis=tpf(noint) knotmethod=percentiles(5)); model hba1c = spl / selection=none;
quit;
2. a) "Fitting random coefficients on top of the spline fit might be more than you can do with this data." and
b) "You could try fitting just the random intercept and see if that does better for you."
Right, I understand that the model might be too complex given the data. Regarding 2b), even when I completely remove the random statement I get the same outcome of "Non-est". This suggest to be that there is an issue with the interactions of sex and spl2-spl4. As I mentioned in my original post I don't get "Non-est" when I only include the interaction spl1*sex. However, I do not know if this is an appropriate model.
3. "If you do want those random coefficients for each subject, do you want those on the spline effect rather than a linear effect of days?"
Are you suggesting that I could include all the spline variables here?
random intercept SPLINE VARIABLES / subject = pseudo_id type=un;
I have not considered this. Could one argue for and against doing such?
Best regards
Often, a Least Squares Mean is non-estimable because you have cells in your experiment where there is no data. For example, if it happens that the interaction spl2*sex has no data in one of the cells, then you would get the Non-Est. And as said by @StatsMan , the model you have chosen to fit "might be more than you can do with this data."
Dear PaigeMiller
Thank you very much for your reply.
"Often, a Least Squares Mean is non-estimable because you have cells in your experiment where there is no data."
Missing data in general is not an issue in this dataset, but their might be a specific issue relating to the interaction terms. Is there some simple way to check for this? As far as I see it, there are four spline variables and the first spline variable corresponds to the variable Days - from which the spline variables are derived - in a 1:1 fashion. As mentioned, if I create a model that includes only the interaction spl1*sex (and thus I do not include spl2*sex, spl3*sex and spl4*sex) I don't get "Non-est". But only if I remove all three. Removing just one of them does not change the outcome.
"And as said by @StatsMan , the model you have chosen to fit "might be more than you can do with this data."
I realize that there might be an issue here. When I completely remove the random statement I still get the "Non-est" outcome, which suggest to me that there is an issue with the abovementioned interaction terms.
Thank you
Best regards
@mgrasmussen wrote:
Dear PaigeMiller
Thank you very much for your reply.
"Often, a Least Squares Mean is non-estimable because you have cells in your experiment where there is no data."
Missing data in general is not an issue in this dataset, but their might be a specific issue relating to the interaction terms. Is there some simple way to check for this? As far as I see it, there are four spline variables and the first spline variable corresponds to the variable Days - from which the spline variables are derived - in a 1:1 fashion. As mentioned, if I create a model that includes only the interaction spl1*sex (and thus I do not include spl2*sex, spl3*sex and spl4*sex) I don't get "Non-est". But only if I remove all three. Removing just one of them does not change the outcome.
Sure sounds like an empty cell in your data — which is not the same as "missing" data (although it can be caused by missing data as well). You can check this via running PROC FREQ to investigate all possible interactions.
proc freq data=whatever;
tables sex*(spl1 spl2 spl3 spl4);
run;
Even this may not find the missing cell, it may be that you have to look at higher order interactions by using PROC FREQ to find the cause.
Right, it is an issue about whether or not both males and females are represented at each level of spl1-spl4?
Thanks for the suggested syntax. I ran it and there are many places with empty cells/cell with 0 observations and this is true for all four spline variables. This is also the case for spl1 and Sex. As mentioned when I include the interaction between the two in the model I don't get the "Non-est". This particular model appears to work.
As pointed out by @PaigeMiller , missing data cell is the most common cause of a non-estimable lsmeans. You either need to take out the interaction term(s) with missing data cell, or maybe combine your data categories to get rid of missing data cells.
Thanks,
Jill
Dear Jill
Thank you for the feedback.
I will bring all of the feedback I have gotten, including yours most recently, back to my group and we will discuss the analyses.
Best regards
Martin
In order to understand what you are attempting, I need to know why you are fitting five different spline terms to the same data and then placing all of them in the same analysis (spl spl1 spl2 spl3 spl4). Do the different splines reflect different hypotheses about the time by hba1c relationship? Is there any reason to believe that time as interpreted by the spline function has a fixed effect on the outcome, and you want to see if it differs by sex? As an example for that, look at this in the PROC GLIMMIX documentation: https://documentation.sas.com/doc/en/pgmsascdc/v_036/statug/statug_glimmix_examples09.htm
Or is it possible that there is a single factor response in time, with the variance being what needs a spline fit. As an example look at this in the PROC GLIMMIX documentation: https://documentation.sas.com/doc/en/pgmsascdc/v_036/statug/statug_glimmix_examples20.htm
SteveDenham
Dear Steve
Thank you very much for your reply.
To be frank I do not have a straight answer to your question. I am a non-statistician who tried to build a model (only with brief statistical consult), but I realize now that I might not fully understand the working of the splines.
In simple terms what I want to do is to visually illustrate and statistically test for the differences in slopes in hba1c for men and women. More specifically, I want to investigate whether there is a difference in mean hba1c over time (time on the x-axis) between men and women. However, I also want to be able to test means differences at specific time points.
My understanding was that the four spline variables all-in-all represented time and thus that I had to model an interaction term for each splines variable. I sort of "transferred" the logic I would apply to a normal OLS regression to this new model. In OLS - here assumping a linear relationship between days/time and hba1c - I would write the following to test for interaction of sex in the relationship between time and hba1c:
proc reg data=datasetname;
class sex;
model hba1c=sex days sex*days;
run;
For the time being disregard the fact that I am actually using a mixed effect model. The point here is merely an attempt to try to explain the background for how I am applying the splines in the analysis. However incorrectly.
I realize that I may not be using the correct statistical jargon.
I would be happy to receive any feedback and suggestion on moving forward.
I mentioned earlier, when I only include the first spline variable I don't experience an issue with computing LSmeans. But I am not sure about the significance of this.
Thank you
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.