Hello SAS users:
I am new to this forum and have a couple related questions about the confidence intervals and predicted values produced by Proc GAM.
I am trying to derive these values for a time series analysis of yearly sea surface temperature (SST) data.
Question 1:
The confidence limits for the predicted SST in any given year should be given by SST + uclm_year and SST + lclm_year, where the uclm and lclm variables are the SAS output variables produced in the Output option of Proc GAM, and SST is the long-term mean of the original time series. However, when I plot these confidence intervals they are not symmetric about the predicted fit, and indeed sometimes become very close to the fitted trend values. These seems to be an error somewhere, but I don’t know how to fix it.
Question 2: As part of the output, SAS gives P_SST, which is the predicted (fitted) annual SST for each year in the time series. I understand that P_SST should be equal to the sum of the long-term mean SST and the annual “year effect” given by P_year in the Output from Proc GAM. However, when I do the addition manually to check the SAS output, the summation is not equal to P_SST produced by Proc GAM. Why does it differ?
The code and data I have used are below. I am using SAS version 9.4.
proc gam data = analysis;
model sst = loess(year, DF = 8);
output out = fit pred uclm lclm adiag std resid;
run;
year sst
1870 9.8
1871 10.4
1872 10
1873 10
1874 9.8
1875 10.4
1876 9.6
1877 10.1
1878 10.5
1879 10
1880 10.7
1881 9.4
1882 8.7
1883 9.3
1884 9.2
1885 9.7
1886 9.4
1887 10.2
1888 10.6
1889 9.7
1890 9.6
1891 9.9
1892 9.9
1893 10.4
1894 9.4
1895 10.1
1896 9.5
1897 9.7
1898 9.7
1899 10.6
1900 10.4
1901 9.7
1902 10.3
1903 9.6
1904 9.7
1905 9.5
1906 9.4
1907 9.6
1908 9.3
1909 9.8
1910 9.6
1911 9.3
1912 9.1
1913 9.5
1914 9
1915 10.1
1916 10
1917 10
1918 8.9
1919 9
1920 8.8
1921 8.8
1922 9
1923 9.3
1924 9.5
1925 8.6
1926 9.5
1927 10.5
1928 10.3
1929 10.3
1930 9.4
1931 10.2
1932 10.1
1933 10.1
1934 9.4
1935 9.7
1936 10
1937 9.3
1938 9.2
1939 10.1
1940 9.9
1941 10.1
1942 9.7
1943 9.7
1944 9.5
1945 9.7
1946 9.5
1947 9.7
1948 10.2
1949 9.4
1950 10.3
1951 10
1952 10.2
1953 10.3
1954 10.1
1955 9.5
1956 9.9
1957 10.4
1958 11
1959 9.9
1960 10.6
1961 9.6
1962 9.7
1963 9
1964 9.7
1965 10.3
1966 10.4
1967 9.6
1968 10.2
1969 8.9
1970 8.7
1971 9.2
1972 8.6
1973 8.9
1974 9.6
1975 9.1
1976 8.7
1977 9.8
1978 9.6
1979 9.4
1980 10.2
1981 9.3
1982 9.1
1983 7.8
1984 8.8
1985 9.6
1986 9.5
1987 10.3
1988 9.6
1989 8.8
1990 9.5
1991 9.8
1992 9
1993 9.2
1994 9.5
1995 9.5
1996 9.5
1997 10.5
1998 10.7
1999 10.2
2000 10
2001 10.4
2002 9.8
2003 11.2
2004 10.5
2005 10.3
2006 10.3
2007 10.7
2008 10.7
2009 10.4
2010 11.6
2011 10.3
2012 11.6
2013 9.8
2014 10.7
2015 9.6
2016 10.9
2017 10.2
2018 9.2
2019 11.3
Yes. The deviance is not reported in the output because it is not used for model fitting, in contrast to PROC GAM.
If you need the deviance, you can use the OUTPUT statement to request predicted means then compute deviance by using the formula provided in the GENMOD doc
proc gampl data=test seed=123;
model y=spline(x1) spline(x2);
output out=myout pred=mu;
id y;
run;
/* now use data step to compute deviance using myout */
For the deviance of the NULL model, you can use PROC GENMOD to run an intercept-only model. The Deviance statistic is in the “ModelFit” table:
proc genmod data=test;
model y=;
run;
Look at
The confidence intervals for LCLM and UCLM are not mathematically the same as the plotted intervals.
Thank you PGStats.
This is helpful, but I have difficulties to see how this solves my problem. That is, I am still not able to get the confidence intervals plotted correctly.
Should not P_y = overall time series mean + P_x? For the notation of my variables, this would be
P_sst = overall mean + P_year.
In my case, this does not work because when I manually do the addition of mean + P_year, it does not equal P_sst, as seen in the top panel of the attached figure. I thought it should, and it has worked like this for me in the past with earlier versions of SAS. I think this issue, if resolved, could help solve the other question in my original query.
Also, the P_y should lie in the middle of the upper and lower 95% CL, as given by mean + uclm_x and mean + lclm_x. But it does not, as seen in the lower panel of the attached figure file.
Am I misinterpreting the meaning of P_y and P_x, and if so, how should they be interpreted?
A few comments:
1. PROC GAM is an old procedure and its main purpose is understanding how a TRANSFORMATION of the explanatory variable can be used to predict the response. If you use
proc gam data = analysis plots=component(clm);
...
you will see the transformation for the YEAR variable.
2. The STD, UCLM, and LCLM options on the OUTPUT statement are NOT for the prediction limits for the response. They are associated with the transformation of the explanatory variable.
3. The GAM doc has an example that explains the difference between GAM and LOESS in terms of smoothing data.
4. IMHO, if you want prediction limits, you should use PROC LOESS or PROC GAMPL, which is a newer and more efficient version of GAM that is intended for predictive models. The following example creates a smoother with prediction limits for your data:
proc gampl data = analysis plots=components;
model sst = spline(year / DF = 8);
output out = fit pred upper lower;
run;
data All;
merge analysis fit;
run;
proc sgplot data=All noautolegend;
band x=year lower=Lower upper=Upper;
scatter x=year y=sst;
series x=year y=Pred;
run;
Dear Rick_SAS,
Many thanks for this reply – it was really helpful and I have done several analyses in the past few weeks using Proc GAMPL.
However I also need an estimate of overall explained deviance, and cannot find it in the Proc GAMPL output (Proc GAM provides it).
I also need an estimate of the null model deviance.
How can I find or derive these deviance estimates from the output provided by the procedure?
Thanks again for any suggestions.
Best regards
bmac1
Yes. The deviance is not reported in the output because it is not used for model fitting, in contrast to PROC GAM.
If you need the deviance, you can use the OUTPUT statement to request predicted means then compute deviance by using the formula provided in the GENMOD doc
proc gampl data=test seed=123;
model y=spline(x1) spline(x2);
output out=myout pred=mu;
id y;
run;
/* now use data step to compute deviance using myout */
For the deviance of the NULL model, you can use PROC GENMOD to run an intercept-only model. The Deviance statistic is in the “ModelFit” table:
proc genmod data=test;
model y=;
run;
HI Rick_SAS,
Thanks again. This helped a lot! I ran some a few test examples today and have been able to get the deviances and AIC's for fitted models and the null models using your suggestions.
Best regards
bmac1
If your concern about deviance values from GAMPL is because you want a statistic providing an overall assessment of the fit of the model, then you can get an R-square value using the RsquareV macro. See Example 2 in the Results tab of the macro documentation which shows using the macro to assess the fit of a spline model fit in PROC GAMPL.
Hi,
Thank you very much. Yes, this is what I had in mind and I was not aware of this macro option. It looks like a good way to get an overall fitness of the models. I will probably however use the deviances and AIC outputs that I have been able to get thanks to the post by Rick_SAS. I can see that the principles are somewhat similar so I have made a note of the documentation in case I decide to use this method.
Best regards
bmac1
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.