Season
Lapis Lazuli | Level 10

I would like to raise a brief yet sophisticated question that has been troubling me for months: how can I choose the proper degrees of freedom for the spline effects in my generalized additive model using PROC GAM?

As we all know, SAS 9.4 lets the user specify the degrees of freedom for each spline effect with code like this:

model y(event="1") = spline(x1,df=3) spline(x2,df=3) spline(x3,df=3) spline(x4,df=3) spline(x5,df=3) / dist=binomial;

However, after reading the literature on generalized additive models (GAMs), I have come to know that the balance between goodness of fit and the interpretability of the model (i.e., avoidance of overfitting) is extremely important. This balance should be reached by paying attention to the smoothing parameters and the residuals/deviances. These statistics vary with the degrees of freedom of each spline effect. Therefore, the degrees of freedom of each spline effect act like a 'faucet' that controls the three important statistics (smoothing parameters, residuals, and deviances).

So here is my question: in SAS 9.4, (1) with what statistic(s) (e.g., the deviance of the final estimate, residuals, etc.) can we measure the goodness of fit of a GAM? (2) What is the principle for choosing the degrees of freedom for each spline effect? How can we achieve three goals simultaneously: (a) acceptable smoothness of the fitted spline effects, so that we observe trends rather than noise; (b) a goodness of fit that can be assessed quantitatively; and (c) avoidance of overfitting? Thanks!

Accepted Solution
StatDave
SAS Super FREQ
There is no hard and fast answer for this. The choice of degrees of freedom, essentially a tuning parameter, is where the art and science of modeling intersect: the point where goodness of fit turns into overfitting is essentially subjective. If you want an objective way to choose the degrees of freedom, then you have to choose a criterion to optimize. The one available in PROC GAM is generalized cross validation (GCV), requested by specifying METHOD=GCV, either for individual splines or globally. See the description of the option in the documentation of the MODEL statement, and the description of the application of cross validation to GAM models in "Selection of Smoothing Parameters" in the Details section of the GAM documentation. Note that a newer procedure for fitting GAMs is PROC GAMPL. See "PROC GAMPL Contrasted with PROC GAM" in the Overview section of the GAMPL documentation and this document that also compares them: https://support.sas.com/rnd/app/stat/topics/gam/gam.htm
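As a minimal sketch of requesting GCV in PROC GAM: the data set name mydata and the variables y, x1, and x2 are placeholders, not taken from a real analysis. The second step assumes that a DF= value given on a term takes precedence over METHOD=GCV for that term, which matches the behavior reported later in this thread but should be checked against the MODEL statement documentation for your release.

```sas
/* Global: no DF= on any spline, so GCV is asked to select the smoothing for every term */
proc gam data=mydata;   /* mydata, y, x1, x2 are hypothetical names */
   model y(event='1') = spline(x1) spline(x2) / dist=binomial method=gcv;
run;

/* Mixed (assumption: DF= overrides GCV for that term):
   x1 is left to GCV, while x2 keeps a user-specified df of 3 */
proc gam data=mydata;
   model y(event='1') = spline(x1) spline(x2, df=3) / dist=binomial method=gcv;
run;
```

Comparing the df reported for each smoothing component in the two runs is one way to see which terms GCV actually tuned.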

Season
Lapis Lazuli | Level 10

Thank you for your detailed reply. Before raising my question here, I had read a good deal of literature on generalized additive models (GAMs), but few papers described the principle for choosing the degrees of freedom of the spline effects. One paper suggested that researchers who want to build GAMs consult previously published literature when selecting the degrees of freedom.

I would like to raise a few questions regarding the "method=gcv" argument. I had already read the PROC GAM part of SAS Help thoroughly before raising my questions yesterday. My questions are: (1) If I do not specify the "method=gcv" argument in the MODEL statement, what method does SAS use to choose the smoothing parameters in GAM? In other words, what is the default method of choosing smoothing parameters in PROC GAM? (2) I have found that I am still able to designate the degrees of freedom for the spline effects after using the "method=gcv" argument, producing essentially the same model as the one SAS produces without the "method=gcv" argument. For instance, the following two statements produce the same model:

model y(event="1") = spline(x1,df=3) / dist=binomial;
model y(event="1") = spline(x1,df=3) / dist=binomial method=gcv;

However, the following statement produces a model different from the one produced by either of the two statements above:

model y(event="1") = spline(x1) / dist=binomial method=gcv;

After consulting SAS Help yet again for an explanation of this behavior (and not finding one, for SAS Help does not mention the issue above), I formed the following guess: the "method=gcv" argument is in fact a way to help the user find appropriate degrees of freedom for the spline effects. If the user designates the degrees of freedom himself or herself, the "method=gcv" argument is de facto "inactivated". I wonder whether my guess is correct.
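One way to see why a user-specified df would leave nothing for GCV to do, sketched for a single smoothing term in the Gaussian case with standard textbook notation (the exact form PROC GAM uses should be checked against "Selection of Smoothing Parameters"): the smoother maps the responses to fitted values through a matrix A(λ), the degrees of freedom are the trace of that matrix, and GCV picks the λ that minimizes an inflated average squared residual.

```latex
\hat{y} = A(\lambda)\, y, \qquad
\mathrm{df}(\lambda) = \operatorname{tr} A(\lambda), \qquad
\mathrm{GCV}(\lambda) =
  \frac{n \sum_{i=1}^{n} \bigl( y_i - \hat{f}_{\lambda}(x_i) \bigr)^{2}}
       {\bigl( n - \operatorname{tr} A(\lambda) \bigr)^{2}}
```

For a smoothing spline, df(λ) decreases steadily as λ grows, so fixing df=3 pins down λ for that term and the GCV search has no free quantity left to optimize. That would be consistent with the two df=3 statements giving the same fit, while spline(x1) without DF= gives a different one.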

Finally, I would like to ask about the PROC GAMPL procedure you mentioned. I have installed SAS 9.4 on my computer, but I retrieved no results when I typed "proc gampl" into my SAS Help window. I wonder on which platform PROC GAMPL can be run.

Thank you for your time and attention again!

Rick_SAS
SAS Super FREQ

Regarding the availability, PROC GAMPL was released with SAS/STAT 14.1 in SAS 9.4 TS1M3. For an overview, see Rodriguez (2016). For an example and discussion, see "Nonparametric regression for binary response data in SAS."

 

I strongly encourage you to use PROC GAMPL rather than the old PROC GAM. PROC GAMPL is much faster and uses more modern research and algorithms.
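For readers whose release includes it, a PROC GAMPL counterpart to the models discussed in this thread might look like the following minimal sketch. The data set mydata and the variables y, x1, and x2 are placeholders, and the option names should be checked against the GAMPL documentation for your release. No DF= value is supplied, on the assumption that the procedure estimates the smoothing parameters as part of the fit.

```sas
/* Hypothetical data set and variable names; a sketch, not a tested analysis */
proc gampl data=mydata;
   /* Binary response with two spline terms; smoothing is left to the procedure */
   model y(event='1') = spline(x1) spline(x2) / dist=binary;
run;
```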

Season
Lapis Lazuli | Level 10

Thank you, Rick, for your helpful reply. I have found that the SAS version installed on my computer is SAS 9.4 TS Level 1M2. You mentioned that PROC GAMPL is available in SAS 9.4 TS Level 1M3. Therefore, I would like to ask a question that departs somewhat from statistical methods: my institution has recently renewed our SAS license. I wonder whether I could update my SAS directly to SAS 9.4 TS Level 1M3. If I can, how should I do this?

Thank you for your time spent and attention paid to my questions! 

Rick_SAS
SAS Super FREQ

I specialize in statistics and programming, not administration, so unfortunately I do not know the answer to your question. However, I am confident that others on this forum can offer advice about updating your version of SAS.

Season
Lapis Lazuli | Level 10

Never mind. In fact, I had previously discussed the problems I raised on this forum with a professor at my institution. Although he was earnest and went out of his way to retrieve a paper on generalized additive models (GAMs) and the selection of the degrees of freedom for me, neither he nor I found an exact answer to that question. Having read numerous papers on GAMs, I was still confused about this issue, so in the end I decided to raise my questions here. Thank you @Rick_SAS and @StatDave for your detailed and helpful replies. Your replies have saved me a great deal of time and energy that I would otherwise have spent on statistical methods rather than on my research per se.

I would like to raise another question on statistics regarding GAM: since it takes time to obtain a new version of SAS (I may have to travel to the headquarters of my institution if a new installation package is a must in updating SAS, but since my schedule is full recently, I may have to use the current SAS version for yet another period of time), is the "method=gcv" argument in PROC GAM a reliable option to obtain degrees of freedom for the spline effects, from your point of view?

Finally, I would like to offer a few suggestions for the SAS Help regarding GAMs. (1) I hope that SAS could provide an example of choosing degrees of freedom in GAMs. (2) I hope that SAS could provide explanations and formulas for the following statistics reported by PROC GAM: a) the deviance of the final estimate, b) the weighted residuals that can be printed using an ODS table, c) the sum of squares (of the spline effects) in the table entitled "Smoothing Model Analysis Analysis of Deviance", d) "Num Unique Obs" in the table entitled "Smoothing Model Analysis Fit Summary for Smoothing Components". SAS Help explains none of the statistics mentioned above, which left me somewhat confused when interpreting the results.

Thank you both for your kind help once again!

SteveDenham
Jade | Level 19

Going back to the df question, the GAMPL documentation says:

"The degrees of freedom for generalized additive models that are fitted by the GAMPL procedure is defined as the trace of the degrees-of-freedom matrix. The degrees of freedom for generalized additive models that are fitted by the GAM procedure is approximated by summing the trace of the smoothing matrix for each smoothing term."

That should fill you in on how the df are obtained. Your other questions can (mostly) be answered by going through the Details tab for GAM or GAMPL.

 

SteveDenham

Season
Lapis Lazuli | Level 10

Thank you, Steve, for your kind help. Unfortunately, I had already accepted @StatDave's reply as the solution. I am pretty new to the forum, so I do not know how I could select all of your replies as the solution to this problem. Sorry for any inconvenience this may cause. I should have selected the solution later.

I noticed that the description of the degrees of freedom was given in a "matrix algebra" fashion. In SAS Help, the degrees of freedom for the jth smoothing term are the trace of Aj(λj). But as a medical researcher who had received barely any education in matrix algebra and calculus (in fact, I have taught myself some basic linear algebra in the past few months and have learned the definition of the trace of a matrix, but I know no more than that), I find myself still confused after consulting SAS Help, for I do not know what the trace of Aj(λj) stands for. I know that it may be too "greedy" to ask statisticians to express concepts in plain English (e.g., the P value is the probability of observing data at least as extreme as ours if the null hypothesis were true), but I could not grasp the significance of the concept from SAS Help alone. Now that you have given me a clearer explanation, I understand the concept better, but I am still not fully clear about the significance of the trace of the degrees-of-freedom matrix, and I am therefore still confused by questions like "If the trace is high, what does it mean? Will it have an impact on the model? If so, what impact? How should I maximize the positive impacts while minimizing the negative ones?" Sorry about that. I will continue to work hard on more advanced matrix algebra (which may seem basic to statisticians like you, of course) to understand statistics better, but it takes time to do so. Anyway, thank you for your help!

SteveDenham
Jade | Level 19

The trace of a matrix is the sum of all the entries on the main diagonal. If you fit thin-plate splines (GAMPL), the degrees of freedom are the sum of the main diagonal of the degrees-of-freedom matrix. I looked for a way to get this matrix out of GAMPL, but have not found one so far. In GAM, each smoothing term contributes the trace of its own smoothing matrix, so asymptotically the sum of those approximates the degrees of freedom.

 

SteveDenham 
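To make the trace a little more concrete, here is the standard textbook picture for a single cubic smoothing spline with n distinct x values. This is a sketch in generic notation, not the exact matrices that PROC GAM or PROC GAMPL build. Each fitted value is a weighted combination of the observed responses, the weights form the rows of a matrix A(λ), and the trace adds up how much each fitted value leans on its own observation.

```latex
\hat{y}_i = \sum_{j=1}^{n} a_{ij}(\lambda)\, y_j, \qquad
\mathrm{df}(\lambda) = \operatorname{tr} A(\lambda) = \sum_{i=1}^{n} a_{ii}(\lambda)

\lambda \to \infty:\ \hat{f} \text{ tends to the least-squares line},\ \mathrm{df} \to 2;
\qquad
\lambda \to 0:\ \hat{f} \text{ interpolates the data},\ \mathrm{df} \to n
```

Read this way, a larger trace means a wigglier curve that follows individual observations more closely (better apparent fit, more risk of overfitting), and a smaller trace means a stiffer curve that stays closer to a straight line.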

Season
Lapis Lazuli | Level 10

Thank you very much, Steve, for your patient explanation of the concepts. Actually, what I am wondering about is something more pragmatic: how do the degrees of freedom influence the model?

As you have explained, the sum of the traces of the smoothing matrices asymptotically approximates the degrees of freedom of the GAM. But as a researcher, what I am concerned with is something more practical. To put it in spoken English, my reaction upon hearing the definition of the degrees of freedom may be like "OK, I know that. So what? Does that bring trouble to me?" (I am not a native English speaker, so sorry for any offense the "So what" may cause, but it really reflects my concerns.)

More specifically, I am looking for a tactic or a flow chart for dealing with the degrees of freedom of a GAM. I am like a tourist who has traveled to an unfamiliar place and got lost. What I badly need right now is a way to find the place I am going to, and I hope that somebody who is familiar with the area can act as a guide. I am not that interested in the history or architecture of the buildings that surround me, at least not now. This "lost tourist" sentiment is what I, as a researcher without much mathematical proficiency, experience when facing a question I must tackle.

I know this may be hard for others to do if the person in need knows too little about the topic, just as it is hard for a university professor to teach calculus to a first-grade primary school student. But that is the case for me. I will work hard to learn more mathematics so that I can deal with such problems on my own, but until I succeed in doing so, asking more experienced experts for help is perhaps the only choice left to me.
