andrea_magatti
Obsidian | Level 7

Hi all,

while running this code:

proc gamselect data=casuser.gam_test_gamselect seed=220870 plots;
	model target=
		spline(P_AR_GLO P_AR_sem /  degree=2 difforder=1 details df=7 )
		spline(P_B_glo P_B_sem   /  degree=3 difforder=1 details df=7) 
		spline(P_F_sem P_F_tri   /  degree=3 difforder=2 details df=6) 
		spline(P_O_glo P_O_sta   /  degree=3 difforder=2 details df=6) 
		spline(P_S_sem P_S_tri   /  degree=3 difforder=2 details df=6) 
		spline(P_T_sem P_T_sta   /  degree=3 difforder=2 details df=6)	/distribution=normal link=id;
	displayout SplineDetails=splinedet;
	partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
	selection method=boosting(choose=VALIDATE maxiter=500 STEPSIZE=0.10);
	output out=casuser.forecast_gamsel copyvars=(data target _role_);
run;

I'm getting this message:

NOTE: One observation with the validation role was omitted due to values outside of the interior knot ranges.
NOTE: 481393 bytes were written to the table "gam_model" in the caslib "CASUSER".

As far as I know, I could increase the number of knots for any spline term by using the MAXKNOTS= option on that term (each spline currently uses 10 interior knots, which should be the default).

But even after increasing the number to a huge (and impractical) value, I get the same message.

 

Inspecting the data, the omitted points are "almost" maximum or minimum values in each partition.

 

Any suggestion is welcome.

Thanks

1 ACCEPTED SOLUTION

Rick_SAS
SAS Super FREQ

To expand on what Michael has said, the NOTE appears because one of the observations where _ROLE_='valid' lies outside the range of the variables for _ROLE_='train'. You can see this in the following example, in which I have manually added an observation with (X,Y)=(20,20), which is outside the range of the training values for (X,Y). (In the code as written, that extra observation is output while _role_='test', so it carries the test role; validation observations outside the training range are omitted in the same way.) As Michael says, you can use the ALLOBS option to change the knot basis.

 

cas;
libname mylib cas;

data mylib.gam_test;
call streaminit(1234);
_role_ = 'train';
do i = 1 to 100;
   X = 2*rand("normal");
   Y = 2*rand("normal");
   target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
   output;
end;

_role_ = 'valid';
do i = 1 to 20;
   X = rand("normal");
   Y = rand("normal");
   target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
   output;
end;

_role_ = 'test';
do i = 1 to 20;
   X = rand("normal");
   Y = rand("normal");
   target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
   output;
end;
/* This obs is beyond the range of the training data */
X = 20; y = 20; target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
output;
drop i;
run;

proc gamselect data=mylib.gam_test seed=220870 plots;
	model target=
		spline(X Y /  degree=2 difforder=1 details df=7 )	/ /* uncomment ALLOBS to base the knots on all roles: allobs */
		 distribution=normal link=id;
	displayout SplineDetails=splinedet;
	partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
	selection method=boosting(choose=VALIDATE maxiter=500 STEPSIZE=0.10);
	output out=mylib.forecast_gamsel copyvars=(X Y target _role_);
run;


8 REPLIES 8
ballardw
Super User

Look at your maximum and minimum variable values. You may have a much larger/smaller value than you think. That may mean you have a data point to clean up.
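
One way to make that check concrete is to compare the per-role minimum and maximum of every spline input. This is only a sketch: it reuses the table and variable names from the original post and assumes the table can be read by PROC MEANS through the CAS libref.

```sas
/* Compare the range of each spline input across the data roles.   */
/* A validation or test MIN below the training MIN (or a MAX above */
/* the training MAX) points at the observations that get omitted.  */
proc means data=casuser.gam_test_gamselect n min max;
   class _role_;
   var P_AR_GLO P_AR_sem P_B_glo P_B_sem P_F_sem P_F_tri
       P_O_glo P_O_sta P_S_sem P_S_tri P_T_sem P_T_sta;
run;
```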

MichaelL_SAS
SAS Employee

By default, the boosting selection method in PROC GAMSELECT will use evenly spaced knots based on the range of values for observations with the training data role. You can use the ALLOBS option in the MODEL statement to request that observations from all data roles be used in the knot selection, which should lead to all observations having values in the interior knot range and no test or validation data being omitted. 
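
As a rough illustration, the call below places ALLOBS after the slash in the MODEL statement. It assumes the example table mylib.gam_test from the accepted solution has already been created; the other options are copied from that example.

```sas
/* Sketch: same model as the accepted solution, with ALLOBS enabled  */
/* so that the knot range is taken from all data roles.              */
proc gamselect data=mylib.gam_test seed=220870;
   model target =
      spline(X Y / degree=2 difforder=1 df=7) /
      allobs distribution=normal link=id;
   partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
   selection method=boosting(choose=VALIDATE maxiter=500 stepsize=0.10);
run;
```

With ALLOBS in place, the NOTE about omitted observations should no longer appear for this table.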

andrea_magatti
Obsidian | Level 7

Thanks to both!

Since the data provided are coming from 12 different forecast models, I used gamselect as an ensemble (btw with excellent results).

 

I'm quite suspicious of using the "allobs" option since the spline parameters are built using even future data.

I'm choosing a different approach that identifies extreme forecasts and replaces their value with the mean of all the other models.
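
A minimal sketch of that replacement idea follows. Everything in it is hypothetical: the table work.forecasts, the four forecast columns f1-f4 (standing in for the 12 models), and the bounds lo and hi that define "extreme" are assumptions made only for illustration.

```sas
/* Hypothetical data: f1-f4 are forecasts from four models */
data work.forecasts;
   input f1-f4;
   datalines;
10 11 12 95
9 10 11 10
;
run;

data work.cleaned;
   set work.forecasts;
   array f{4} f1-f4;      /* original forecasts                  */
   array g{4} g1-g4;      /* cleaned copies                      */
   lo = 0;  hi = 50;      /* assumed plausible range (made up)   */
   do j = 1 to dim(f);
      g{j} = f{j};
      /* Replace an out-of-range forecast with the mean of the   */
      /* other models: (sum of all - this one) / (count - 1).    */
      if not missing(f{j}) and (f{j} < lo or f{j} > hi) then
         g{j} = (sum(of f{*}) - f{j}) / (n(of f{*}) - 1);
   end;
   drop j lo hi;
run;
```

In the first row, f4=95 is outside the assumed range, so g4 becomes (10+11+12)/3 = 11. Note that the mean of the "other" models would still include any other extreme value in the same row.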

Thanks

Rick_SAS
SAS Super FREQ

It's not really accurate to say "the spline parameters are built using even future data." The parameter estimates use only the training data. What ALLOBS changes is the location of the knots, so that you never need to extrapolate and are always interpolating.

 

If you know the possible range of the variables, you can use the KNOTS=LIST(x1 y1 ... xn yn) syntax to specify the knot locations without relying on future observations. For example, if the variables are standardized uniform on [0, 1], you could specify the knots on a uniform 3x3 grid by using

KNOTS=LIST(0   0     0   0.5   0   1
           0.5 0     0.5 0.5   0.5 1
           1   0     1   0.5   1   1)
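
For larger grids, the list of (x, y) pairs could be generated rather than typed. The sketch below builds the same 3x3 grid into a macro variable; the name knotlist is an arbitrary choice, and placing it inside the spline options simply follows the KNOTS=LIST syntax described above.

```sas
/* Build "x1 y1 x2 y2 ..." for a uniform 3x3 grid on [0,1] x [0,1] */
data _null_;
   length knotlist $200;
   do x = 0 to 1 by 0.5;
      do y = 0 to 1 by 0.5;
         knotlist = catx(' ', knotlist, x, y);   /* append one pair */
      end;
   end;
   call symputx('knotlist', knotlist);
run;

/* Show the option text that could be pasted into the spline term */
%put KNOTS=LIST(&knotlist);
```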
MichaelL_SAS
SAS Employee
Rick makes an important point clarifying exactly how the ALLOBS option is used: it only changes the data range used for knot placement and does not otherwise affect the model training.

I will also note that the data range is applied to knot lists as well. PROC GAMSELECT prints a note to the log when input knot values fall outside the data range and are ignored. This matches the behavior of the GAMMOD and GAMPL procedures.
andrea_magatti
Obsidian | Level 7
Hi all,
I'm writing a final note about PROC GAMSELECT and the gam.gamSelect action.
Both produce precisely the same results. On the positive side, the action is much faster than the procedure (0.6 seconds vs. 6 seconds), and it also makes it possible to separate the model-building phase from the scoring stage.
This is a natural enhancement to the GAM family that gives us much more flexibility in a production environment.
Thanks again!
Rick_SAS
SAS Super FREQ

I wouldn't expect that much of a difference in performance. By default, PROC GAMSELECT produces a panel of plots of partial prediction curves or surfaces of the smoothing components (surfaces, for your example). I suspect that is why the PROC takes longer. Use PLOTS=NONE if you don't want the plots.

 
