Solved: Re: GAMSELECT and values outside of the interior knot ranges

andrea_magatti · Posted 08-11-2021 03:44 AM

Hi all,

while running this code:

proc gamselect data=casuser.gam_test_gamselect seed=220870 plots;
	model target=
		spline(P_AR_GLO P_AR_sem /  degree=2 difforder=1 details df=7 )
		spline(P_B_glo P_B_sem   /  degree=3 difforder=1 details df=7) 
		spline(P_F_sem P_F_tri   /  degree=3 difforder=2 details df=6) 
		spline(P_O_glo P_O_sta   /  degree=3 difforder=2 details df=6) 
		spline(P_S_sem P_S_tri   /  degree=3 difforder=2 details df=6) 
		spline(P_T_sem P_T_sta   /  degree=3 difforder=2 details df=6)	/distribution=normal link=id;
	displayout SplineDetails=splinedet;
	partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
	selection method=boosting(choose=VALIDATE maxiter=500 STEPSIZE=0.10);
	output out=casuser.forecast_gamsel copyvars=(data target _role_);
run;

I'm getting this message:

NOTE: One observation with the validation role was omitted due to values outside of the interior knot ranges.
NOTE: 481393 bytes were written to the table "gam_model" in the caslib "CASUSER".

As far as I know, I can increase the number of knots for any spline term by using the MAXKNOTS= option on that term (currently each spline uses 10 interior knots, which should be the default).

But even after increasing the number to a huge (and impractical) value, I get the same message.

Inspecting the data, I see that the omitted points are close to the maximum or minimum values in each partition.

Any suggestion is welcome.

Thanks

ballardw · Posted 08-11-2021 04:16 AM

Look at your maximum and minimum variable values. You may have a much larger/smaller value than you think. That may mean you have a data point to clean up.
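
One way to run that check is sketched below. It assumes the casuser libref from the original post is already assigned and uses the variable names from the posted MODEL statement. The output lists the count, minimum, and maximum of each forecast variable per partition, so a validation or test value that falls outside the training range stands out.

```sas
/* Sketch: compare the range of each forecast variable across partitions. */
/* Table and variable names are taken from the original post.             */
proc means data=casuser.gam_test_gamselect n min max maxdec=4;
   class _role_;
   var P_AR_GLO P_AR_sem P_B_glo P_B_sem P_F_sem P_F_tri
       P_O_glo P_O_sta P_S_sem P_S_tri P_T_sem P_T_sta;
run;
```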

MichaelL_SAS · Posted 08-11-2021 09:18 AM

By default, the boosting selection method in PROC GAMSELECT will use evenly spaced knots based on the range of values for observations with the training data role. You can use the ALLOBS option in the MODEL statement to request that observations from all data roles be used in the knot selection, which should lead to all observations having values in the interior knot range and no test or validation data being omitted.

Rick_SAS · Posted 08-11-2021 09:47 AM

To expand on what Michael has said, the NOTE appears because one of the observations where _ROLE_='valid' is outside the range of the variables for _ROLE_='train'. You can see this in the following example, in which I have manually created a validation observation that has (X,Y)=(20,20), which is outside the range of the training values for (X,Y). As Michael says, you can use the ALLOBS option to change the knot basis.

cas;
libname mylib cas;

data mylib.gam_test;
call streaminit(1234);
_role_ = 'train';
do i = 1 to 100;
   X = 2*rand("normal");
   Y = 2*rand("normal");
   target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
   output;
end;

_role_ = 'valid';
do i = 1 to 20;
   X = rand("normal");
   Y = rand("normal");
   target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
   output;
end;

_role_ = 'test';
do i = 1 to 20;
   X = rand("normal");
   Y = rand("normal");
   target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
   output;
end;
/* This validation obs is beyond the range of the training data */
_role_ = 'valid';
X = 20; Y = 20; target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
output;
drop i;
run;

proc gamselect data=mylib.gam_test seed=220870 plots;
	model target=
		spline(X Y /  degree=2 difforder=1 details df=7 )	/ /* allobs */
		 distribution=normal link=id;
	displayout SplineDetails=splinedet;
	partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
	selection method=boosting(choose=VALIDATE maxiter=500 STEPSIZE=0.10);
	output out=mylib.forecast_gamsel copyvars=(X Y target _role_);
run;
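
For comparison, here is a sketch of the same call with ALLOBS switched on, that is, with the comment markers removed. Placing it among the MODEL options after the slash simply follows the commented-out position in the call above. With it, the knot range should cover the row with X=20, Y=20, so the NOTE about omitted observations should no longer appear.

```sas
/* Same model as above, with ALLOBS enabled so that all data roles */
/* contribute to the range used for knot placement.                */
proc gamselect data=mylib.gam_test seed=220870 plots;
   model target =
         spline(X Y / degree=2 difforder=1 details df=7)
         / allobs distribution=normal link=id;
   displayout SplineDetails=splinedet;
   partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
   selection method=boosting(choose=VALIDATE maxiter=500 stepsize=0.10);
   output out=mylib.forecast_gamsel copyvars=(X Y target _role_);
run;
```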

andrea_magatti · Posted 08-11-2021 10:33 AM

Thanks to both!

The data come from 12 different forecast models, and I am using GAMSELECT as an ensemble (with excellent results, by the way).

I'm quite suspicious of using the "allobs" option since the spline parameters are built using even future data.

I'm choosing a different approach that identifies extreme forecasts and replaces their value with the mean of all the other models.
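
A rough sketch of that kind of preprocessing is below. Everything beyond the 12 forecast variable names is an assumption made for illustration: the table names work.forecasts and work.forecasts_clean are hypothetical, and "extreme" is taken to mean more than 3 median absolute deviations from the row median, which is an arbitrary rule rather than the one actually used.

```sas
/* Sketch only: table names and the 3*MAD rule are illustrative assumptions. */
data work.forecasts_clean;
   set work.forecasts;
   array p{12} P_AR_GLO P_AR_sem P_B_glo P_B_sem P_F_sem P_F_tri
               P_O_glo P_O_sta P_S_sem P_S_tri P_T_sem P_T_sta;

   /* Row-level summaries, computed before any value is replaced */
   total  = sum(of p{*});     /* sum of the nonmissing forecasts   */
   cnt    = n(of p{*});       /* number of nonmissing forecasts    */
   med    = median(of p{*});  /* row median                        */
   spread = mad(of p{*});     /* median absolute deviation         */

   do j = 1 to dim(p);
      /* Replace an extreme forecast with the mean of the other models */
      if not missing(p{j}) and cnt > 1 and spread > 0
         and abs(p{j} - med) > 3*spread then
         p{j} = (total - p{j}) / (cnt - 1);
   end;
   drop j total cnt med spread;
run;
```

Because the row totals are computed before the loop, each replacement uses the original values of the other forecasts, including any that are themselves flagged as extreme.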

Thanks

Rick_SAS · Posted 08-11-2021 10:59 AM

It's not really accurate to say "the spline parameters are built using even future data." The parameter estimates use only the training data. What changes is the location of the knots, so that you are always interpolating and never need to extrapolate.

If you know the possible range of the variables, you can use the KNOTS=LIST(x1 y1 ... xn yn) syntax to specify the knot locations without relying on future observations. For example, if the variables are standardized uniform on [0, 1], you could specify the knots on a uniform 3x3 grid by using

KNOTS=LIST(0   0    0   0.5  0   1
           0.5 0    0.5 0.5  0.5 1
           1   0    1   0.5  1   1)

MichaelL_SAS · Posted 08-11-2021 11:32 AM

Rick makes an important point in clarifying exactly how the ALLOBS option is used. It changes only the data range used for knot placement and does not otherwise affect the model training.

I will also note that the data range is applied to knot lists as well: PROC GAMSELECT prints a note to the log when input knot values are outside the data range and are ignored. This matches the behavior of the GAMMOD and GAMPL procedures.

andrea_magatti · Posted 08-11-2021 12:14 PM

Hi all,
I'm writing a final note about PROC GAMSELECT and the gam.gamSelect action.
Both produce precisely the same results. On the positive side, the action is much faster than the procedure (0.6 seconds vs. 6 seconds) and also lets us separate the model-building phase from the scoring stage.
This is a natural enhancement to the GAM family that gives us much more flexibility in a production environment.
Thanks again!

Rick_SAS · Posted 08-11-2021 01:00 PM

I wouldn't expect that much of a difference in performance. By default, PROC GAMSELECT produces a panel of plots of the partial prediction curves or surfaces of the smoothing components (surfaces, for your example). I suspect that is why the PROC takes longer. Use PLOTS=NONE if you don't want the plots.
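
As a sketch, a timing comparison against the action could use a call like the one below, which reuses the simulated mylib.gam_test table from earlier in the thread and differs from that call only in PLOTS=NONE and in dropping the DISPLAYOUT and OUTPUT statements.

```sas
/* Same model as in the earlier example, with graphics suppressed */
proc gamselect data=mylib.gam_test seed=220870 plots=none;
   model target =
         spline(X Y / degree=2 difforder=1 details df=7)
         / distribution=normal link=id;
   partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
   selection method=boosting(choose=VALIDATE maxiter=500 stepsize=0.10);
run;
```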
