Hi all,
while running this code:
proc gamselect data=casuser.gam_test_gamselect seed=220870 plots;
    model target=
        spline(P_AR_GLO P_AR_sem / degree=2 difforder=1 details df=7)
        spline(P_B_glo P_B_sem / degree=3 difforder=1 details df=7)
        spline(P_F_sem P_F_tri / degree=3 difforder=2 details df=6)
        spline(P_O_glo P_O_sta / degree=3 difforder=2 details df=6)
        spline(P_S_sem P_S_tri / degree=3 difforder=2 details df=6)
        spline(P_T_sem P_T_sta / degree=3 difforder=2 details df=6) / distribution=normal link=id;
    displayout SplineDetails=splinedet;
    partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
    selection method=boosting(choose=VALIDATE maxiter=500 STEPSIZE=0.10);
    output out=casuser.forecast_gamsel copyvars=(data target _role_);
run;
I'm getting this message:
NOTE: One observation with the validation role was omitted due to values outside of the interior knot ranges.
NOTE: 481393 bytes were written to the table "gam_model" in the caslib "CASUSER".
As far as I know, I could increase the number of knots for each spline term with the MAXKNOTS= option (each spline currently uses 10 interior knots, which should be the default).
But even after increasing the number of knots to a huge, impractical value, I get the same message.
Inspecting the data, the omitted points are "almost" maximum or minimum values in each partition.
Any suggestion is welcome.
Thanks
Look at your maximum and minimum variable values. You may have a much larger/smaller value than you think. That may mean you have a data point to clean up.
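A quick way to run that check is to compare the minimum and maximum of each predictor across the data roles. This is only a sketch: it reuses the table and variable names from the original post and assumes the role variable is named _role_ as in the PARTITION statement.

```sas
/* Min and max of each forecast variable within each data role.      */
/* A validation or test extreme outside the training min/max is a    */
/* candidate for the observation mentioned in the NOTE.              */
proc means data=casuser.gam_test_gamselect n min max;
    class _role_;
    var P_AR_GLO P_AR_sem P_B_glo P_B_sem P_F_sem P_F_tri
        P_O_glo P_O_sta P_S_sem P_S_tri P_T_sem P_T_sta;
run;
```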
By default, the boosting selection method in PROC GAMSELECT will use evenly spaced knots based on the range of values for observations with the training data role. You can use the ALLOBS option in the MODEL statement to request that observations from all data roles be used in the knot selection, which should lead to all observations having values in the interior knot range and no test or validation data being omitted.
To expand on what Michael has said, the NOTE is caused because one of the observations where _ROLE_='valid' is outside of the range of the variables for _ROLE_='train'. You can see this in the following example, in which I have manually created a validation observation that has (X,Y)=(20,20), which is outside of the range of the training values for (X,Y). As Michael says, you can use the ALLOBS option to change the knot basis.
cas;
libname mylib cas;
data mylib.gam_test;
    call streaminit(1234);
    _role_ = 'train';
    do i = 1 to 100;
        X = 2*rand("normal");
        Y = 2*rand("normal");
        target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
        output;
    end;
    _role_ = 'test';
    do i = 1 to 20;
        X = rand("normal");
        Y = rand("normal");
        target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
        output;
    end;
    _role_ = 'valid';
    do i = 1 to 20;
        X = rand("normal");
        Y = rand("normal");
        target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
        output;
    end;
    /* This validation obs is beyond the range of the training data. */
    /* _role_ is still 'valid' here, so it matches the description.  */
    X = 20; y = 20; target = 2*x - x**2/3 + 3*x - y**2/4 + rand("normal", 0, 0.1);
    output;
    drop i;
run;
proc gamselect data=mylib.gam_test seed=220870 plots;
    model target=
        spline(X Y / degree=2 difforder=1 details df=7) / /* allobs */
        distribution=normal link=id;
    displayout SplineDetails=splinedet;
    partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
    selection method=boosting(choose=VALIDATE maxiter=500 STEPSIZE=0.10);
    output out=mylib.forecast_gamsel copyvars=(X Y target _role_);
run;
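For comparison, here is the same call with ALLOBS switched on in the MODEL statement. Treat it as a sketch: the output table name forecast_gamsel_allobs is made up so that the earlier output is not overwritten, and PLOTS=NONE is used only to skip the graphics. With the knots placed over the range of all roles, the NOTE about omitted observations should no longer appear.

```sas
proc gamselect data=mylib.gam_test seed=220870 plots=none;
    /* ALLOBS: use observations from all data roles to place the knots */
    model target=
        spline(X Y / degree=2 difforder=1 details df=7) / allobs
        distribution=normal link=id;
    partition rolevar=_role_(TRAIN='train' VALIDATE='valid' TEST='test');
    selection method=boosting(choose=VALIDATE maxiter=500 STEPSIZE=0.10);
    /* forecast_gamsel_allobs is a hypothetical table name */
    output out=mylib.forecast_gamsel_allobs copyvars=(X Y target _role_);
run;
```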
Thanks to both!
Since the data provided are coming from 12 different forecast models, I used gamselect as an ensemble (btw with excellent results).
I'm quite suspicious of using the "allobs" option since the spline parameters are built using even future data.
I'm choosing a different approach that identifies extreme forecasts and replaces their value with the mean of all the other models.
Thanks
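For anyone who wants to try the replacement idea, one possible sketch follows. It is not the exact code used here, and it makes several assumptions: "extreme" is taken to mean outside the training min/max of that variable, the replacement is the mean of the in-range forecasts in the same row (rather than of all other models), and work.bounds, work.gam_clean, lo1-lo12 and hi1-hi12 are made-up names.

```sas
/* Step 1: training-only min/max for each of the 12 forecast variables */
proc means data=casuser.gam_test_gamselect noprint;
    where _role_ = 'train';
    var P_AR_GLO P_AR_sem P_B_glo P_B_sem P_F_sem P_F_tri
        P_O_glo P_O_sta P_S_sem P_S_tri P_T_sem P_T_sta;
    output out=work.bounds(drop=_type_ _freq_) min=lo1-lo12 max=hi1-hi12;
run;

/* Step 2: replace out-of-range forecasts with the mean of the in-range ones */
data work.gam_clean;
    if _n_ = 1 then set work.bounds;      /* bounds are retained for every row */
    set casuser.gam_test_gamselect;
    array f{12}  P_AR_GLO P_AR_sem P_B_glo P_B_sem P_F_sem P_F_tri
                 P_O_glo P_O_sta P_S_sem P_S_tri P_T_sem P_T_sta;
    array lo{12} lo1-lo12;
    array hi{12} hi1-hi12;
    array inrange{12} _temporary_;

    /* First pass: flag in-range forecasts and accumulate their sum */
    n_in = 0;
    sum_in = 0;
    do k = 1 to 12;
        inrange{k} = (lo{k} <= f{k} <= hi{k});   /* missing values count as out of range */
        if inrange{k} then do;
            n_in = n_in + 1;
            sum_in = sum_in + f{k};
        end;
    end;

    /* Second pass: overwrite the flagged forecasts, if any in-range value exists */
    if n_in > 0 then do k = 1 to 12;
        if not inrange{k} then f{k} = sum_in / n_in;
    end;

    drop k n_in sum_in lo1-lo12 hi1-hi12;
run;
```

Rows where every forecast is out of range are left unchanged, so those would still need separate handling.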
It's not really accurate to say "the spline parameters are built using even future data." The parameter estimates use only the training data. It is the location of the knots that you are changing so that you never need to extrapolate but you are always interpolating.
If you know the possible range of the variables, you can use the KNOTS=LIST(x1 y1 ... xn yn) syntax to specify the knot locations without relying on future observations. For example, if the variables are standardized uniform on [0, 1], you could specify the knots on a uniform 3x3 grid by using
KNOTS=LIST(0 0 0 0.5 0 1
0.5 0 0.5 0.5 0.5 1
1 0 1 0.5 1 1 )
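If the grid is larger than 3x3, typing the pairs by hand gets error-prone. A small sketch that builds the same list into a macro variable is shown below; the names knotlist, gx and gy are made up for illustration, and the resulting text is meant to be pasted into KNOTS=LIST( ) as described above.

```sas
/* Build "x1 y1 x2 y2 ..." for a regular grid on [0,1] x [0,1] */
data _null_;
    length list $ 2000;
    do gx = 0 to 1 by 0.5;
        do gy = 0 to 1 by 0.5;
            list = catx(' ', list, gx, gy);   /* append one (x, y) pair */
        end;
    end;
    call symputx('knotlist', list);
run;

/* Write the generated option text to the log for inspection */
%put KNOTS=LIST(&knotlist);
```

With a step of 0.5 this should reproduce the nine pairs listed above; changing the BY value gives a finer grid.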
I wouldn't expect that much of a difference in performance. By default, PROC GAMSELECT produces a panel of plots of partial prediction curves or surfaces of smoothing components (surfaces, for your example). I suspect that is why the PROC takes longer. Use PLOTS=NONE if you don't want the plots.