Solved: Variable Transformation

NicolasC · Posted 08-14-2017 05:13 AM

Hi there

I am wondering about the necessity of transforming my interval-scaled input variables. My target is also interval-scaled and I perform no transformation on it. I also have class input variables.

Basically, I compare the outputs from 5 models using the Model Comparison node, and the Average Squared Error is roughly the same for each model regardless of whether I transformed my input variables (optimal binning) or not. Could someone explain to me the mathematical necessity of binning input variables? In case this information is important, I also intend to log-transform the variables (on one side of my diagram) and standardize them (on the other side) after the binning/no-binning node. I hope this is clear.

DougWielenga · Posted 08-16-2017 05:21 PM

I am wondering about the necessity of transforming my interval-scaled input variables.

Different modeling methods will be differently impacted by the scale/distribution of the input variables. Tree-based models, for instance, would only depend on the ordering of the observations regardless of their magnitude. In reality, you might get different split points when comparing the splits for a variable to the splits for the log of that same variable, but it should not lead to major differences if you have sufficient data.
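To illustrate the point about ordering, here is a minimal sketch in plain Python (not SAS Enterprise Miner code). It assumes a hypothetical one-split "stump" that picks the split minimizing squared error, and the data, seed, and function name `best_split` are made up for the illustration. Because the log is monotone, the raw and logged versions of the input sort the observations the same way, so the best split separates the same observations in both cases.

```python
import math
import random


def best_split(x, y):
    """Return (sse, left_members) for the single split minimizing squared error."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        # A split is only possible between two distinct input values.
        if x[order[k - 1]] == x[order[k]]:
            continue
        left, right = order[:k], order[k:]
        sse = 0.0
        for part in (left, right):
            mean = sum(y[i] for i in part) / len(part)
            sse += sum((y[i] - mean) ** 2 for i in part)
        if best is None or sse < best[0]:
            best = (sse, frozenset(left))
    return best


# Synthetic data (an assumption for this sketch): a step in y at x = 30.
random.seed(0)
x = [random.uniform(1.0, 100.0) for _ in range(200)]
y = [(5.0 if xi > 30.0 else 1.0) + random.gauss(0.0, 0.5) for xi in x]
log_x = [math.log(xi) for xi in x]

sse_raw, left_raw = best_split(x, y)
sse_log, left_log = best_split(log_x, y)

# The same observations fall on each side of the split in both versions.
same_partition = left_raw == left_log
print(same_partition, sse_raw, sse_log)
```

The numeric threshold itself lives on a different scale in each version, and an implementation that places it between two training values (for example at a midpoint) can route a new observation in that gap differently, which is one way the split points can differ in practice.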

But is transforming the input variables 'necessary'? The short answer is that it helps more with some methods than with others. For less flexible modeling methods like regression models, it might be very important in some cases, while it might be less important for more flexible modeling methods like neural networks. It should have limited impact on tree-based models, as described above.

Could someone explain to me the mathematical necessity of binning input variables?

Data preparation is a way to obtain better-performing models from the same data set. As I mentioned above, the impact of those transformations can vary greatly depending on the modeling method being used and the distributions of the variables being transformed. There is no 'necessity' in that case, but it might be desirable. In some cases, the difference between a good and a great model could be simply how the data is prepared. Binning summarizes the data, which loses information in one sense, yet it can make the predictive model better if you are using a less flexible method like regression.

For interval variables, considering binned versions of your interval inputs allows you to model non-linearity that might not be easily captured by interactions and/or higher-order terms often used to make a regression model more 'flexible'. Considering both binned and raw versions of these variables in further variable selection will provide the variable selection routine with different ways to use the same information. The binning is not necessary but it stands to reason that considering potentially non-linear relationships should be of help. As a result, the binned variables might have a dramatic impact on regression models but would typically have a lesser impact on nonlinear modeling approaches like trees and neural networks unless the variables were very poorly conditioned.
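As a rough illustration of how binning can help a regression, the sketch below (plain Python, not SAS Enterprise Miner code) compares the Average Squared Error of a straight-line fit on a raw input with a fit on binned versions of the same input. Everything here is an assumption made for the example: the synthetic U-shaped relationship, the seed, the helper names, and the use of five equal-width bins rather than the optimal binning mentioned in the question. Regressing on a full set of bin indicators is equivalent to predicting the mean of the target within each bin, which is what the code does.

```python
import random


def ase(y, pred):
    """Average squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def bin_index(xi, lo, hi, n_bins):
    """Equal-width bin number for xi on [lo, hi], clamped to a valid bin."""
    k = int((xi - lo) / (hi - lo) * n_bins)
    return min(max(k, 0), n_bins - 1)


# Synthetic data (an assumption): a U-shaped relationship plus noise.
random.seed(1)
x = [random.uniform(0.0, 10.0) for _ in range(500)]
y = [(xi - 5.0) ** 2 + random.gauss(0.0, 1.0) for xi in x]

# Model 1: straight line on the raw input.
intercept, slope = fit_line(x, y)
pred_linear = [intercept + slope * xi for xi in x]

# Model 2: regression on bin indicators, i.e. the mean of y within each bin.
n_bins = 5
sums, counts = [0.0] * n_bins, [0] * n_bins
for xi, yi in zip(x, y):
    k = bin_index(xi, 0.0, 10.0, n_bins)
    sums[k] += yi
    counts[k] += 1
bin_means = [s / c for s, c in zip(sums, counts)]
pred_binned = [bin_means[bin_index(xi, 0.0, 10.0, n_bins)] for xi in x]

ase_linear = ase(y, pred_linear)
ase_binned = ase(y, pred_binned)
print(ase_linear, ase_binned)
```

On data like this the straight line cannot follow the curve, so the binned fit should show a clearly lower error. When the true relationship is close to linear, the two fits would be expected to score about the same, which is one possible reason for seeing similar Average Squared Error with and without binning.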

Hope this helps!

Doug

