Hello,
So I have done a lot of research on regression modelling, and according to many articles, some people dummy code or group rare levels to transform their categorical variables in SAS Enterprise Miner, while others let their model run without transforming them.
My question is whether that is really necessary. For example, one of my categorical variables is Armed Category, which I preprocessed on my own by condensing another column called Armed. Armed Category has approximately 10-11 levels, such as 'Blunt Weapon', 'Gun', 'Other Weapon', etc.
Another variable is Region which has 5 levels.
I need to understand why it is so important to transform these. Can't we simply leave them as is and let the Regression node run and produce the results? I know that certain interval variables may need a log transform; my only one is Age, and I am also a little confused about whether I should bin it or leave it as is.
I have already removed all missing values etc.
I need to understand why is it so important to transform these?
You do not need to program or create the dummy coding yourself. SAS modeling PROCs that have a CLASS statement will do this for you behind the scenes for categorical variables. Dummy coding is still how categorical variables are handled internally.
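To make "dummy coding" concrete, here is a rough sketch in plain Python (not SAS, and not what any SAS procedure literally runs). The data rows are made up, the function name dummy_code is hypothetical, and choosing the last level in sort order as the reference is an assumption made only for this illustration.

```python
# Illustration only: plain Python, not SAS. The rows below are made up.
regions = ["midwest", "west", "northeast", "west", "southwest", "northwest"]


def dummy_code(values, reference=None):
    """Return (column_names, rows) of 0/1 indicators, one per non-reference level."""
    levels = sorted(set(values))
    if reference is None:
        # Assumption for this sketch: the last level in sort order is the reference.
        reference = levels[-1]
    kept = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


columns, rows = dummy_code(regions)
print(columns)
for value, row in zip(regions, rows):
    print(value, row)
```

A variable with 5 levels becomes 4 indicator columns here, and a row belonging to the reference level is all zeros. This is the kind of bookkeeping the software does for you when a variable is declared categorical.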
Combining categories is a judgment call by the data analyst (that's you). If two categories are similar from a subject matter point of view, you can decide to combine them. Another reason to combine categories (if it makes sense to do so) is that there may not be a lot of data in each category; once you combine them, you have more data in the combined category, and maybe a better fitting model.
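As a rough sketch of combining sparse categories (again plain Python rather than SAS), the snippet below folds every level with fewer than a chosen number of rows into a single 'Other' level. The counts, the 'Axe' and 'Crossbow' levels, the threshold of 3, and the function name group_rare_levels are all made up for illustration; whether such a merge makes sense is still a subject matter decision.

```python
from collections import Counter


def group_rare_levels(values, min_count, other_label="Other"):
    """Replace every level seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Made-up data: two common levels and two hypothetical rare ones.
armed = ["Gun"] * 6 + ["Blunt Weapon"] * 3 + ["Axe"] + ["Crossbow"]

grouped = group_rare_levels(armed, min_count=3)
print(Counter(armed))    # level counts before grouping
print(Counter(grouped))  # level counts after grouping
```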
Can't we simply leave them as is and let the regression node run and produce the results?
Certainly. There's no way of knowing which method (combining or not combining) will work best until you try them. That's one of the really nice features of Enterprise Miner: it's pretty simple to try many models and see what works best.
I know for certain interval variables we need to possibly log them and my only one is Age, which I am also a little confused about whether I should go ahead and bin it or leave it as is.
Binning, again, is up to your judgment, and as stated above, it's easy to try with and without binning and see what works. I doubt anyone can say a priori that binning will always produce better results on your data. Sometimes it will and sometimes it won't, and there's only one way to find out.
I do think some people have a "process" in mind for doing these models that either includes binning, or doesn't include binning, and they don't try the other possibility, and that's fine too, as it is faster to just go ahead and fit the model with binning (or not binning). I happen to fall in the not-binning mind-set most of the time, but I can see advantages of binning.
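For reference, binning an interval variable such as Age just means replacing each value with the label of the range it falls in. A minimal sketch in plain Python (not SAS) follows; the cut points 18, 30, 45, 60 and the function name bin_age are arbitrary assumptions, not recommended values.

```python
from bisect import bisect_right

# Assumed cut points, chosen only for illustration.
AGE_EDGES = (18, 30, 45, 60)
AGE_LABELS = ["<18", "18-29", "30-44", "45-59", "60+"]


def bin_age(age, edges=AGE_EDGES, labels=AGE_LABELS):
    """Map a numeric age to the label of the bin that contains it."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return labels[bisect_right(edges, age)]


ages = [17, 18, 29, 30, 44, 59, 60, 83]  # made-up values
binned = [bin_age(a) for a in ages]
for age, label in zip(ages, binned):
    print(age, label)
```

Fitting the model once with the raw Age and once with the binned version is then a direct way to compare the two approaches.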
Firstly, thank you for answering my post! I really appreciate it.
Regarding your answer, I understand that I can combine levels myself where I see fit, perhaps through the Replacement node.
However, I was referring specifically to the Transform Variables node in SAS EM, where the variables are dummy-coded or grouped.
Okay, then I don't really know what your question is about the Transform Variable node, which was not specifically mentioned anywhere in your original problem statement.
I have, I mentioned I want to understand about 'Transforming' a variable. The Replacement doesn't have any option to dummy code. Thank you for your input, I genuinely do appreciate the help.
@b_smsha wrote:
I have, I mentioned I want to understand about 'Transforming' a variable. The Replacement doesn't have any option to dummy code. Thank you for your input, I genuinely do appreciate the help.
The highlighted text does not mention the "Transform Variable Node". Your post mentions transforming but not where, when, how, or why. You are requiring us to assume how you did something, which is often a sub-optimal approach.
Transforming might be done for a large number of reasons. One might be that your data doesn't exactly match the "literature" you have searched, and you want to see whether your data, transformed to look like the data in the literature, behaves the same.
Another reason may be sample size. If you only have 100 records and one of the categorical variables takes on 50 values, then you will likely have issues fitting that raw variable in many models. Numeric variables with a wide range of values, such as income that might range from 1,000 to 1,000,000,000, may give unusual results, so something to reduce the influence of a few very large (or very small) values might be needed. And that just scratches the surface. What is appropriate with your data? We don't know. We don't have your data or much of a description.
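To illustrate the point about wide-ranging numeric values, here is a small sketch in plain Python (not SAS) of a log transform, one common way to reduce the influence of a few very large values. The income figures are made up to span the range mentioned above, and base 10 is an arbitrary choice.

```python
import math

# Made-up incomes spanning the range mentioned above.
incomes = [1_000, 50_000, 1_000_000, 1_000_000_000]

# log10 compresses the scale: each factor of 10 in income adds 1 to the result.
logged = [math.log10(x) for x in incomes]

raw_ratio = max(incomes) / min(incomes)   # how many times larger the max is than the min
logged_spread = max(logged) - min(logged)  # same gap measured in powers of 10

print(logged)
print(raw_ratio, logged_spread)
```

On the raw scale the largest value is a million times the smallest; on the log scale the two differ by only 6 units, so a single extreme record has far less leverage.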
One of the nice things about computers and code is you can do some of these things very simply and quickly and see what happens.
Oh, I understand. I am sorry for my mistake there. My confusion is mainly with categorical variables like region name and threat level.
For example, Region has 5 levels (midwest, southwest, west, northwest, northeast)
and threat level has 3 levels (attack, other, undetermined).
My concern is that when I transform these, each level becomes its own dummy variable in a separate column, so I end up with 8 new columns with 2 levels each. How is this beneficial for our analysis? Does it give a better predictive model, or is it better to forgo it?