EDIT: ***
Based on some of the answers, I guess it's better to keep race in its original (nominal) format for the modeling.
Race - Target (originally was 6 levels but is now converted to Binary 0,1)
The original Race column is known as 'Race_Label'.
Mental_Illness - Target Binary
Firstly, I am using both as my target variables and building separate predictive models to analyze their relationship with the input variables. I think the other input variables are irrelevant to my question, so I am only talking about my two targets and what I want to try to do with them.
My first predictive model uses 'Mental_Illness' as the target to build a decision tree and a regression model. It includes the race variable as an input, since I have to analyze its relation to mental illness. My question is: should I use the converted binary race variable as the input for the model, or use race in its original format?
This is how I am trying to choose the best model. One model is using the converted race variable as an input while the other model is using race as a nominal (6 level) input variable. I am trying to see the difference but I am getting confused as I feel like I am wasting time trying to see what option would be better.
My data partition is set at 70/30.
And below are the outputs of one of the types of decision tree model (Splitting criteria is set at Gini with Decision/Misclassification Rate as the assessment measure or pruning measure) with first the converted race variable as input.
and the second picture shows the original race variable being used as input.
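For readers without the screenshots, here is a minimal sketch of what a Gini-based split "sees" under each coding. It is not SAS EM's algorithm: it assumes a simple multiway split on every level, and the records, the `gini` helper and the `weighted_gini_after_split` helper are all hypothetical. The toy values are deliberately contrived so that the collapsed coding hides the pattern. Real data may show far less difference.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini_after_split(groups, targets):
    """Size-weighted Gini impurity after splitting on every level of `groups`."""
    by_group = defaultdict(list)
    for g, t in zip(groups, targets):
        by_group[g].append(t)
    n = len(targets)
    return sum(len(ts) / n * gini(ts) for ts in by_group.values())

# Invented toy records, NOT taken from the real dataset.
race_label = ["Black", "Black", "White", "White",
              "Asian", "Asian", "Hispanic", "Hispanic"]
mental_illness = [0, 0, 1, 1, 1, 1, 0, 0]

# Binary coding as described in the thread: 1 = Black, Native, Asian.
race_binary = [1 if r in {"Black", "Native", "Asian"} else 0 for r in race_label]

parent = gini(mental_illness)
after_binary = weighted_gini_after_split(race_binary, mental_illness)
after_nominal = weighted_gini_after_split(race_label, mental_illness)

print(f"parent impurity:        {parent:.3f}")
print(f"after binary split:     {after_binary:.3f}")
print(f"after 6-level split:    {after_nominal:.3f}")
```

Because the binary variable is just a coarser grouping of the nominal one, a split on all nominal levels can never leave more impurity than the binary split. Whether the gain survives pruning and validation is a separate question.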
I hope I am making sense. Thanks in advance for your help, everyone.
Can you be more specific about what is confusing you? Just saying you are confused doesn't really help us to give a meaningful response.
@b_smsha wrote:
The confusion is choosing between what method I should go with, whether I should use race as the converted input binary variable or as a nominal categorical variable for my prediction modeling.
What do the levels of that "input binary variable" mean? Assuming that the values are 1 and 0, what does a 1 indicate and what does a 0 indicate?
Can you provide an example of what report text using this variable might look like? This would not be based on your data, just a planned concept. Do you want to report something like "Race A had more/fewer/the same proportion/a different proportion of Result 1 than Race B" or "Race A had the highest/lowest rate for Result than all other races combined"?
My gut feeling without any details is that the binary variable likely isn't the place to deal with "race" as an independent variable.
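To make the two kinds of report statement concrete, here is a rough sketch in Python. The records are invented toy values (not the real data), and `rate_by` is a hypothetical helper. The first call gives a rate per individual race, which supports "Race A vs Race B" statements. The second gives one race against all others combined.

```python
from collections import defaultdict

# Invented toy records (race_label, mental_illness); NOT the real dataset.
records = [
    ("Black", 1), ("Black", 0), ("Black", 0), ("Black", 0),
    ("White", 1), ("White", 1), ("White", 0), ("White", 0),
    ("Asian", 0), ("Asian", 0),
    ("Hispanic", 1), ("Hispanic", 0),
]

def rate_by(records, key):
    """Proportion of mental_illness == 1 within each group returned by key()."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_with_illness, n_total]
    for race, illness in records:
        group = key(race)
        counts[group][0] += illness
        counts[group][1] += 1
    return {g: hits / total for g, (hits, total) in counts.items()}

# "Race A had a different proportion than Race B": needs every level kept.
per_race = rate_by(records, key=lambda race: race)

# "Race A vs all other races combined": a one-vs-rest grouping.
black_vs_rest = rate_by(
    records, key=lambda race: "Black" if race == "Black" else "all others"
)

print(per_race)
print(black_vs_rest)
```

A fixed two-group coding can only ever support statements about those two groups, so the kind of sentence you want to write largely decides which coding you need.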
Ah yes, so originally my target, race, was a nominal variable with 6 levels: Black, White, Asian, Native, Hispanic and Other. After some research, I understood that SAS EM isn't suited for multinomial analysis, so I converted it into a binary target variable in SAS Studio, with 1 = Black, Native and Asian and 0 = White, Hispanic and Other, so it's easier to do predictive modeling with race as the target variable.
Now, if I want to use race as an independent variable to model mental illness as the target, this would be an example output.
So, for example, if I used race as a binary input and modelled a decision tree, it would say something like:
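The recoding described above was done in SAS Studio. Purely as an illustration of the mapping, here is the same grouping as a small Python sketch. The function name `race_to_binary` is hypothetical, and the two groups are copied from the description in this post.

```python
# Grouping copied from the post: 1 = Black, Native, Asian; 0 = White, Hispanic, Other.
GROUP_ONE = {"Black", "Native", "Asian"}
GROUP_ZERO = {"White", "Hispanic", "Other"}

def race_to_binary(race_label):
    """Collapse the 6-level Race_Label into the 0/1 coding described in the post."""
    if race_label in GROUP_ONE:
        return 1
    if race_label in GROUP_ZERO:
        return 0
    # Fail loudly rather than silently assigning unknown or missing labels.
    raise ValueError(f"unexpected race label: {race_label!r}")

labels = ["Black", "White", "Asian", "Hispanic", "Native", "Other"]
binary = [race_to_binary(r) for r in labels]
print(binary)
```

Keeping the original 'Race_Label' column alongside the new one means either coding can still be used as a model input later.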
if race = 1 and if ... then
mental illness 0 = 0.6
mental illness 1 = 0.4
It would tell me that the White, Hispanic and Other races were more likely to not have a mental illness, based on the other conditions in the rule, etc.
Overall, I just wanted to see some opinions on what would be better suited, to go with race in its original format or binary format as an independent variable to do predictive modelling for mental illness as a target.
I have to do prediction with both mental illness and race as target variables (separate predictive models).
This is the original dataset without any cleaning. I didn't want to link my own copy because I've changed some things, but these are the variables: https://github.com/washingtonpost/data-police-shootings
I have to do something sort of like this:
https://towardsdatascience.com/an-examination-of-fatal-force-by-police-in-the-us-db897d97085c
The person above first did a prediction or classification model with race as a target, then he did more models with mental illness as a target. So that's the whole gist of the project I need to do.
I'm using the above dataset to build a separate predictive model for each of mental illness and race, and to see which variables correlate most with each target.