Is it better to use the race input variable in its original form or converted form for prediction?

b_smsha · Posted 06-30-2021 10:01 AM

EDIT: ***

Based on some of the answers, I guess it's better to keep race in normal format for the modeling.

Race - Target (originally was 6 levels but is now converted to Binary 0,1)

The original Race column is known as 'Race_Label'.

Mental_Illness - Target Binary

First, I am using both as target variables and building separate predictive models to analyze their relationship with the input variables. I think the other input variables are irrelevant to my question, so I am only talking about my two targets and what I want to do with them.

The first predictive model I am building uses mental illness as the target, with a decision tree and a regression model. It includes the race variable as an input because I have to analyze its relation to mental illness. My question is: should I use the converted binary race variable in the model, or the variable in its original format?

This is how I am trying to choose the best model. One model is using the converted race variable as an input while the other model is using race as a nominal (6 level) input variable. I am trying to see the difference but I am getting confused as I feel like I am wasting time trying to see what option would be better.

My data partition is set at 70/30.

Below are the outputs of one type of decision tree model (splitting criterion set to Gini, with Decision/Misclassification Rate as the assessment or pruning measure), first with the converted race variable as input.

The second picture shows the original race variable being used as input.

I hope I am making sense. Thanks in advance for your help, everyone.

PaigeMiller · Posted 06-30-2021 12:00 PM

This is how I am trying to choose the best model. One model is using the converted race variable as an input while the other model is using race as a nominal (6 level) input variable. I am trying to see the difference but I am getting confused as I feel like I am wasting time trying to see what option would be better.

Can you be more specific about what is confusing you? Just saying you are confused doesn't really help us to give a meaningful response.

--
Paige Miller

b_smsha · Posted 06-30-2021 12:20 PM

The confusion is choosing between what method I should go with, whether I should use race as the converted input binary variable or as a nominal categorical variable for my prediction modeling.

ballardw · Posted 06-30-2021 12:33 PM

@b_smsha wrote:
The confusion is choosing between what method I should go with, whether I should use race as the converted input binary variable or as a nominal categorical variable for my prediction modeling.

What do the levels of that "input binary variable" mean? Assuming that the values are 1 and 0, what does a 1 indicate and what does a 0 indicate?

Can you provide an example of report text that might use this variable? It would not be based on your data, just a planned concept. Do you want to report something like "Race A had a higher/lower/the same/a different proportion of Result 1 than Race B" or "Race A had a higher/lower rate for Result than all other races combined"?

My gut feeling without any details is that the binary variable likely isn't the place to deal with "race" as an independent variable.

b_smsha · Posted 06-30-2021 12:51 PM

Ah yes, so originally my target race was a nominal variable with 6 levels: Black, White, Asian, Native, Hispanic and Other. After some research, I understood that SAS EM isn't suited for multinomial analysis, so I converted it into a binary target variable in SAS Studio, with 1 = Black, Native and Asian and 0 = White, Hispanic and Other, so it's easier to do predictive modeling with race as the target variable.
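For readers following along, the recode described above can be sketched in a few lines of Python. This is only an illustration of the mapping stated in the post, not the SAS Studio code that was actually run; the name `RACE_TO_BINARY`, the helper `recode`, and the sample labels are assumptions made up for the example.

```python
# The six original Race_Label levels and the 0/1 recode described in the post.
RACE_TO_BINARY = {
    "Black": 1, "Native": 1, "Asian": 1,
    "White": 0, "Hispanic": 0, "Other": 0,
}

def recode(race_label):
    """Map one Race_Label value to the binary Race value."""
    return RACE_TO_BINARY[race_label]

# Hypothetical sample values, for illustration only.
labels = ["Black", "White", "Asian", "Hispanic", "Native", "Other"]
binary = [recode(x) for x in labels]
print(binary)  # [1, 0, 1, 0, 1, 0]
```

The mapping is many-to-one, so the original six levels cannot be recovered from the binary column alone.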

Now, if I want to use race as an independent variable when modeling mental illness as the target, this would be an example output.

So, for example, if I were using race as a binary input and modelled a decision tree, it would say something like:

if race = 1 and if ... then

mental illness 0 = 0.6

mental illness 1 = 0.4

It would tell me that the races coded 1 (Black, Native and Asian under my recoding) were more likely to not have a mental illness, based on ..., etc.

Overall, I just wanted some opinions on which would be better suited: race in its original format or in binary format as an independent variable for predictive modelling with mental illness as the target.
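One way to make this comparison concrete is to compute the Gini reduction of the fixed 0/1 grouping and compare it with the best two-way grouping of the six original levels. The sketch below assumes a tree that makes binary splits on groupings of nominal levels, and every count in it is invented for illustration; none comes from the dataset in this thread. The names `counts`, `gini` and `split_gain` are hypothetical.

```python
from itertools import combinations

# Hypothetical counts per race level: (no mental illness, mental illness).
# Made-up numbers for illustration only, not taken from any real data.
counts = {
    "Black": (80, 20),
    "White": (60, 40),
    "Asian": (45, 5),
    "Native": (25, 5),
    "Hispanic": (70, 30),
    "Other": (15, 5),
}

def gini(neg, pos):
    """Gini impurity of a node holding neg/pos class counts."""
    n = neg + pos
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

def split_gain(left_levels):
    """Gini reduction when left_levels go to one branch and the rest to the other."""
    left = [v for k, v in counts.items() if k in left_levels]
    right = [v for k, v in counts.items() if k not in left_levels]
    ln, lp = sum(c[0] for c in left), sum(c[1] for c in left)
    rn, rp = sum(c[0] for c in right), sum(c[1] for c in right)
    total = ln + lp + rn + rp
    parent = gini(ln + rn, lp + rp)
    children = ((ln + lp) * gini(ln, lp) + (rn + rp) * gini(rn, rp)) / total
    return parent - children

# The fixed recode from the thread: 1 = Black, Native, Asian.
fixed_gain = split_gain({"Black", "Native", "Asian"})

# With the nominal input, every two-way grouping of levels is a candidate split.
levels = list(counts)
candidates = [set(c) for r in range(1, len(levels)) for c in combinations(levels, r)]
best = max(candidates, key=split_gain)
best_gain = split_gain(best)

print("fixed grouping gain:", round(fixed_gain, 4))
print("best grouping:", sorted(best), "gain:", round(best_gain, 4))
```

Because the fixed recode is itself one of the candidate groupings, the best grouping found on the nominal variable can never have a smaller Gini reduction on the training data than the fixed one. Whether that advantage holds up on the validation partition is a separate question.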

Reeza · Posted 06-30-2021 12:56 PM

If Mental Illness is your target, to some degree it doesn't matter. The degree depends on how you specified the parameterization of the nominal variable. If you used the default, I think it uses the GLM method, which is not the same as the binary coding, but your model should be the same. If you used the REF parameterization method, then it will be exactly the same whichever variable you use. SAS will parameterize the variable for you automatically, so there's usually no need to create the dummy variables manually, unless your data can record multiple races per person, in which case use your manually created indicator variables.
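As a rough illustration of the two parameterizations mentioned here, the Python sketch below builds the design columns by hand. It is not SAS code; the function names `glm_coding` and `ref_coding` and the choice of "White" as the reference level are assumptions made for the example.

```python
LEVELS = ["Asian", "Black", "Hispanic", "Native", "Other", "White"]

def glm_coding(value, levels=LEVELS):
    """GLM-style coding: one 0/1 column per level (six columns here)."""
    return [1 if value == lvl else 0 for lvl in levels]

def ref_coding(value, levels=LEVELS, ref="White"):
    """Reference coding: one 0/1 column per non-reference level.

    The reference level is represented by all zeros.
    """
    return [1 if value == lvl else 0 for lvl in levels if lvl != ref]

print(glm_coding("Black"))  # [0, 1, 0, 0, 0, 0]
print(ref_coding("Black"))  # [0, 1, 0, 0, 0]
print(ref_coding("White"))  # [0, 0, 0, 0, 0]
```

Both codings keep all six levels distinguishable; they differ only in whether the reference level gets its own column.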

Reeza · Posted 06-30-2021 01:01 PM

I think you (or I) are misusing the term 'target variable'. The target variable is what you're trying to predict, but you also refer to race as a target variable?

Ah yes, so originally my target race was a nominal variable with 6 levels: Black, White, Asian, Native, Hispanic and Other. After some research, I understood that SAS EM isn't suited for multinomial analysis, so I converted it into a binary target variable in SAS Studio, with 1 = Black, Native and Asian and 0 = White, Hispanic and Other, so it's easier to do predictive modeling with race as the target variable.

Are you trying to predict mental illness based on race? Vice versa doesn't make much sense to me....

b_smsha · Posted 06-30-2021 01:10 PM

I have to do prediction with both mental illness and race as target variables (in separate predictive models).

This is the original dataset without any cleaning. I didn't want to link my own version because I've changed some things, but these are the variables: https://github.com/washingtonpost/data-police-shootings

I have to do something like this:

https://towardsdatascience.com/an-examination-of-fatal-force-by-police-in-the-us-db897d97085c

The author of that article first built a prediction or classification model with race as the target, then built more models with mental illness as the target. So that's the whole gist of the project I need to do.

I'm using the above dataset to build a predictive model for each of mental illness and race separately, and to see which variables correlate most with each target.

Reeza · Posted 06-30-2021 01:39 PM

Both are classification-type problems then: in one case you have a binary target, in the other you have multiple levels. If both are your targets, then you would need each as a single variable, not multiple variables. Dummy coding to 0/1 with multiple variables only makes sense when race is an input variable in your model, not a target variable, since you'd then have a fully multivariate situation, which is likely beyond most ML models that I'm aware of at the moment.
