My data is in the following format
NAME SALARY DEPT rank1
Dan 623.3 HR 1
Dan 515.2 HR 0
Dan 611 HR 1
Dan 729 HR 2
Rick 843.25 IT 2
Rick 578 IT 0
Rick 632.8 IT 1
Rick 722.5 IT 1
I want to use a multiclass classification model to predict rank1, with SALARY and DEPT as my independent variables. Most algorithms score using the class that has the highest probability with respect to a reference level, but here I need the probability of every level. I also need to build the model by NAME, so that each distinct NAME has its own coefficients after training. I am OK with logistic regression, KNN, Naive Bayes, or any other algorithm you suggest. The result should give the probability of each rank1 level for the scoring data. When I score new data (in the same format as my training data), I should get a result in the following format:
NAME SALARY DEPT rank1_0 rank1_1 rank1_2
Dan 711 HR 0.25 0.6 0.15
Rick 819 IT 0.2 0.3 0.5
Dan 743 HR 0.1 0.2 0.7
Rick 688 IT 0.3 0.3 0.4
Columns rank1_0, rank1_1 and rank1_2 hold the probabilities of classes 0, 1 and 2 respectively.
Thanks for the help!
Accepted Solutions
@roushankumar wrote:
I believe logistic regression always generates coefficients with respect to a reference level and hence I said I am open to looking at any other algorithm
I think that depends on the parameterization method. Check the GLM option.
@roushankumar wrote:
Thanks for looking into this. I did come across the resource earlier. The probability computations in the blog you suggested are done looking at one variable and keeping other variable constant. That will generate many probabilities for each row in my scoring data as I have many variables.
You wouldn't be computing those per-variable predictions. You're looking to score a particular data set, so you'd score that to get your probabilities for each row. The number of variables doesn't really matter much to the scoring.
Do you have SAS EM? Base SAS can do logistic regression, but for any of the other options you need SAS Enterprise Miner.
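On the reference-level concern: even when a multinomial (generalized logit) model reports coefficients relative to a reference level, a probability for every level, including the reference, can still be recovered. The sketch below is plain Python rather than SAS and only illustrates the arithmetic. The function name `class_probabilities` and the coefficient values are made up for illustration and are not fitted to the data in this thread.

```python
import math

def class_probabilities(salary, coefs, reference=0):
    """Return P(level) for every level, given reference-coded coefficients.

    coefs maps each non-reference level to (intercept, slope).
    The reference level has its linear predictor fixed at 0.
    """
    eta = {reference: 0.0}
    for level, (intercept, slope) in coefs.items():
        eta[level] = intercept + slope * salary
    # Softmax over all levels, shifted by the max for numerical stability.
    shift = max(eta.values())
    weights = {level: math.exp(value - shift) for level, value in eta.items()}
    total = sum(weights.values())
    return {level: weights[level] / total for level in sorted(weights)}

# Hypothetical coefficients for levels 1 and 2 versus reference level 0.
example_coefs = {1: (-6.0, 0.01), 2: (-14.0, 0.02)}
probs = class_probabilities(711.0, example_coefs)
print(probs)  # one probability per level; they sum to 1
```

The choice of reference level changes the coefficients but not the resulting probabilities, so a reference-coded model is not an obstacle to getting rank1_0, rank1_1 and rank1_2.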
You can use multinomial logistic regression. There's a tutorial on that here:
https://stats.idre.ucla.edu/sas/dae/multinomiallogistic-regression/
I'm assuming, of course, that you have significantly more data. Having only 3 or 4 observations per name will be problematic for predicting 3 levels of rank, as in likely not possible or not reliable.
Using a BY statement on NAME will ensure custom models for each NAME.
Why does your output have two rows for Rick and Dan each?
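To make the sample-size concern concrete, one quick check before fitting a model per NAME is to count how many training rows each NAME has for each rank1 level. This is a minimal sketch in plain Python (not SAS), using the eight mocked-up rows from the question; the function name `class_counts_by_name` is hypothetical.

```python
from collections import Counter, defaultdict

# The mocked-up training rows from the question: NAME, SALARY, DEPT, rank1.
training_rows = [
    ("Dan", 623.3, "HR", 1),
    ("Dan", 515.2, "HR", 0),
    ("Dan", 611.0, "HR", 1),
    ("Dan", 729.0, "HR", 2),
    ("Rick", 843.25, "IT", 2),
    ("Rick", 578.0, "IT", 0),
    ("Rick", 632.8, "IT", 1),
    ("Rick", 722.5, "IT", 1),
]

def class_counts_by_name(rows):
    """Count training rows per rank1 level within each NAME."""
    counts = defaultdict(Counter)
    for name, _salary, _dept, rank in rows:
        counts[name][rank] += 1
    return counts

counts = class_counts_by_name(training_rows)
for name in sorted(counts):
    print(name, dict(sorted(counts[name].items())))
# Dan {0: 1, 1: 2, 2: 1}
# Rick {0: 1, 1: 2, 2: 1}
```

With one or two rows per level, a per-NAME model has almost nothing to estimate from. Note also that DEPT is constant within each NAME in this sample, so it could not contribute to a model fitted separately by NAME.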
Thanks for looking into this. I did come across the resource earlier. The probability computations in the blog you suggested are done by looking at one variable while keeping the other variables constant. That would generate many probabilities for each row in my scoring data, as I have many variables. The data I showed is mocked up. I believe logistic regression always generates coefficients with respect to a reference level, and hence I said I am open to looking at any other algorithm. To your question, 'Why does your output have two rows for Rick and Dan each?': the NAME values Rick and Dan will not be used as predictors in scoring, but they will be used to join the scoring data with the right set of coefficients, as we will have one set of coefficients for each NAME.
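The join described here, where NAME only selects which set of coefficients to apply, can be sketched as follows. This is plain Python rather than SAS, and everything in it is an assumption for illustration: the `models` dictionary, its coefficient values, and the function names are hypothetical, and SALARY is the only predictor because DEPT does not vary within a NAME in the sample.

```python
import math

# Hypothetical per-NAME coefficients: level -> (intercept, slope on SALARY),
# each relative to reference level 0. Not fitted to any real data.
models = {
    "Dan": {1: (-6.0, 0.01), 2: (-14.0, 0.02)},
    "Rick": {1: (-5.0, 0.008), 2: (-13.0, 0.018)},
}

def level_probabilities(salary, coefs, reference=0):
    """Softmax over all levels; the reference level's predictor is 0."""
    eta = {reference: 0.0}
    for level, (intercept, slope) in coefs.items():
        eta[level] = intercept + slope * salary
    shift = max(eta.values())
    weights = {level: math.exp(value - shift) for level, value in eta.items()}
    total = sum(weights.values())
    return {level: weights[level] / total for level in sorted(weights)}

def score(rows, models):
    """Look up each row's model by NAME and append rank1_<level> columns."""
    scored = []
    for name, salary, dept in rows:
        probs = level_probabilities(salary, models[name])
        row = {"NAME": name, "SALARY": salary, "DEPT": dept}
        row.update({f"rank1_{level}": p for level, p in probs.items()})
        scored.append(row)
    return scored

# Scoring rows in the same layout as the desired output in the question.
scoring_rows = [("Dan", 711, "HR"), ("Rick", 819, "IT"),
                ("Dan", 743, "HR"), ("Rick", 688, "IT")]
scored = score(scoring_rows, models)
for row in scored:
    print(row)
```

Each scored row keeps its original order and gains one probability column per rank1 level, which matches the requested output layout; repeated NAME values simply reuse the same coefficients.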