roushankumar
Fluorite | Level 6

My data is in the following format:

NAME   SALARY   DEPT   rank1
Dan    623.3    HR     1
Dan    515.2    HR     0
Dan    611      HR     1
Dan    729      HR     2
Rick   843.25   IT     2
Rick   578      IT     0
Rick   632.8    IT     1
Rick   722.5    IT     1

I want to use a multiclass classification model to predict rank1. SALARY and DEPT are my independent variables. Most algorithms score using the class that had the highest probability with respect to a reference level, but here I need the probability of every level. I also need to build the model by NAME, so that each distinct NAME has its own coefficients after training. I am OK with using logistic regression, KNN, Naive Bayes, or any other algorithm you suggest. The result should give the probability of each rank1 level for the scoring data. When I score new data (in the same format as my training data), I should get a result in the following format:

NAME   SALARY   DEPT   rank1_0   rank1_1   rank1_2
Dan    711      HR     0.25      0.6       0.15
Rick   819      IT     0.2       0.3       0.5
Dan    743      HR     0.1       0.2       0.7
Rick   688      IT     0.3       0.3       0.4

Columns rank1_0, rank1_1 and rank1_2 hold the probabilities of classes 0, 1 and 2 respectively.

 

Thanks for the help!

1 ACCEPTED SOLUTION

Reeza
Super User

@roushankumar wrote:

I believe logistic regression always generates coefficients with respect to a reference level and hence I said I am open to looking at any other algorithm.

I think that depends on the parameterization method. Check the GLM option (PARAM=GLM on the CLASS statement in PROC LOGISTIC).
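If the GLM option here refers to how classification predictors such as DEPT are coded (an assumption about the intent of this reply), the difference between reference coding and GLM-style coding can be sketched in plain Python. The helper names `reference_coding` and `glm_coding` are hypothetical and only mimic the idea; they are not SAS functions.

```python
def reference_coding(value, levels, ref):
    """One indicator per non-reference level; the reference level is all zeros."""
    return [1 if value == lv else 0 for lv in levels if lv != ref]


def glm_coding(value, levels):
    """One indicator per level, so no level is dropped from the design columns."""
    return [1 if value == lv else 0 for lv in levels]


DEPT_LEVELS = ["HR", "IT"]

# Choosing IT as the reference level is an arbitrary choice for illustration.
ref_rows = {d: reference_coding(d, DEPT_LEVELS, ref="IT") for d in DEPT_LEVELS}
glm_rows = {d: glm_coding(d, DEPT_LEVELS) for d in DEPT_LEVELS}

print(ref_rows)  # {'HR': [1], 'IT': [0]}
print(glm_rows)  # {'HR': [1, 0], 'IT': [0, 1]}
```

Either coding describes the same predictor; only the layout of the coefficient table changes, and the predicted probabilities for a scored row should not depend on which one is used.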

 


@roushankumar wrote:

Thanks for looking into this. I did come across that resource earlier. The probability computations in the blog you suggested are done by varying one variable while holding the other variables constant. That would generate many probabilities for each row in my scoring data, as I have many variables.


You wouldn't be computing those variable-by-variable predictions. You're looking to score a particular data set, so you'd score that data set and get one set of predicted probabilities per row. The number of variables doesn't really matter much for scoring.
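As a rough illustration of that point, the plain-Python sketch below scores each row once and returns a probability for every rank1 level, even though the coefficients are expressed relative to a reference level. Everything in it is an assumption made for illustration: the `COEFS` numbers are invented, the helper `score_row` is hypothetical, SALARY is the only predictor, and level 0 is taken as the reference. The NAME value is used only to look up the matching coefficient set.

```python
import math

# Hypothetical (intercept, salary slope) pairs for levels 1 and 2, per NAME.
# Level 0 is the reference, so its linear predictor is fixed at 0.
COEFS = {
    "Dan": {1: (-6.0, 0.010), 2: (-14.0, 0.020)},
    "Rick": {1: (-5.0, 0.008), 2: (-13.0, 0.018)},
}


def score_row(name, salary):
    """Return {level: probability} covering every rank1 level."""
    etas = {0: 0.0}
    for level, (intercept, slope) in COEFS[name].items():
        etas[level] = intercept + slope * salary
    top = max(etas.values())  # subtract the max for numerical stability
    exps = {level: math.exp(eta - top) for level, eta in etas.items()}
    total = sum(exps.values())
    return {level: value / total for level, value in exps.items()}


# Scoring rows in the format shown in the question.
score_data = [
    ("Dan", 711.0, "HR"),
    ("Rick", 819.0, "IT"),
    ("Dan", 743.0, "HR"),
    ("Rick", 688.0, "IT"),
]

scored = []
for name, salary, dept in score_data:
    probs = score_row(name, salary)
    scored.append({
        "NAME": name, "SALARY": salary, "DEPT": dept,
        "rank1_0": probs[0], "rank1_1": probs[1], "rank1_2": probs[2],
    })

for row in scored:
    print(row["NAME"], row["SALARY"], row["DEPT"],
          round(row["rank1_0"], 3), round(row["rank1_1"], 3), round(row["rank1_2"], 3))
```

The reference level gets its probability from the same normalization as the other levels, so each row yields three probabilities that sum to 1 no matter how many predictors feed the linear predictors.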

 

Do you have SAS Enterprise Miner (EM)? PROC LOGISTIC in SAS/STAT can do logistic regression, but for any of the other options you need SAS Enterprise Miner.

 

 


3 REPLIES
Reeza
Super User

You can use multinomial logistic regression. There's a tutorial on that here:

https://stats.idre.ucla.edu/sas/dae/multinomiallogistic-regression/

 

I'm assuming, of course, that you have significantly more data. Having only 3 or 4 observations per name will make it problematic to predict 3 levels of rank, as in likely not possible or not reliable.

 

Using a BY statement on NAME will ensure custom models for each NAME. 
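To illustrate the idea of one model per NAME outside of SAS, here is a minimal plain-Python sketch. It is not PROC LOGISTIC: the gradient-ascent fitter `fit`, the helper `predict`, the step count and the learning rate are all assumptions made for illustration. It uses the mocked rows from the question, leaves out DEPT because it is constant within each NAME in that sample, and treats rank1 = 0 as the reference level. With only four rows per NAME the fitted numbers are not reliable, as noted above.

```python
import math
from collections import defaultdict

# Mocked training rows from the question: (NAME, SALARY, rank1).
TRAIN = [
    ("Dan", 623.3, 1), ("Dan", 515.2, 0), ("Dan", 611.0, 1), ("Dan", 729.0, 2),
    ("Rick", 843.25, 2), ("Rick", 578.0, 0), ("Rick", 632.8, 1), ("Rick", 722.5, 1),
]
LEVELS = [0, 1, 2]  # level 0 is treated as the reference level


def softmax(etas):
    """Turn linear predictors into probabilities that sum to 1."""
    top = max(etas)
    exps = [math.exp(e - top) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]


def fit(rows, steps=2000, lr=0.1):
    """Fit an intercept and a slope for levels 1 and 2 by gradient ascent
    on the multinomial log-likelihood; level 0 stays fixed at zero."""
    xs = [x for x, _ in rows]
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
    coef = {k: [0.0, 0.0] for k in LEVELS}
    for _ in range(steps):
        grad = {k: [0.0, 0.0] for k in LEVELS[1:]}
        for x, y in rows:
            z = (x - mean) / sd  # standardized salary
            # Levels are 0, 1, 2, so a level doubles as its list index in p.
            p = softmax([coef[k][0] + coef[k][1] * z for k in LEVELS])
            for k in LEVELS[1:]:
                err = (1.0 if y == k else 0.0) - p[k]
                grad[k][0] += err
                grad[k][1] += err * z
        for k in grad:
            coef[k][0] += lr * grad[k][0] / len(rows)
            coef[k][1] += lr * grad[k][1] / len(rows)
    return {"mean": mean, "sd": sd, "coef": coef}


# Group the rows by NAME and fit one model per group.
by_name = defaultdict(list)
for name, salary, rank in TRAIN:
    by_name[name].append((salary, rank))
models = {name: fit(rows) for name, rows in by_name.items()}


def predict(name, salary):
    """Score one row with the model that belongs to its NAME."""
    m = models[name]
    z = (salary - m["mean"]) / m["sd"]
    return softmax([m["coef"][k][0] + m["coef"][k][1] * z for k in LEVELS])


for name, salary in [("Dan", 711.0), ("Rick", 819.0)]:
    print(name, salary, [round(p, 3) for p in predict(name, salary)])
```

Each NAME ends up with its own coefficient set in `models`, and `predict` returns the probabilities for levels 0, 1 and 2 in that order.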

 

Why does your output have two rows for Rick and Dan each? 

 




 

roushankumar
Fluorite | Level 6

Thanks for looking into this. I did come across that resource earlier. The probability computations in the blog you suggested are done by varying one variable while holding the other variables constant. That would generate many probabilities for each row in my scoring data, as I have many variables. The data I showed is mocked up. I believe logistic regression always generates coefficients with respect to a reference level, which is why I said I am open to looking at any other algorithm.

To your question, 'Why does your output have two rows for Rick and Dan each?': the NAME values Rick and Dan will not be used as predictors in scoring, but they will be used to join the scoring data with the right set of coefficients, as we will have one set of coefficients for each NAME.


 

 
