SAS Data Science

Generating probabilities for each class in a multi-class classification problem in SAS

roushankumar · Posted 09-06-2018 08:25 PM

My data is in the following format:

NAME  SALARY  DEPT  rank1
Dan   623.3   HR    1
Dan   515.2   HR    0
Dan   611     HR    1
Dan   729     HR    2
Rick  843.25  IT    2
Rick  578     IT    0
Rick  632.8   IT    1
Rick  722.5   IT    1

I want to use a multiclass classification model to predict rank1, with SALARY and DEPT as my independent variables. Most algorithms score each row with only the class that has the highest probability, computed with respect to a reference level, but here I need the probability of every level. I also need to build the model by NAME, so that each distinct NAME has its own coefficients after training. I am OK with using logistic regression, KNN, Naive Bayes, or any other algorithm you suggest. When I score new data (in the same format as my training data), the result should give the probability of each rank1 level, in the following format:

NAME  SALARY  DEPT  rank1_0  rank1_1  rank1_2
Dan   711     HR    0.25     0.6      0.15
Rick  819     IT    0.2      0.3      0.5
Dan   743     HR    0.1      0.2      0.7
Rick  688     IT    0.3      0.3      0.4

Columns rank1_0, rank1_1 and rank1_2 hold the probabilities of classes 0, 1 and 2 respectively.

Thanks for the help!

Reeza · Posted 09-06-2018 10:16 PM

You can use multinomial regression. There's a tutorial here on that:

https://stats.idre.ucla.edu/sas/dae/multinomiallogistic-regression/

I'm assuming, of course, that you have significantly more data. With only 3 or 4 observations per name, predicting 3 levels of rank will be problematic, as in likely not possible or not reliable.

Using a BY statement on NAME will fit a separate model for each NAME.

Why does your output have two rows for Rick and Dan each?
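
To illustrate what a BY statement on NAME amounts to, here is a minimal sketch outside SAS, in Python, that splits the mock rows by NAME and fits a separate softmax (multinomial logistic) model to each group. It is not PROC LOGISTIC and not a recommended estimator: the plain gradient-descent fitter, the function names fit_softmax and predict, the step count and the learning rate are all assumptions made for illustration. DEPT is left out because in the mock data it is constant within each NAME, so it carries no information inside a per-NAME model.

```python
import math
from collections import defaultdict

# Mock rows copied from the question: (NAME, SALARY, DEPT, rank1).
ROWS = [
    ("Dan", 623.3, "HR", 1), ("Dan", 515.2, "HR", 0),
    ("Dan", 611.0, "HR", 1), ("Dan", 729.0, "HR", 2),
    ("Rick", 843.25, "IT", 2), ("Rick", 578.0, "IT", 0),
    ("Rick", 632.8, "IT", 1), ("Rick", 722.5, "IT", 1),
]
CLASSES = [0, 1, 2]


def softmax(z):
    """Turn a list of linear predictors into probabilities that sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def fit_softmax(xs, ys, steps=2000, lr=0.1):
    """Hypothetical helper: fit one intercept and one slope per class by
    gradient descent on a single standardized predictor."""
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs)) or 1.0
    w = [[0.0, 0.0] for _ in CLASSES]
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in CLASSES]
        for x, y in zip(xs, ys):
            t = (x - mean) / sd
            p = softmax([b0 + b1 * t for b0, b1 in w])
            for k, c in enumerate(CLASSES):
                err = p[k] - (1.0 if y == c else 0.0)
                grad[k][0] += err
                grad[k][1] += err * t
        for k in range(len(CLASSES)):
            w[k][0] -= lr * grad[k][0] / len(xs)
            w[k][1] -= lr * grad[k][1] / len(xs)
    return {"mean": mean, "sd": sd, "w": w}


def predict(model, x):
    """Return [P(rank1=0), P(rank1=1), P(rank1=2)] for one SALARY value."""
    t = (x - model["mean"]) / model["sd"]
    return softmax([b0 + b1 * t for b0, b1 in model["w"]])


# The analogue of BY NAME: one training set, and one model, per NAME.
by_name = defaultdict(list)
for name, salary, dept, rank in ROWS:
    by_name[name].append((salary, rank))

models = {
    name: fit_softmax([s for s, _ in obs], [r for _, r in obs])
    for name, obs in by_name.items()
}

for name, salary in [("Dan", 711.0), ("Rick", 819.0)]:
    probs = predict(models[name], salary)
    print(name, salary, [round(p, 3) for p in probs])
```

With only four rows per NAME, the three classes in the mock data appear to be perfectly separated by SALARY, so the fitted probabilities mostly reflect where the iteration stopped. That is the reliability concern raised above.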

roushankumar · Posted 09-07-2018 11:31 AM

Thanks for looking into this. I did come across that resource earlier. The probability computations in the tutorial you suggested are done by varying one variable while holding the other variables constant. That would generate many probabilities for each row in my scoring data, as I have many variables. The data I showed is mocked up.

I believe logistic regression always generates coefficients with respect to a reference level, and hence I said I am open to looking at any other algorithm.

To your question, "Why does your output have two rows for Rick and Dan each?": the NAME values Rick and Dan will not be used as predictors in scoring, but they will be used to join the scoring data to the right set of coefficients, as we will have one set of coefficients for each NAME.
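
The join described here can be sketched as follows, in Python rather than SAS. Each scoring row looks up the coefficient set for its NAME and converts the reference-level coefficients into one probability per class. The coefficient values, the COEFS layout and the score_row function are hypothetical, made up purely to show the mechanics; level 0 is assumed to be the reference level and only SALARY is used as a predictor.

```python
import math

# Hypothetical per-NAME coefficients: (intercept, SALARY slope) for the
# non-reference levels 1 and 2. Level 0 is the assumed reference level,
# so it has no coefficients of its own.
COEFS = {
    "Dan":  {1: (-6.0, 0.010), 2: (-14.0, 0.020)},
    "Rick": {1: (-5.0, 0.008), 2: (-12.0, 0.016)},
}


def score_row(name, salary):
    """Return {level: probability} for one row using that NAME's coefficients."""
    coefs = COEFS[name]  # the "join" on NAME
    eta = {level: b0 + b1 * salary for level, (b0, b1) in coefs.items()}
    denom = 1.0 + sum(math.exp(v) for v in eta.values())
    probs = {0: 1.0 / denom}  # reference level
    probs.update({level: math.exp(v) / denom for level, v in eta.items()})
    return probs


# Scoring rows in the same layout as the question: (NAME, SALARY, DEPT).
scoring = [
    ("Dan", 711.0, "HR"), ("Rick", 819.0, "IT"),
    ("Dan", 743.0, "HR"), ("Rick", 688.0, "IT"),
]

scored = []
for name, salary, dept in scoring:
    p = score_row(name, salary)
    scored.append({
        "NAME": name, "SALARY": salary, "DEPT": dept,
        "rank1_0": p[0], "rank1_1": p[1], "rank1_2": p[2],
    })

for row in scored:
    print(row)
```

Each scoring row yields exactly one probability per level, however many predictors enter the linear predictor.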

Reeza · Posted 09-07-2018 11:44 AM

@roushankumar wrote:

"I believe logistic regression always generates coefficients with respect to a reference level and hence I said I am open to looking at any other algorithm."

I think that depends on the parameterization method. Check the GLM option.

@roushankumar wrote:

"Thanks for looking into this. I did come across the resource earlier. The probability computations in the blog you suggested are done looking at one variable and keeping the other variables constant. That will generate many probabilities for each row in my scoring data as I have many variables."

You wouldn't be computing those per-variable predictions. You're looking to score a particular data set, so you'd score that data set to get your probabilities. The number of variables doesn't really matter that much to the scoring.

Do you have SAS EM? Base SAS can do logistic regression, but for any of the other options you need SAS Enterprise Miner.
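
On the reference-level concern: in a multinomial (generalized logit) model the reference level only fixes how the coefficients are expressed, and the predicted probabilities for all levels, including the reference, follow from them. The Python sketch below, under assumed coefficient values, re-expresses a set of coefficients against a different reference level and checks that the probabilities are unchanged. The functions probs_from_reference and rebase and all numbers are hypothetical, and this illustrates the math only, not any SAS option.

```python
import math

LEVELS = [0, 1, 2]


def probs_from_reference(coefs, ref, x):
    """Baseline-category logit: the reference level's linear predictor is 0."""
    eta = {
        k: 0.0 if k == ref else coefs[k][0] + coefs[k][1] * x
        for k in LEVELS
    }
    denom = sum(math.exp(v) for v in eta.values())
    return {k: math.exp(eta[k]) / denom for k in LEVELS}


def rebase(coefs, old_ref, new_ref):
    """Re-express (intercept, slope) pairs against a different reference level."""
    full = dict(coefs)
    full[old_ref] = (0.0, 0.0)
    b0_ref, b1_ref = full[new_ref]
    return {
        k: (full[k][0] - b0_ref, full[k][1] - b1_ref)
        for k in LEVELS if k != new_ref
    }


# Hypothetical coefficients with level 0 as the reference.
coefs_ref0 = {1: (-6.0, 0.010), 2: (-14.0, 0.020)}
# The same model written with level 2 as the reference.
coefs_ref2 = rebase(coefs_ref0, old_ref=0, new_ref=2)

p_ref0 = probs_from_reference(coefs_ref0, ref=0, x=711.0)
p_ref2 = probs_from_reference(coefs_ref2, ref=2, x=711.0)

for k in LEVELS:
    print(k, round(p_ref0[k], 6), round(p_ref2[k], 6))
```

Both parameterizations print the same three probabilities for each level, so a reference level in the coefficient table does not prevent getting rank1_0, rank1_1 and rank1_2 for every scored row.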
