gyambqt
Obsidian | Level 7

Hi Experts,

 

I have a very simple question.

 

Assume that I have a variable A that contains two classes: class 1 and class 2.

 

I build a linear regression or logistic regression model using this variable.

 

If I use that model to score a new dataset that contains new classes, such as class 4 or class 5, how will the model handle them?

7 REPLIES
PaigeMiller
Diamond | Level 26

Technically, the model you built cannot make predictions for Class 4 or Class 5.

 

The best way to handle this is to build a model that includes all possible classes. If that's not possible, you can assign each new class to one of the existing classes, although this idea is very unsatisfying.
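A minimal sketch of the second option, assuming A is numeric with model classes 1 and 2. The data set names (NewData, NewDataMapped), the predictor x, and the choice of class 2 as the fallback are all hypothetical and only for illustration:

```sas
/* Hypothetical scoring data: A contains classes 4 and 5,
   which were not in the data used to fit the model */
data NewData;
   input A x;
   datalines;
1 1.5
2 2.0
4 0.7
5 3.1
;

/* Recode unseen classes to an existing class before scoring.
   The fallback class (2) is an arbitrary, illustrative choice. */
data NewDataMapped;
   set NewData;
   A_orig   = A;                    /* keep the original class for auditing  */
   Remapped = not (A in (1, 2));    /* 1 if the class was not in the model   */
   if Remapped then A = 2;          /* note: a missing A is also recoded     */
run;

proc print data=NewDataMapped; run;
```

Keeping A_orig and the Remapped flag makes it possible to report afterwards which predictions rest on a recoded class.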

--
Paige Miller
Rick_SAS
SAS Super FREQ

There are several ways to score a regression model in SAS, but they should all predict a missing value when they encounter a level of a categorical variable that is not part of the original model. For example, two ways to score a regression model are to use PROC PLM (my preferred method) or the CODE statement, which generates DATA step code. In both cases, the predicted value is missing for levels that are not part of the model:

proc glm data=sashelp.cars;
   where Origin in ('USA' 'Asia');  /* Exclude 'Europe' from model */
   class Origin;
   model mpg_city = origin horsepower;
   store out=ScoreExample;          /* store the model */
   code file='glmScore.sas';        /* write DATA step scoring code */
quit;

/* the scoring data; evaluate model on these values
   This includes an observation with Origin='Europe' */
data ScoreData;
   length Origin $6;
   input Origin horsepower;
datalines;
USA    300
Asia   350
Europe 280
;
 
/* test scoring with PROC PLM */
proc plm restore=ScoreExample;
   score data=ScoreData out=Pred;  /* evaluate the model on new data */
run;

proc print data=Pred; run;

/* test scoring with DATA step */
data Pred2;
   set ScoreData;
   %include 'glmScore.sas';
run;

proc print data=Pred2; run;

[Image: Region Capture.png]

gyambqt
Obsidian | Level 7
What if more than one category was not seen in the original model? Your example assumes only one unseen category.
PaigeMiller
Diamond | Level 26

Nothing from @Rick_SAS is limited to just one unseen category; it applies to hundreds of unseen categories as well.

--
Paige Miller
Rick_SAS
SAS Super FREQ

My original statement still holds: The scoring method will predict a missing value when it encounters a level of a categorical variable that is not part of the original model. It doesn't matter how many unseen levels it encounters.
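One way to check this yourself is a sketch like the following, which refits the same kind of model on sashelp.cars and scores data containing several unseen levels. The item store name (MultiExample), the data set names, and the level names 'Africa' and 'Other' are hypothetical and chosen only for illustration:

```sas
/* Fit on two levels only, as in the earlier example */
proc glm data=sashelp.cars;
   where Origin in ('USA' 'Asia');
   class Origin;
   model mpg_city = origin horsepower;
   store out=MultiExample;          /* hypothetical item store name */
quit;

/* Scoring data with three levels the model never saw */
data ScoreMulti;
   length Origin $6;
   input Origin horsepower;
   datalines;
USA    300
Asia   350
Europe 280
Africa 150
Other  200
;

proc plm restore=MultiExample;
   score data=ScoreMulti out=PredMulti;
run;

proc print data=PredMulti; run;

/* Count how many rows could not be scored */
proc means data=PredMulti n nmiss;
   var Predicted;
run;
```

If the statement above holds, the rows for 'Europe', 'Africa', and 'Other' should each have a missing value for Predicted, and only the 'USA' and 'Asia' rows should be scored.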

Reeza
Super User
If your model hasn't been trained with that category, it cannot predict for that category.

In those cases, other methods are required. One is to run a clustering analysis, see which existing category the new one is most similar to, and slot it into that category. How you deal with this situation is very context dependent, though.
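A rough sketch of that idea, assuming similarity between categories is judged by their mean values on a few numeric variables. The use of sashelp.cars, the variables chosen (horsepower, weight, enginesize), and the data set names are assumptions for illustration, not a recommended recipe:

```sas
/* One row per category, holding the mean of each numeric variable */
proc means data=sashelp.cars noprint nway;
   class Origin;
   var horsepower weight enginesize;
   output out=LevelMeans mean=;
run;

/* Hierarchical clustering of the category profiles.
   STD standardizes the variables so that no single scale dominates. */
proc cluster data=LevelMeans method=average std outtree=LevelTree;
   id Origin;
   var horsepower weight enginesize;
run;
```

The cluster history shows which categories join first, which is one possible reading of "most similar". Whether that similarity justifies reusing another category's coefficient remains a judgment call.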
PaigeMiller
Diamond | Level 26

@Reeza wrote:
If your model hasn't been trained with that category, it cannot predict for that category.

In those cases, other methods are required. One is to run a clustering analysis, see which existing category the new one is most similar to, and slot it into that category. How you deal with this situation is very context dependent, though.

This assumes that the new category can be clustered with a similar existing category, in which case it's a reasonable thing to do; but there's no guarantee that a new category clusters well with any of the existing categories.

--
Paige Miller

