Hi Experts,
I have a very simple question.
Assume that I have a variable A that contains two classes: class 1 and class 2.
I build a linear regression or logistic regression model using this variable.
If I use that model to score a new dataset that contains new classes, such as class 4, class 5, etc., how will the model handle that?
Technically, the model you built cannot make predictions for Class 4 or Class 5.
The best way to handle this is to build a model that includes all possible classes. If that's not possible, you can assign the new classes to one of the existing classes, although this idea is very unsatisfying.
There are several ways to score a regression model in SAS, but they should all predict a missing value when they encounter a level of a categorical variable that is not part of the original model. For example, two ways to score a regression model are to use PROC PLM (my preferred method) or the CODE statement, which generates DATA step code. In both cases, the predicted value is missing for levels that are not part of the model:
proc glm data=sashelp.cars;
   where Origin in ('USA' 'Asia');   /* Exclude 'Europe' from model */
   class Origin;
   model mpg_city = origin horsepower;
   store out=ScoreExample;           /* store the model */
   code file='glmScore.sas';         /* write DATA step scoring code */
quit;

/* the scoring data; evaluate model on these values
   This includes an observation with Origin='Europe' */
data ScoreData;
   length Origin $6;
   input Origin horsepower;
   datalines;
USA 300
Asia 350
Europe 280
;

/* test scoring with PROC PLM */
proc plm restore=ScoreExample;
   score data=ScoreData out=Pred;    /* evaluate the model on new data */
run;
proc print data=Pred; run;

/* test scoring with DATA step */
data Pred2;
   set ScoreData;
   %include 'glmScore.sas';
run;
proc print data=Pred2; run;
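Before scoring, it can help to list the observations whose level was never seen by the model, since those are the ones that will receive a missing prediction. The following is a minimal sketch, assuming the ScoreData data set and the training subset of sashelp.cars from the example above. The data set names TrainLevels and Unseen are hypothetical and used only for illustration.

```sas
/* Levels of Origin that were used to fit the model
   (same WHERE clause as the PROC GLM step) */
proc sql;
   create table TrainLevels as
   select distinct Origin
   from sashelp.cars
   where Origin in ('USA', 'Asia');
quit;

/* Scoring observations whose level does not appear in the training data.
   TrainLevels and Unseen are illustrative names, not part of the original example. */
proc sql;
   create table Unseen as
   select s.*
   from ScoreData as s
   where s.Origin not in (select Origin from TrainLevels);
quit;

proc print data=Unseen; run;
```

With the example data, the Unseen data set should contain only the observation with Origin='Europe'.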
Nothing in the answer from @Rick_SAS is limited to just one unseen category; it applies to hundreds of unseen categories as well.
My original statement still holds: The scoring method will predict a missing value when it encounters a level of a categorical variable that is not part of the original model. It doesn't matter how many unseen levels it encounters.
@Reeza wrote:
If your model hasn't been trained with that category, it cannot predict for that category.
In those cases, other methods are required. One is to run a clustering analysis, see which existing category is most similar, and slot the new category into that one, but how you deal with this situation is very context dependent.
This assumes that category 3 can be clustered with a similar category, in which case it's a reasonable thing to do; but there's no guarantee that a new category does cluster well with other categories.
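If you do decide to assign an unseen level to an existing one, the recoding can be done in a DATA step before scoring. This is only a sketch under stated assumptions: it reuses ScoreData and the ScoreExample item store from the earlier example, and the choice to map 'Europe' to 'Asia' is arbitrary and purely illustrative, not a recommendation. The names ScoreRecoded, PredRecoded, and OrigOrigin are hypothetical.

```sas
/* Recode the unseen level to a level that the model knows.
   ASSUMPTION: mapping 'Europe' to 'Asia' is for illustration only;
   in practice the mapping should come from subject-matter knowledge
   or a similarity analysis. */
data ScoreRecoded;
   set ScoreData;
   OrigOrigin = Origin;                        /* keep the original level for auditing */
   if Origin = 'Europe' then Origin = 'Asia';
run;

/* Score the recoded data with the stored model */
proc plm restore=ScoreExample;
   score data=ScoreRecoded out=PredRecoded;
run;

proc print data=PredRecoded; run;
```

Keeping the original level in a separate variable makes it easy to see afterward which predictions depend on the recoding.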