Sample 24993: Score new data using a nominal multinomial logistic model
/* The training data set
================================================================*/
data operate;
input hospital trt $ severity $ wt @@;
cards;
1 a none 23 1 a slight 7 1 a moderate 2
1 b none 23 1 b slight 10 1 b moderate 5
1 c none 20 1 c slight 13 1 c moderate 5
1 d none 24 1 d slight 10 1 d moderate 6
2 a none 18 2 a slight 6 2 a moderate 1
2 b none 18 2 b slight 6 2 b moderate 2
2 c none 13 2 c slight 13 2 c moderate 2
2 d none 3 2 d slight 20 2 d moderate 2
3 a none 8 3 a slight 6 3 a moderate 3
3 b none 12 3 b slight 4 3 b moderate 4
3 c none 11 3 c slight 6 3 c moderate 2
3 d none 7 3 d slight 7 3 d moderate 4
4 a none 12 4 a slight 9 4 a moderate 1
4 b none 15 4 b slight 3 4 b moderate 2
4 c none 14 4 c slight 8 4 c moderate 3
4 d none 13 4 d slight 6 4 d moderate 4
;
/* ---------------------- CATMOD method ------------------------- */
/* Fit the model and output the predicted values for each observed
sample. This must be a generalized logit model (no keyword on the
RESPONSE statement before the slash) and all predictors must be
categorical (no DIRECT statement used).
================================================================*/
proc catmod data=operate order=data;
weight wt;
response / out=preds;
model severity=trt hospital;
run;
quit;
/* Keep just the predicted values, predictors, and response
================================================================*/
data pred2;
set preds;
if _type_='PROB';
keep severity trt hospital _pred_;
run;
/* Find predicted response level (level with highest predicted
probability) in each sample.
================================================================*/
proc summary data=pred2 nway;
class trt hospital;
var _pred_;
output out=predlvl (drop=_type_ _freq_)
maxid(_pred_(severity))=predlvl;
run;
/* Transpose the predicted values so that there is one observation per
sample containing predicted values for each response level.
================================================================*/
proc transpose data=pred2 out=pred3 (drop=_name_);
by trt hospital;
id severity;
var _pred_;
run;
/* Create a data set for scoring containing various values of the
predictors. For illustration, it includes values that were not
present in the original data set.
================================================================*/
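data a;
drop t n;
/* Draw hospital and the treatment code t from 1-5 with equal
probability. Hospital 5 and treatment 'e' do not occur in OPERATE. */
do n=1 to 100;
hospital=rantbl(239873,.2,.2,.2,.2,.2);
t=rantbl(239873,.2,.2,.2,.2,.2);
if t=1 then trt='a';
else if t=2 then trt='b';
else if t=3 then trt='c';
else if t=4 then trt='d';
else trt='e';
output;
end;
run;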
/* Sort the data set to be scored (A), the data set of predicted
probabilities (PRED3), and the data set of predicted levels (PREDLVL)
by the predictors.
================================================================*/
proc sort data=a; by hospital trt; run;
proc sort data=pred3; by hospital trt; run;
proc sort data=predlvl; by hospital trt; run;
/* Merge data set to be scored with data set of predicted values in
the order shown.
================================================================*/
data new;
merge a pred3 predlvl;
by hospital trt;
run;
/* Display the scored data set. Note that observations containing
predictor values not in the original data set have missing
predicted values since the model does not have parameters for these
values.
================================================================*/
proc print data=new;
run;
/* --------------------- LOGISTIC method ------------------------ */
/* Beginning in Release 8.2 (TS2M0), the LINK=GLOGIT option in PROC
LOGISTIC lets you fit the same generalized logit model that PROC
CATMOD fits. To score a new data set, append it to the original
data, ensuring that the response variable is missing in the appended
observations, and then refit the model. The added observations are
ignored when estimating the model, but they are scored by the OUTPUT
statement.
The following steps create a data set containing the original data
and the data set to be scored, fit the model, and score the new
observations.
================================================================*/
data b;
set operate a;
run;
proc logistic data=b;
class trt hospital;
freq wt;
model severity(order=data) = trt hospital / link=glogit;
output out=out predprobs=(i);
run;
proc print data=out;
run;
/* Beginning in SAS 9, scoring a new data set can be done using
the SCORE statement. Specify the training data set (OPERATE) in
the DATA= option in the PROC LOGISTIC statement and the data set to
score (A) in the DATA= option of the SCORE statement.
The following statements score data set A without the need to
concatenate and score the original data as well.
=========================================================================*/
proc logistic data=operate;
class trt hospital;
freq wt;
model severity(order=data) = trt hospital / link=glogit;
score data=a out=out;
run;
proc print data=out;
run;
/* Scoring can also be done at a later time using training model information
stored from a previous run. First, fit the model to the training data
and save the model information.
=========================================================================*/
proc logistic data=operate outmodel=model;
class trt hospital;
freq wt;
model severity(order=data) = trt hospital / link=glogit;
run;
/* Score the new data set (A) using the saved model information.
=========================================================================*/
proc logistic inmodel=model;
score data=a out=out;
run;
proc print data=out;
run;
LIMITATIONS:
CATMOD method:
In the data set to be scored, only observations whose combination of predictor values occurred in the modeled data set can be scored. The model itself is not evaluated for each observation being scored. Instead, the predicted values that were output for each observed combination (sample) in the modeled data set are applied to the matching observations in the data set to be scored. As a result, if you use the DIRECT statement in CATMOD to treat some predictors as continuous, observations with new values of those variables are not scored. For such observations, you must compute the predicted probabilities from the fitted model parameters.
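Computing probabilities from the parameters can be sketched in a DATA step. The example below is a minimal illustration, not part of the original sample: it assumes a three-level response, a single continuous predictor X, and the last response level as the reference. The data set name SCORE_MANUAL and all coefficient values are hypothetical placeholders, not estimates from the OPERATE data. In a generalized logit model, each non-reference level j has its own linear predictor, and its probability is exp(linear predictor j) divided by 1 plus the sum of the exponentiated linear predictors.

```sas
/* Sketch: predicted probabilities from generalized logit parameters.
   All parameter values are assumed for illustration only. */
data score_manual;
/* Intercept and slope for each non-reference response level */
b0_1 = 0.5;  b1_1 = -0.2;   /* logit of level 1 vs. reference level 3 */
b0_2 = 0.1;  b1_2 =  0.3;   /* logit of level 2 vs. reference level 3 */
do x = 0 to 4;
e1 = exp(b0_1 + b1_1*x);
e2 = exp(b0_2 + b1_2*x);
denom = 1 + e1 + e2;
p1 = e1/denom;              /* P(level 1) */
p2 = e2/denom;              /* P(level 2) */
p3 = 1/denom;               /* P(reference level) */
output;
end;
keep x p1 p2 p3;
run;
```

The three probabilities sum to 1 for every value of X. With real estimates, the linear predictors must follow the parameterization that the fitting procedure actually used for any categorical effects.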
LOGISTIC method:
An observation containing a CLASS variable value that does not appear in the training data cannot be scored, because the model does not contain a parameter for that level. However, unlike the CATMOD method, observations with new values of continuous predictors can be scored.
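One way to list the observations that could not be scored is to filter the scored output for missing predicted probabilities. The sketch below assumes the OUT data set created by the SCORE statement, that the predicted probability for the first response level is stored in a variable named P_none, and that unscorable observations receive missing probabilities. UNSCORED is a hypothetical data set name. Check the contents of OUT to confirm the variable names before relying on this.

```sas
/* Keep only observations without a predicted probability, such as
   those with hospital=5 or trt='e', which do not occur in OPERATE.
   P_none is an assumed variable name. */
data unscored;
set out;
if missing(P_none);
run;
proc print data=unscored;
run;
```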