Hi,
I'm fairly new to scoring with PROC LOGISTIC.
I want a dataset that contains predicted probabilities for every possible combination of the categories of the variables used in a logistic model.
I built the dataset to score as follows:
data work.immigdataset;
   do cohort2=0 to 98;
      do sex=0,1;
         do period=0,1;
            do pob_num=0 to 7;
               output;
            end;
         end;
      end;
   end;
run;
The logit model with the score statement is then:
proc logistic data=work.immigrants_from_lfs;
   class sex pob_num(ref='0') period /param=ref;
   model edunum(descending)= period sex|cohort2 pob_num|cohort2 /unequalslopes;
   weight weight / norm;
   score data=work.immigdataset out=work.scored_immig;
run;
Edunum has 3 categories and the model is an ordered logit with unequal slopes.
There are no error messages. However, every second row is left unscored (see screen shot below). How can I fix this?
I don't see any obvious problem with parameters of the logit model.
Analysis of Maximum Likelihood Estimates
Parameter | Level | edunum | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
---|---|---|---|---|---|---|---|
Intercept | | 2 | 1 | -1.7993 | 0.0814 | 489.0406 | <.0001 |
Intercept | | 1 | 1 | -0.5025 | 0.0742 | 45.9176 | <.0001 |
period | 0 | 2 | 1 | -0.2697 | 0.0127 | 451.0235 | <.0001 |
period | 0 | 1 | 1 | -0.2942 | 0.0139 | 449.2595 | <.0001 |
sex | 0 | 2 | 1 | 1.5305 | 0.0868 | 311.0672 | <.0001 |
sex | 0 | 1 | 1 | 1.4247 | 0.0855 | 277.4533 | <.0001 |
cohort2 | | 2 | 1 | 0.0190 | 0.000992 | 366.2465 | <.0001 |
cohort2 | | 1 | 1 | 0.0204 | 0.000924 | 485.8860 | <.0001 |
cohort2*sex | 0 | 2 | 1 | -0.0195 | 0.00104 | 349.6493 | <.0001 |
cohort2*sex | 0 | 1 | 1 | -0.0172 | 0.00103 | 277.7880 | <.0001 |
pob_num | 1 | 2 | 1 | -1.0595 | 0.1789 | 35.0924 | <.0001 |
pob_num | 1 | 1 | 1 | -2.3889 | 0.1596 | 224.0873 | <.0001 |
pob_num | 2 | 2 | 1 | 1.6704 | 0.1560 | 114.6657 | <.0001 |
pob_num | 2 | 1 | 1 | 1.1015 | 0.1442 | 58.3199 | <.0001 |
pob_num | 3 | 2 | 1 | 1.8591 | 0.1383 | 180.7171 | <.0001 |
pob_num | 3 | 1 | 1 | 0.4209 | 0.1334 | 9.9507 | 0.0016 |
pob_num | 4 | 2 | 1 | 0.7457 | 0.1458 | 26.1710 | <.0001 |
pob_num | 4 | 1 | 1 | -0.1724 | 0.1411 | 1.4935 | 0.2217 |
pob_num | 5 | 2 | 1 | 2.0834 | 0.2466 | 71.3532 | <.0001 |
pob_num | 5 | 1 | 1 | -0.6860 | 0.2495 | 7.5600 | 0.0060 |
pob_num | 6 | 2 | 1 | 4.3626 | 0.2696 | 261.7725 | <.0001 |
pob_num | 6 | 1 | 1 | 4.2384 | 0.4575 | 85.8354 | <.0001 |
pob_num | 7 | 2 | 1 | 2.3235 | 0.1332 | 304.4506 | <.0001 |
pob_num | 7 | 1 | 1 | 0.6312 | 0.1351 | 21.8367 | <.0001 |
cohort2*pob_num | 1 | 2 | 1 | 0.00791 | 0.00213 | 13.7568 | 0.0002 |
cohort2*pob_num | 1 | 1 | 1 | 0.0210 | 0.00193 | 118.7234 | <.0001 |
cohort2*pob_num | 2 | 2 | 1 | -0.0254 | 0.00188 | 182.9621 | <.0001 |
cohort2*pob_num | 2 | 1 | 1 | -0.0217 | 0.00173 | 156.8366 | <.0001 |
cohort2*pob_num | 3 | 2 | 1 | -0.0239 | 0.00167 | 205.5414 | <.0001 |
cohort2*pob_num | 3 | 1 | 1 | -0.0115 | 0.00161 | 51.0017 | <.0001 |
cohort2*pob_num | 4 | 2 | 1 | -0.00618 | 0.00174 | 12.5973 | 0.0004 |
cohort2*pob_num | 4 | 1 | 1 | -0.00090 | 0.00170 | 0.2826 | 0.5950 |
cohort2*pob_num | 5 | 2 | 1 | -0.0125 | 0.00287 | 18.9461 | <.0001 |
cohort2*pob_num | 5 | 1 | 1 | 0.0195 | 0.00299 | 42.8260 | <.0001 |
cohort2*pob_num | 6 | 2 | 1 | -0.0338 | 0.00324 | 109.0780 | <.0001 |
cohort2*pob_num | 6 | 1 | 1 | -0.0270 | 0.00549 | 24.1746 | <.0001 |
cohort2*pob_num | 7 | 2 | 1 | -0.0246 | 0.00162 | 230.6967 | <.0001 |
cohort2*pob_num | 7 | 1 | 1 | -0.00275 | 0.00166 | 2.7353 | 0.0982 |
What is L_EDUNUM in your screen capture? It's not in your model statement.
@PaigeMiller I_edunum is a manufactured variable that PROC LOGISTIC creates in the output scoring data set. If your response variable is named Y, you get a variable named I_Y.
Don't leave me guessing. What is the purpose of l_Y? How is it computed? What is it telling us?
Paige, the link in my response takes you to the documentation.
What version of SAS are you running? Submit
%put &=SYSVLONG4;
and paste in the result that appears in the log.
I suspect that the problem is your data. Run PROC FREQ on your pob_num variable. Do you have all 8 categories? I suspect you might have only the even categories 0, 2, 4, and 6. If you do have all eight categories of pob_num, then check the WEIGHT variable, paying attention to missing or zero values. Perhaps the odd values of pob_num all have zero or missing weights?
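A minimal sketch of those two checks, assuming the modelling data set is work.immigrants_from_lfs and the weight variable is named weight, as in the PROC LOGISTIC call in the question:

```sas
/* Check 1: which levels of pob_num actually occur in the modelling data? */
proc freq data=work.immigrants_from_lfs;
   tables pob_num / missing;
run;

/* Check 2: within each level of pob_num, look for missing weights (NMISS) */
/* and for zero or negative weights (MIN).                                 */
proc means data=work.immigrants_from_lfs n nmiss min max sum;
   class pob_num / missing;
   var weight;
run;
```

If any level of pob_num is absent from the first table, or shows a weight sum of zero in the second, that level would be a natural suspect.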
The following program runs your code on simulated data. When I run it, it produces a scoring data set for which all observations are scored. Make sure your version of SAS treats this simulated data correctly. If so, there is something wrong with your data.
/* Simulated modelling data: one observation per design cell */
data have;
   call streaminit(1);
   do cohort2=0 to 98;
      do sex=0,1;
         do period=0,1;
            do pob_num=0 to 7;
               edunum = rand("Table", 0.2, 0.5, 0.3) - 1;   /* values 0, 1, 2 */
               weight = rand("uniform");
               output;
            end;
         end;
      end;
   end;
run;

/* Scoring grid: every combination of the explanatory variables */
data immigdataset;
   do cohort2=0 to 98;
      do sex=0,1;
         do period=0,1;
            do pob_num=0 to 7;
               output;
            end;
         end;
      end;
   end;
run;

%put &=SYSVLONG4;

proc logistic data=have;
   class sex pob_num(ref='0') period /param=ref;
   model edunum(descending)= period sex|cohort2 pob_num|cohort2 /unequalslopes;
   weight weight / norm;
   score data=work.immigdataset out=work.scored_immig;
run;
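To compare this run with the run on the real data, it may help to tabulate which rows of the scored data set were left unscored. The sketch below assumes that unscored rows have a missing I_edunum in work.scored_immig; work.unscored is a hypothetical name, and the assumption should be checked against the actual output.

```sas
/* Keep only the rows that PROC LOGISTIC did not score. */
/* Assumption: I_edunum is missing on those rows.       */
data work.unscored;
   set work.scored_immig;
   if missing(I_edunum);
run;

/* See whether the unscored rows cluster in particular cells */
proc freq data=work.unscored;
   tables pob_num sex period / missing;
run;
```

With the simulated data, work.unscored should be empty if every observation is scored.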
In case it helps to find the cause of the issue: when I remove the UNEQUALSLOPES option, the scoring works. However, I need it for my model, so this is not a good solution.
In the log of the regression, I also have a warning message saying: "Negative individual predicted probabilities were identified in the final model fit. You may want to modify your UNEQUALSLOPES specification."
I did not find any information about this warning on Google, and I don't understand how a logistic model can predict negative probabilities (edunum has 3 categories: 0, 1, 2).
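One possible reading of the warning: with unequal slopes, each cumulative logit has its own coefficients, so the fitted P(edunum = 2) can exceed the fitted P(edunum >= 1) for some covariate patterns. The individual probability P(edunum = 1) is the difference of the two, so it then comes out negative. The sketch below recomputes the probabilities from the posted estimates at cohort2 = 0, sex = 1, period = 1. It assumes the usual cumulative-logit parameterization and that sex = 1 and period = 1 are the reference levels, as the estimates table suggests. The data set name work.check_probs is hypothetical.

```sas
data work.check_probs;
   /* Intercepts of the two cumulative logits (from the estimates table) */
   a2 = -1.7993;   /* logit of P(edunum = 2)  */
   a1 = -0.5025;   /* logit of P(edunum >= 1) */

   /* pob_num main effects for levels 0..7; level 0 is the reference */
   array b2[0:7] _temporary_ (0, -1.0595, 1.6704, 1.8591, 0.7457,  2.0834, 4.3626, 2.3235);
   array b1[0:7] _temporary_ (0, -2.3889, 1.1015, 0.4209, -0.1724, -0.6860, 4.2384, 0.6312);

   do pob_num = 0 to 7;
      cum2 = 1 / (1 + exp(-(a2 + b2[pob_num])));   /* P(edunum = 2)  */
      cum1 = 1 / (1 + exp(-(a1 + b1[pob_num])));   /* P(edunum >= 1) */
      p2 = cum2;
      p1 = cum1 - cum2;     /* negative when the cumulative curves cross */
      p0 = 1 - cum1;
      negative = (p1 < 0);  /* 1 flags a negative individual probability */
      output;
   end;
   keep pob_num p2 p1 p0 negative;
run;

proc print data=work.check_probs noobs;
run;
```

For example, at pob_num = 5 the two linear predictors are 0.2841 and -1.1885, so P(edunum = 2) is about 0.57 while P(edunum >= 1) is only about 0.23. If the rows flagged here line up with the unscored rows, that would point to the model specification rather than to the scoring data.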