I ran a logistic regression on training data to fit a scorecard, scored the test data with it, and also output a ROC curve and table. Here is my code.
ods noproctitle;
ods graphics / imagemap=on;
proc logistic data=WORK.TRAIN1 OUTMODEL=webwork.TRAIN1;
class checking_1 checking_23 property_1 property_23 amount_to1k amount_1kto2k
amount_2kto4k purpose_04 purpose_56 purpose_139 / param=glm;
model Good(event='1')=checking_1 checking_23 property_1 property_23
amount_to1k amount_1kto2k amount_2kto4k purpose_04 purpose_56 purpose_139 /
link=logit selection=backward slstay=0.05 hierarchy=single technique=fisher OUTROC=work.rocdata;
run;
proc logistic inmodel=webwork.TRAIN1;
score data=work.TEST1 out=SCORED_logit1 outroc=work.rocdata;
run;
The output table (rocdata) is strange, because it has just 9 rows and not 90 (the test set length).
Can anyone explain why this is, and whether I can change it so that it shows all rows? I am looking into this because I calculated the ROC curve myself in Excel and the result is quite different (0.78 versus 0.6), so I want to figure out where the difference comes from.
EDIT: here's the log from the regression.
1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
NOTE: ODS statements in the SAS Studio environment may disable some output features.
73
74         /*
75         *
76         * Task code generated by SAS Studio 3.8
77         *
78         * Generated on '3/11/20, 5:31 PM'
79         * Generated by 'sasdemo'
80         * Generated on server 'LOCALHOST'
81         * Generated on SAS platform 'Linux LIN X64 2.6.32-754.6.3.el6.x86_64'
82         * Generated on SAS version '9.04.01M6P11072018'
83         * Generated on browser 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
83       ! Chrome/80.0.3987.132 Safari/537.36'
84         * Generated on web client 'http://localhost:10080/SASStudio/38/main?locale=en&zone=GMT%252B00%253A00'
85         *
86         */
87
88         ods noproctitle;
89         ods graphics / imagemap=on;
90
91         proc logistic data=WORK.TRAIN1 OUTMODEL=webwork.TRAIN1;
92         class checking_1 checking_23 property_1 property_23 amount_to1k amount_1kto2k
93         amount_2kto4k purpose_04 purpose_56 purpose_139 / param=glm;
94         model Good(event='1')=checking_1 checking_23 property_1 property_23
95         amount_to1k amount_1kto2k amount_2kto4k purpose_04 purpose_56 purpose_139 /
96         link=logit selection=backward slstay=0.05 hierarchy=single technique=fisher OUTROC=work.rocdata ;
97         ODS OUTPUT ParameterEstimates=logit1_estimates;
98         run;
NOTE: PROC LOGISTIC is modeling the probability that Good='1'.
NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 0.
NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 2.
NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 3.
NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 4.
NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 5.
NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 6.
NOTE: The data set WORK.LOGIT1_ESTIMATES has 9 observations and 8 variables.
NOTE: There were 269 observations read from the data set WORK.TRAIN1.
NOTE: The data set WEBWORK.TRAIN1 has 346 observations and 6 variables.
NOTE: The data set WORK.ROCDATA has 320 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           2.82 seconds
      cpu time            1.64 seconds
99
100        proc logistic inmodel=webwork.TRAIN1;
101        score data=work.TEST1 out=SCORED_logit1 outroc=work.rocdata;
102        run;
NOTE: The data set WORK.SCORED_LOGIT1 has 90 observations and 16 variables.
NOTE: The data set WORK.ROCDATA has 9 observations and 7 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.05 seconds
      cpu time            0.03 seconds
102      !
103
104
105
106
107        OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
119
You should post the log with any notes included. Copy the text from the log after re-running the regression and then paste here into a code box opened with the {I} or "running man" icon to preserve the appearance of the text.
If your training data did not contain every value of the class variables that appears in the other data, that could be one issue. The combinations of values can also come into play.
Edited the original post to include the log.
And it does have all the values. The variables are all dummies and they all appear in both datasets.
I would suggest running the following code:
proc freq data=work.train1;
   title "From train set";
   tables checking_1*checking_23*property_1*property_23*amount_to1k*amount_1kto2k
          *amount_2kto4k*purpose_04*purpose_56*purpose_139 / list missing;
run;
proc freq data=work.test1;
   title "From test set";
   tables checking_1*checking_23*property_1*property_23*amount_to1k*amount_1kto2k
          *amount_2kto4k*purpose_04*purpose_56*purpose_139 / list missing;
run;
If the two outputs do not show the same combinations, check whether the SCORED data only has results for the combinations that appear in both of the PROC FREQ outputs.
I very strongly suspect that you have combinations of class variables in the TEST data that do not exist in the training set, in which case no matching score parameters were created and those observations cannot be scored.
With 10 class variables, if each variable has 2 values, that is 2^10, or 1024, combinations of values. That is hard to cover with only 269 observations in the training data set (the count the log reports for WORK.TRAIN1).
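One quick way to test the "cannot be scored" suspicion is to count how many test observations ended up without a predicted probability. This is only a sketch: it assumes the scored data set contains a variable named P_1 holding the predicted probability of Good='1' (the usual P_ prefix plus the response level), so check the actual variable names in SCORED_logit1 first.

```sas
/* Count scored vs. unscored test observations.                    */
/* P_1 is an assumed variable name: the predicted probability of   */
/* Good='1' in the OUT= data set from the SCORE statement.         */
proc means data=work.SCORED_logit1 n nmiss;
   var P_1;
run;
```

If NMISS is 0, all 90 test observations received a prediction, and unscored rows are not what shrinks the ROC table.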
The OUTROC= data set will only contain the distinct predicted probabilities from your data - it will not necessarily have as many observations as your input data. Note that binning (grouping) of the predicted probabilities is done. This is controlled by the ROCEPS= option.
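To see this on the data in question, the sketch below re-scores the test set, counts the distinct predicted probabilities, and recomputes the area under the ROC curve with the trapezoid rule so it can be compared with the Excel figure. Several things here are assumptions rather than anything from the original post: the data set names work.roc_test, work.roc_sorted and work.auc_check are made up for illustration, P_1 is assumed to be the predicted probability of Good='1' in the scored output, and the FITSTAT option is used on the expectation that its fit statistics table includes the AUC in this SAS release.

```sas
/* Re-score the test set. OUTROC= goes to its own (hypothetical)     */
/* data set so the training ROC in work.rocdata is not overwritten.  */
/* FITSTAT requests fit statistics for the scored data.              */
proc logistic inmodel=webwork.TRAIN1;
   score data=work.TEST1 out=work.SCORED_logit1
         outroc=work.roc_test fitstat;
run;

/* Number of distinct predicted probabilities in the test set.       */
/* P_1 is an assumed variable name; check the OUT= data set.         */
proc sql;
   select count(distinct P_1) as n_distinct_probs
   from work.SCORED_logit1;
quit;

/* Order the ROC points from (0,0) towards (1,1). */
proc sort data=work.roc_test out=work.roc_sorted;
   by _1MSPEC_ _SENSIT_;
run;

/* Trapezoid-rule AUC over the ROC points, starting at (0,0). */
data work.auc_check;
   set work.roc_sorted end=last;
   retain prev_x 0 prev_y 0;
   auc + (_1MSPEC_ - prev_x) * (_SENSIT_ + prev_y) / 2;
   prev_x = _1MSPEC_;
   prev_y = _SENSIT_;
   if last then do;
      /* Close the curve at (1,1); adds 0 if the last point is already there. */
      auc + (1 - prev_x) * (1 + prev_y) / 2;
      output;
   end;
   keep auc;
run;

proc print data=work.auc_check;
run;
```

If the count of distinct probabilities is close to the number of rows in the ROC data set, the short table is expected: with only a few binary predictors left after backward selection, the 90 test observations can share only a handful of predicted values.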