Re: roc table with different length than train set?

catkat96 · Posted 03-12-2020 06:27 PM

I fit a logistic regression on the training data to build a scorecard, scored the test data with it, and also requested a ROC curve and table. Here is my code.

ods noproctitle;
ods graphics / imagemap=on;

proc logistic data=WORK.TRAIN1 OUTMODEL=webwork.TRAIN1;
	class checking_1 checking_23 property_1 property_23 amount_to1k amount_1kto2k 
		amount_2kto4k purpose_04 purpose_56 purpose_139 / param=glm;
	model Good(event='1')=checking_1 checking_23 property_1 property_23 
		amount_to1k amount_1kto2k amount_2kto4k purpose_04 purpose_56 purpose_139 / 
		link=logit selection=backward slstay=0.05 hierarchy=single technique=fisher OUTROC=work.rocdata;
run;

proc logistic inmodel=webwork.TRAIN1;
	score data=work.TEST1 out=SCORED_logit1 outroc=work.rocdata;
run;

The output table (rocdata) is strange, because it has just 9 rows and not 90 (the size of the test set).

Can anyone explain to me why this is, and whether I can change it so that it shows all rows? I am looking into this because I calculated the ROC curve myself in Excel and the result is quite different (0.78 versus 0.6), so I want to figure out where the difference comes from.

EDIT: here's the log from the regression.

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 NOTE: ODS statements in the SAS Studio environment may disable some output features.
 73         
 74         /*
 75          *
 76          * Task code generated by SAS Studio 3.8
 77          *
 78          * Generated on '3/11/20, 5:31 PM'
 79          * Generated by 'sasdemo'
 80          * Generated on server 'LOCALHOST'
 81          * Generated on SAS platform 'Linux LIN X64 2.6.32-754.6.3.el6.x86_64'
 82          * Generated on SAS version '9.04.01M6P11072018'
 83          * Generated on browser 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
 83       ! Chrome/80.0.3987.132 Safari/537.36'
 84          * Generated on web client 'http://localhost:10080/SASStudio/38/main?locale=en&zone=GMT%252B00%253A00'
 85          *
 86          */
 87         
 88         ods noproctitle;
 89         ods graphics / imagemap=on;
 90         
 91         proc logistic data=WORK.TRAIN1 OUTMODEL=webwork.TRAIN1;
 92         class checking_1 checking_23 property_1 property_23 amount_to1k amount_1kto2k
 93         amount_2kto4k purpose_04 purpose_56 purpose_139 / param=glm;
 94         model Good(event='1')=checking_1 checking_23 property_1 property_23
 95         amount_to1k amount_1kto2k amount_2kto4k purpose_04 purpose_56 purpose_139 /
 96         link=logit selection=backward slstay=0.05 hierarchy=single technique=fisher OUTROC=work.rocdata ;
 97         ODS OUTPUT ParameterEstimates=logit1_estimates;
 98         run;
 
 NOTE: PROC LOGISTIC is modeling the probability that Good='1'.
 NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 0.
 NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 1.
 NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 2.
 NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 3.
 NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 4.
 NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 5.
 NOTE: Convergence criterion (GCONV=1E-8) satisfied in Step 6.
 NOTE: The data set WORK.LOGIT1_ESTIMATES has 9 observations and 8 variables.
 NOTE: There were 269 observations read from the data set WORK.TRAIN1.
 NOTE: The data set WEBWORK.TRAIN1 has 346 observations and 6 variables.
 NOTE: The data set WORK.ROCDATA has 320 observations and 8 variables.
 NOTE: PROCEDURE LOGISTIC used (Total process time):
       real time           2.82 seconds
       cpu time            1.64 seconds
       
 
 99         
 100        proc logistic inmodel=webwork.TRAIN1;
 101        score data=work.TEST1 out=SCORED_logit1 outroc=work.rocdata;
 102        run;
 
 NOTE: The data set WORK.SCORED_LOGIT1 has 90 observations and 16 variables.
 NOTE: The data set WORK.ROCDATA has 9 observations and 7 variables.
 NOTE: PROCEDURE LOGISTIC used (Total process time):
       real time           0.05 seconds
       cpu time            0.03 seconds
       
 
 102      !     
 103        
 104        
 105        
 106        
 107        OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 119

ballardw · Posted 03-12-2020 06:40 PM

You should post the log with any notes included. Copy the text from the log after re-running the regression and then paste here into a code box opened with the {I} or "running man" icon to preserve the appearance of the text.

If your training data did not have all values of the class variables that appear in the other data, that could be one issue. The combinations of values can also come into play.

catkat96 · Posted 03-12-2020 07:52 PM

Edited the original post to include the log.

And it does have all the values. The variables are all dummies and they all appear in both datasets.

ballardw · Posted 03-13-2020 10:55 AM

I would suggest running the following code:

proc freq data=work.train1;
   title "From train set";
   tables  checking_1*checking_23* property_1* property_23 *amount_to1k* amount_1kto2k*
          amount_2kto4k* purpose_04* purpose_56* purpose_139 /list missing;
run;

proc freq data=work.test1;
   title "From test set";
   tables  checking_1*checking_23* property_1* property_23 *amount_to1k* amount_1kto2k*
          amount_2kto4k* purpose_04* purpose_56* purpose_139 /list missing;
run;

If the two outputs do not show the same combinations, check whether the SCORED data only has results for the combinations that appear in both of the proc freq outputs.

I very strongly suspect that you have combinations of class variables in the TEST data that do not exist in the training set, in which case no matching score parameters were created and those observations cannot be scored.

With 10 class variables, if each variable has 2 values, that is 2^10, or 1024, combinations of values. That is hard to cover with only 269 observations in the training data set (the count shown in your log).
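
If comparing the two proc freq listings by eye is tedious, something like the following could list the combinations directly. This is only a rough sketch: it assumes the ten class variables are present under the same names in both data sets, and work.test_only_combos is a made-up output name.

```sas
/* Sketch: list class-variable combinations that occur in TEST1
   but never in TRAIN1. EXCEPT (without ALL) returns distinct rows
   of the first query that do not appear in the second.
   work.test_only_combos is a hypothetical output name. */
proc sql;
   create table work.test_only_combos as
   select checking_1, checking_23, property_1, property_23, amount_to1k,
          amount_1kto2k, amount_2kto4k, purpose_04, purpose_56, purpose_139
      from work.test1
   except
   select checking_1, checking_23, property_1, property_23, amount_to1k,
          amount_1kto2k, amount_2kto4k, purpose_04, purpose_56, purpose_139
      from work.train1;
quit;
```

If the resulting table has zero rows, every combination in the test set also appears in the training set, and unseen combinations are not the explanation.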

StatDave · Posted 03-16-2020 01:38 PM

The OUTROC= data set will only contain the distinct predicted probabilities from your data - it will not necessarily have as many observations as your input data. Note that binning (grouping) of the predicted probabilities is done. This is controlled by the ROCEPS= option.
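
A quick way to check this against the scored test data might look like the sketch below. It rests on a few assumptions: that the predicted probability of Good='1' is stored as P_1 in the SCORE OUT= data set (the usual P_<level> naming), that work.roc_test is a free data set name (chosen here so the training OUTROC= data in work.rocdata is not overwritten), and that the ROCEPS= value shown is purely illustrative.

```sas
/* Sketch: how many distinct predicted probabilities does the scored
   test set contain? P_1 is an assumed variable name; check the
   contents of SCORED_logit1 if it differs. Values that differ only
   by floating-point noise still count as distinct here. */
proc sql;
   select count(*)            as n_obs,
          count(distinct P_1) as n_distinct_probs
      from SCORED_logit1;
quit;

/* Sketch: re-score the test data, writing the ROC points to a
   separate (hypothetical) data set. FITSTAT requests fit statistics
   for the scored data; ROCEPS= sets the grouping tolerance that
   StatDave mentions (the value here is illustrative only). */
proc logistic inmodel=webwork.TRAIN1;
   score data=work.TEST1 out=SCORED_logit1
         outroc=work.roc_test fitstat roceps=1e-8;
run;
```

If n_distinct_probs comes out close to 9, the short OUTROC= table is simply one row per distinct probability, which is what you would expect when all predictors are dummies and only a few survive backward selection. The AUC in the fit statistics table for the scored data, if your release reports one there, is the number to compare with the Excel calculation.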
