Hi,

I have built a gradient boosting model in SAS Enterprise Miner 13.1 on data with a 1.8% event rate (the target is a binary variable). The model results are good, and I wanted to test the model on an out-of-time sample, so I applied the GB scoring code to a data set generated from a different time frame.

After running the scoring code, I wanted to check whether rank ordering still holds. (I am not sure whether this is expected for machine learning models; it is done for traditional logit models as a stability check.)

/* Copy the scored out-of-time data */
data temp1;
  set Scored_gb_dataset;
run;

/* Sort from highest to lowest predicted event probability */
proc sort data=temp1;
  by descending EM_EVENTPROBABILITY;
run;

/* Assign deciles by observation count; decile 1 holds the highest scores */
data temp2 (drop = i count);
  set temp1 nobs = size;
  count + 1;
  do i = 1 to 10;
    if (i-1) * (size/10) < count <= i * (size/10) then decile = i;
  end;
run;

/* Cross-tabulate decile by actual outcome */
proc freq data = temp2 formchar = ' ';
  tables decile * actual_target / nocum norow nocol nopercent;
run;

I sorted the data by EM_EVENTPROBABILITY, created deciles based on the number of observations, and checked the number of actual responders in each decile to see the rank ordering. It breaks at the 4th decile. However, the model does capture ~75% of events in the top 3 deciles.

Usually, for classification models like decision trees, observations are classified into High/Medium/Low risk segments, and the events captured by these H/M/L segments could indicate validity on out-of-time validation. But here, I think, a probability is assigned to each observation or ID.

Should we expect rank ordering to hold on out-of-time samples for machine learning classification models?

Appreciate your help/thoughts on this.
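
For reference, the same decile check could be sketched more compactly with PROC RANK and PROC SQL. This is only a rough sketch under assumptions: it assumes actual_target is numeric and coded 0/1, and the names ranked, decile_summary, decile_lift and total_events are made up for illustration. PROC RANK keeps tied probabilities in the same group, so decile sizes may differ slightly from the count-based split in the code above.

```sas
/* Assign decile groups by predicted probability.
   With DESCENDING, group 0 holds the highest scores and group 9 the lowest. */
proc rank data=Scored_gb_dataset out=ranked groups=10 descending;
  var EM_EVENTPROBABILITY;
  ranks decile;
run;

proc sql noprint;
  /* Total number of events (assumes actual_target is numeric 0/1) */
  select sum(actual_target) into :total_events trimmed
  from ranked;

  /* Observations, events, observed event rate and mean score per decile */
  create table decile_summary as
  select decile,
         count(*)                  as n_obs,
         sum(actual_target)        as n_events,
         mean(actual_target)       as event_rate,
         mean(EM_EVENTPROBABILITY) as avg_pred_prob
  from ranked
  group by decile
  order by decile;
quit;

/* Running event total and cumulative share of events captured */
data decile_lift;
  set decile_summary;
  cum_events + n_events;
  cum_capture = cum_events / &total_events;
run;

proc print data=decile_lift noobs;
run;
```

If event_rate falls from one decile to the next, rank ordering holds. The cum_capture value in the third row is the share of events captured in the top 3 deciles.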