Hello SAS experts
I'm using PROC HPGENSELECT to model heavily zero-inflated health insurance "out of pocket" (OOP) costs to health plan members. I'm puzzled why I can generate predictions with five explanatory variables but not with seven; with the second block of code below I just get blanks in the predicted column. Here is the five-predictor code, which works:
proc hpgenselect data=OOP_logistic_CD technique=congra maxiter=1000 gconv=1e-4;
   class sex(ref=first)
         ageGroup5yr(ref=last)
         char_prov(ref="Mersey-Lyell (TAS)")
         hospGroup(ref="Melbourne Endoscopy Group Pty Ltd")
         specialtyName(ref="Psychiatry") / param=ref;
   model memberOop = sex ageGroup5yr char_prov hospGroup specialtyName
         / dist=Tweedie link=log;
   /* Write DATA step scoring code to a file */
   code file='scoringparameters.txt';
run;

/* One observation to score, read with named input */
data ScoringData;
   informat sex $1. ageGroup5yr $5. hospGroup $60. specialtyName $60. char_prov $60.;
   input sex= ageGroup5yr= hospGroup= specialtyName= char_prov=;
   datalines;
sex=F ageGroup5yr=45-49 hospGroup=Ramsay Health Care specialtyName=ENT char_prov=Inner Sydney (NSW)
;
run;

/* Apply the generated scoring code to that observation */
data Scores;
   set ScoringData;
   %include 'scoringparameters.txt';
run;

proc print data=Scores;
run;
Here is the code using seven predictors, which gives only blanks in the "Scores" dataset:
proc hpgenselect data=OOP_logistic_CD technique=congra maxiter=1000 gconv=1e-4;
   class itemCatLev1Med(ref=last)
         sex(ref=first)
         ageGroup5yr(ref=last)
         char_prov(ref="Mersey-Lyell (TAS)")
         hospGroup(ref="Melbourne Endoscopy Group Pty Ltd")
         hospType(ref="Public")
         specialtyName(ref="Psychiatry") / param=ref;
   model memberOop = itemCatLev1Med sex ageGroup5yr char_prov hospGroup hospType specialtyName
         / dist=Tweedie link=log;
   /* Write DATA step scoring code to a file */
   code file='scoringparameters.txt';
run;

/* One observation to score, read with named input */
data ScoringData;
   informat itemCatLev1Med $40. sex $1. ageGroup5yr $5. char_prov $60. hospGroup $60. hospType $40. specialtyName $60.;
   input itemCatLev1Med= sex= ageGroup5yr= char_prov= hospGroup= hospType= specialtyName=;
   datalines;
itemCatLev1Med=Surgical Operations sex=F ageGroup5yr=45-49 hospGroup=Ramsay Health Care hospType=Private specialtyName=ENT char_prov=Inner Sydney (NSW)
;
run;

/* Apply the generated scoring code to that observation */
data Scores;
   set ScoringData;
   %include 'scoringparameters.txt';
run;

proc print data=Scores;
run;
I suspect this is due to some collinearity, although I don't see why that would occur here.
Thanks for any advice.
Chris
I think this may be due to missing values in hospType; when I dropped that variable, the predictions worked OK.
Regards
Chris
Does the log show anything different for the 7 variable model vs the 5 variable model?
If the variables in your MODEL statement have missing values such that every record has at least one of them missing, then there is no data left to build the model.
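A quick way to check this, as a sketch: PROC FREQ with the MISSING option lists blank values as a level of their own, and CMISS counts the missing arguments on each record. The dataset and variable names are taken from the code posted above; which variables to tabulate is only a suggestion.

```sas
/* Frequency of each level of the two added predictors, blanks included */
proc freq data=OOP_logistic_CD;
   tables itemCatLev1Med hospType / missing;
run;

/* How many records have at least one missing value among the response
   and the seven predictors (CMISS accepts character and numeric arguments) */
proc sql;
   select count(*) as n_records,
          sum(cmiss(memberOop, itemCatLev1Med, sex, ageGroup5yr, char_prov,
                    hospGroup, hospType, specialtyName) > 0) as n_incomplete
   from OOP_logistic_CD;
quit;
```

If n_incomplete is close to n_records, very little data is left for the fit.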
From the documentation:
Any observation that has missing values for the response, frequency, weight, offset, or explanatory variables is excluded from the analysis; however, missing values are valid for response and explanatory variables that are specified in the MISSING option in the CLASS statement.
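As a sketch of what that passage describes, the CLASS statement below adds MISSING so that blank values are treated as a valid level instead of excluding the record. This assumes MISSING is accepted as a global option after the slash, alongside PARAM=; please check the CLASS statement documentation for your release before relying on it. As a global option it would apply to every CLASS variable, so a blank level could end up as the reference level for variables that use ref=first or ref=last.

```sas
proc hpgenselect data=OOP_logistic_CD technique=congra maxiter=1000 gconv=1e-4;
   /* MISSING (assumed global option): keep records with blank class values */
   class itemCatLev1Med(ref=last)
         sex(ref=first)
         ageGroup5yr(ref=last)
         char_prov(ref="Mersey-Lyell (TAS)")
         hospGroup(ref="Melbourne Endoscopy Group Pty Ltd")
         hospType(ref="Public")
         specialtyName(ref="Psychiatry") / param=ref missing;
   model memberOop = itemCatLev1Med sex ageGroup5yr char_prov hospGroup hospType specialtyName
         / dist=Tweedie link=log;
   code file='scoringparameters.txt';
run;
```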
Thanks for your reply.
I couldn't spot anything in the log, for either the model fitting or the scoring step, in a 5-, 6- or 7-predictor model. With hospType (Private, Public, Daystay, Closed) in the model, the predicted out-of-pocket expense is simply blank, although the hospType parameter estimates are sensible.
However, the predictions work if I drop hospType. It's a shame there are so many missing values. This value is easy to discern and record, so I'm not sure why so many are missing. There are 5.5 million observations in the dataset, which is too many to go through and correct! I could impute all the missing values as Private, I guess (most of them would be), but this would not be entirely accurate.
Regards
Chris
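An alternative to imputing everything as Private, sketched under stated assumptions: recode blank hospType values to an explicit level of their own, so those records stay in the model without being assigned to a type they may not belong to. The dataset name OOP_recode and the label "Unknown" are made up for illustration, and the LENGTH statement assumes hospType is a character variable no longer than 40 bytes (the length used for it in the scoring data above).

```sas
data OOP_recode;
   /* Declared before SET so "Unknown" is not truncated; assumes hospType is character */
   length hospType $40;
   set OOP_logistic_CD;
   /* Give blank values their own explicit level */
   if missing(hospType) then hospType = "Unknown";
run;
```

The model would then be fitted with data=OOP_recode, and hospType=Unknown becomes a level like any other in the parameter estimates.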