Suppose I have an insurance dataset with two categorical predictor variables: Gender (F/M) and Credit (A/B/C/D/E).
I also have an exposure variable, Days, that I will use as the weight in PROC HPGENSELECT.
PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
CLASS Gender Credit;
MODEL Loss = Gender Credit / dist= Tweedie (p=1.6) link=log;
WEIGHT Days;
ODS OUTPUT ParameterEstimates= PEs;
RUN;
PROC PRINT DATA = PEs; RUN;
From 37108 - Setting reference levels for CLASS predictor variables (sas.com) I know that by default the levels are sorted in ascending alphanumeric order and the last level is used as the reference, so M will become the base level for Gender and E will become the base level for Credit.
However, measured by the exposure variable Days, the prevalent levels are Gender = F and Credit = B.
For example, I can use PROC SUMMARY to determine the prevalent class for each predictive variable:
PROC SUMMARY data=InputData SUM PRINT MISSING;
CLASS Gender;
VAR Days;
RUN;
... and then specify the preferred reference levels in the CLASS statement:
PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
CLASS Gender(ref = "F") Credit(ref = "B");
MODEL Loss = Gender Credit / dist= Tweedie (p=1.6) link=log;
WEIGHT Days;
ODS OUTPUT ParameterEstimates= PEs;
RUN;
PROC PRINT DATA = PEs; RUN;
If I have 10 more categorical predictor variables, is there an elegant way to avoid PROC SUMMARY, pass the exposure variable Days to PROC HPGENSELECT, and have PROC HPGENSELECT use, for each categorical predictor, the level with the highest exposure as the base?
Thanks for the insights!
Thanks for your response, StatDave! Yes, the ORDER=FREQ and DESCENDING options in the CLASS statement (CLASS Statement :: SAS/STAT(R) 12.3 User's Guide: High-Performance Procedures) would work if I wanted to select the base level using the highest frequency of Gender. However, I need to consider a second variable, Days, to determine the prevalent class. For example, I'd like "F" to be the base class for Gender because it has the higher sum(Days), even though "M" has the higher _FREQ_:
Obs | Gender | _FREQ_ | Days |
1 | F | 4,000 | 810,000 |
2 | M | 4,821 | 790,560 |
In my case, I decided to continue to use the approach from the original post: PROC SUMMARY to determine the prevalent class for each predictive variable, and then specify the preferred reference levels in the CLASS statement.
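For what it's worth, the PROC SUMMARY step does not have to be repeated per predictor. A minimal sketch, assuming the same InputData, Gender, Credit, and Days as above (the output dataset name ExposureByLevel and the column name TotalDays are made up for illustration): WAYS 1 requests only the one-way summaries, so every CLASS variable gets its own exposure table in a single pass.

```sas
/* One-way exposure totals for all CLASS predictors in a single pass.    */
/* Without the MISSING option, rows with a missing value in any CLASS    */
/* variable are excluded from every summary.                             */
proc summary data=InputData;
   class Gender Credit;
   ways 1;                                   /* one-way tables only      */
   var Days;
   output out=ExposureByLevel sum=TotalDays; /* hypothetical names       */
run;

/* Within each one-way table (_TYPE_), list the highest-exposure level first */
proc sort data=ExposureByLevel;
   by _type_ descending TotalDays;
run;

proc print data=ExposureByLevel; run;
```

The first row within each _TYPE_ group is then the level to put in ref= for that variable.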
Hello @Bear85 ,
I see ...
Note that you can do all of that (PROC SUMMARY + PROC HPGENSELECT with proper base levels for the CLASS variables) in ONE GO, without any manual intervention!
You can do that with some macro coding or with data-driven code generation in a data step.
Good luck,
Koen
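The macro route Koen mentions could look roughly like the sketch below. It is only an illustration under some assumptions: the macro name fit_with_exposure_ref and its parameters are hypothetical, the CLASS variables are character (as Gender and Credit are here), and their levels contain no double quotes, ampersands, or percent signs. PROC SQL stands in for PROC SUMMARY to find the level with the largest sum of the weight variable; ties are broken arbitrarily. Missing levels are skipped one variable at a time, which is not quite the same as dropping every row with any missing CLASS value.

```sas
%macro fit_with_exposure_ref(data=, classvars=, weight=);
   %local i var ref maxw classstmt;
   %let classstmt = ;
   %let i = 1;
   %let var = %scan(&classvars, &i, %str( ));

   %do %while (%length(&var) > 0);
      /* Rows are ordered by total weight, largest first; INTO without   */
      /* a range keeps only the first row, i.e. the prevalent level.     */
      proc sql noprint;
         select &var, sum(&weight) as total
            into :ref trimmed, :maxw trimmed
         from &data
         where not missing(&var)
         group by &var
         order by total desc;
      quit;

      /* Append e.g. Gender(ref="F") to the CLASS specification */
      %let classstmt = &classstmt &var(ref="&ref");

      %let i = %eval(&i + 1);
      %let var = %scan(&classvars, &i, %str( ));
   %end;

   /* Write the generated CLASS specification to the log for checking */
   %put NOTE: generated CLASS &classstmt;

   /* Same model as in the original post, with data-driven base levels */
   proc hpgenselect data=&data fconv=1e-8 maxiter=100 itsummary;
      class &classstmt;
      model Loss = &classvars / dist=Tweedie(p=1.6) link=log;
      weight &weight;
      ods output ParameterEstimates=PEs;
   run;
%mend fit_with_exposure_ref;

%fit_with_exposure_ref(data=InputData, classvars=Gender Credit, weight=Days);
```

Adding the other categorical predictors then only means extending the classvars= list; the log line shows which base levels were picked before you trust the parameter estimates.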