PROC HPGENSELECT: Categorical Variable: Use the Level with the Highest Exposure as the Base

Bear85 · Posted 11-10-2022 06:57 PM

Suppose I have an insurance dataset with two categorical predictor variables: Gender (F/M) and Credit (A/B/C/D/E).
I also have an exposure variable, Days, that I will use as the weight in PROC HPGENSELECT.

PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
     CLASS Gender Credit;  
     MODEL Loss = Gender Credit / dist= Tweedie (p=1.6) link=log;
     WEIGHT Days;
     ODS OUTPUT ParameterEstimates= PEs;
RUN;
PROC PRINT DATA = PEs; RUN;

From 37108 - Setting reference levels for CLASS predictor variables (sas.com) I know that by default the levels are arranged in ascending alphanumeric order and the last level is used as the reference, so M will become the base level for Gender, and E will become the base level for Credit.

However, measured by total exposure (the sum of Days), the prevalent levels are Gender = F and Credit = B.

For example, I can use PROC SUMMARY to determine the prevalent class for each predictive variable:

PROC SUMMARY data=InputData SUM PRINT MISSING;
     CLASS Gender;
     VAR Days;
RUN;

... and then specify the preferred reference levels in the CLASS statement:

PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
     CLASS Gender(ref = "F") Credit(ref = "B");  
     MODEL Loss = Gender Credit / dist= Tweedie (p=1.6) link=log;
     WEIGHT Days;
     ODS OUTPUT ParameterEstimates= PEs;
RUN;
PROC PRINT DATA = PEs; RUN;

If I have 10 more categorical predictor variables, is there an elegant way to avoid PROC SUMMARY, pass the exposure variable Days to PROC HPGENSELECT, and ask PROC HPGENSELECT to use, for each categorical predictor, the level with the highest exposure as the base?

Thanks for the insights!

StatDave · Posted 11-26-2022 09:38 PM

See the description of the options in the CLASS statement in the GENMOD documentation. You can specify the ORDER=FREQ and DESCENDING options as global options (following a slash in the CLASS statement) to order the levels by ascending frequency, so that the most frequent level comes last and becomes the reference level.
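
A minimal sketch of that suggestion, applied to the model from the original post. It assumes the global options are accepted after the slash as described above. Note that ORDER=FREQ ranks levels by observation count, not by the sum of a weight variable such as Days.

```sas
proc hpgenselect data=InputData fconv=1E-8 maxiter=100 itsummary;
    /* Options after the slash apply to every CLASS variable.        */
    /* ORDER=FREQ puts the most frequent level first; DESCENDING     */
    /* reverses that, so the most frequent level ends up last and is */
    /* used as the reference level.                                  */
    class Gender Credit / order=freq descending;
    model Loss = Gender Credit / dist=Tweedie(p=1.6) link=log;
    weight Days;
    ods output ParameterEstimates=PEs;
run;
```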

Bear85 · Posted 12-28-2022 08:04 PM

Thanks for your response, StatDave! Yes, the ORDER=FREQ and DESCENDING options in the CLASS statement (see "CLASS Statement" in the SAS/STAT(R) 12.3 User's Guide: High-Performance Procedures) would work if I wanted to select the base level by the highest frequency of Gender. However, I need to consider a second variable, Days, to determine the prevalent class. For example, I'd like "F" to be the base level for Gender because it has the higher sum(Days), even though "M" has the higher _FREQ_:

Obs	Gender	_FREQ_	Days
1	F	4,000	810,000
2	M	4,821	790,560

In my case, I decided to continue to use the approach from the original post: PROC SUMMARY to determine the prevalent class for each predictive variable, and then specify the preferred reference levels in the CLASS statement.

sbxkoenk · Posted 12-29-2022 04:14 AM

Hello @Bear85 ,

I see ...

Note that you can do all of that (PROC SUMMARY + PROC HPGENSELECT, with proper base levels for the CLASS variables) in ONE GO, without any manual intervention!

You can do that with some macro coding or with data-driven code generation in a data step.
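
The macro route could look roughly like the sketch below. It has not been run against the poster's data. The macro name BuildClassStmt, the macro variable classStmt, the work dataset _expo, and the column TotalExposure are all hypothetical names chosen for illustration. The sketch assumes that level values contain no quotes, ampersands, or percent signs, and that ties in total exposure may be broken arbitrarily.

```sas
/* Hypothetical helper: builds the body of a CLASS statement in which  */
/* each variable's reference level is the level with the largest total */
/* of the weight (exposure) variable.                                  */
%macro BuildClassStmt(data=, vars=, weight=);
    %local i var ref;
    %global classStmt;
    %let classStmt = ;
    %do i = 1 %to %sysfunc(countw(&vars, %str( )));
        %let var = %scan(&vars, &i, %str( ));

        /* Total exposure per level of the current CLASS variable */
        proc summary data=&data nway;
            class &var;
            var &weight;
            output out=_expo (drop=_TYPE_ _FREQ_) sum=TotalExposure;
        run;

        /* Level with the highest total exposure goes first */
        proc sort data=_expo;
            by descending TotalExposure;
        run;

        /* Store the formatted value of that level in &ref */
        data _null_;
            set _expo (obs=1);
            call symputx('ref', vvalue(&var), 'L');
        run;

        %let classStmt = &classStmt &var(ref="&ref");
    %end;

    /* Clean up the temporary dataset */
    proc datasets library=work nolist;
        delete _expo;
    quit;
%mend BuildClassStmt;

%BuildClassStmt(data=InputData, vars=Gender Credit, weight=Days);
%put NOTE: Generated CLASS specification: &classStmt;

proc hpgenselect data=InputData fconv=1E-8 maxiter=100 itsummary;
    class &classStmt;
    model Loss = Gender Credit / dist=Tweedie(p=1.6) link=log;
    weight Days;
    ods output ParameterEstimates=PEs;
run;
```

With more predictors, only the vars= list and the MODEL statement change. The %put line writes the generated CLASS specification to the log so the chosen reference levels can be checked before trusting the parameter estimates.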

Good luck,

Koen

bik01 · Posted 12-29-2022 05:36 AM

The level should match the standard as bik is matching the standard of market as a conversational tool
