
PROC HPGENSELECT: Categorical Variable: Use the Level with the Highest Exposure as the Base

Bear85 — Thu, 10 Nov 2022 23:57:52 GMT

Suppose I have an insurance dataset with two categorical predictors: Gender (F/M) and Credit (A/B/C/D/E).
I also have an exposure variable, Days, that I will use as the weight in PROC HPGENSELECT.

PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
     CLASS Gender Credit;  
     MODEL Loss = Gender Credit / dist= Tweedie (p=1.6) link=log;
     WEIGHT Days;
     ODS OUTPUT ParameterEstimates= PEs;
RUN;
PROC PRINT DATA = PEs; RUN;

From "37108 - Setting reference levels for CLASS predictor variables" (sas.com), I know that by default the levels are sorted in ascending alphanumeric order and the last level is used as the reference, so M becomes the base level for Gender and E becomes the base level for Credit.

However, measured by the exposure variable Days, the prevalent levels are Gender = F and Credit = B.

For example, I can use PROC SUMMARY to determine the prevalent class for each predictive variable:

PROC SUMMARY data=InputData SUM PRINT MISSING;
     CLASS Gender;
     VAR Days;
RUN;

... and then specify the preferred reference levels in the CLASS statement:

PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
     CLASS Gender(ref = "F") Credit(ref = "B");  
     MODEL Loss = Gender Credit / dist= Tweedie (p=1.6) link=log;
     WEIGHT Days;
     ODS OUTPUT ParameterEstimates= PEs;
RUN;
PROC PRINT DATA = PEs; RUN;

If I have 10 more categorical predictors, is there an elegant way to avoid PROC SUMMARY, pass the exposure variable Days to PROC HPGENSELECT, and have PROC HPGENSELECT use, for each categorical predictor, the level with the highest exposure as the base?

Thanks for the insights!

Re: PROC HPGENSELECT: Categorical Variable: Use the Level with the Highest Exposure as the Base

StatDave — Sun, 27 Nov 2022 02:38:01 GMT

See the description of the CLASS statement options in the GENMOD documentation. You can specify ORDER=FREQ and DESCENDING as global options (following a slash in the CLASS statement) to order the levels by ascending frequency, so that the most frequent level comes last and becomes the reference level.
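As a minimal sketch of that suggestion, applied to the model from the original post: it assumes PROC HPGENSELECT accepts ORDER= and DESCENDING after the slash in the same way as described for GENMOD, and that the default reference level is the last one. Note that the ordering is based on observation counts, not on the WEIGHT variable.

```sas
PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
     /* Options after the slash apply to every CLASS variable.            */
     /* ORDER=FREQ sorts levels by descending count; DESCENDING reverses  */
     /* that, so the most frequent level ends up last (the reference).    */
     CLASS Gender Credit / ORDER=FREQ DESCENDING;
     MODEL Loss = Gender Credit / dist=Tweedie(p=1.6) link=log;
     WEIGHT Days;
     ODS OUTPUT ParameterEstimates=PEs;
RUN;
```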

Re: PROC HPGENSELECT: Categorical Variable: Use the Level with the Highest Exposure as the Base

Bear85 — Thu, 29 Dec 2022 01:04:41 GMT

Thanks for your response, StatDave! Yes, the ORDER=FREQ and DESCENDING options in the CLASS statement (CLASS Statement :: SAS/STAT(R) 12.3 User's Guide: High-Performance Procedures) would work if I wanted to select the base level by the highest frequency of Gender. However, I need to consider a second variable, Days, to determine the prevalent class. For example, I'd like "F" to be the base class for Gender because it has the higher sum(Days), even though "M" has the higher _FREQ_:

Obs	Gender	_FREQ_	Days
1	F	4,000	810,000
2	M	4,821	790,560

In my case, I decided to continue to use the approach from the original post: PROC SUMMARY to determine the prevalent class for each predictive variable, and then specify the preferred reference levels in the CLASS statement.

Re: PROC HPGENSELECT: Categorical Variable: Use the Level with the Highest Exposure as the Base

sbxkoenk — Thu, 29 Dec 2022 09:14:38 GMT

Hello @Bear85 ,

I see ...

Note that you can do all of that (PROC SUMMARY + PROC HPGENSELECT, with proper base levels for the CLASS variables) in ONE GO, without any manual intervention!

You can do that with some macro coding or with data-driven code generation in a data step.
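One way the macro approach could look is sketched below. The macro name build_class_refs, its parameters, and the macro variable class_spec are hypothetical names introduced for illustration; they are not part of SAS or of anything posted in this thread. The sketch assumes character CLASS variables without user-defined formats and level values free of quotes or ampersands. For each variable it sums the exposure by level, takes the level with the largest total, and appends variable(ref="level") to a macro variable that is then used in the CLASS statement.

```sas
/* Hypothetical helper: for each CLASS variable, find the level with the  */
/* largest total exposure and build a CLASS specification with REF=.      */
%macro build_class_refs(data=, classvars=, exposure=, outvar=class_spec);
    %local i n var ref_level;
    %global &outvar;
    %let &outvar = ;
    %let n = %sysfunc(countw(&classvars, %str( )));

    %do i = 1 %to &n;
        %let var = %scan(&classvars, &i, %str( ));
        %let ref_level = ;

        proc sql noprint;
            /* INTO with a single macro variable keeps only the first row, */
            /* which is the level with the highest summed exposure.        */
            select level into :ref_level trimmed
            from (select &var as level, sum(&exposure) as total_exposure
                  from &data
                  where not missing(&var)
                  group by &var)
            order by total_exposure desc;
        quit;

        /* Append e.g. Gender(ref="F") to the accumulated specification.   */
        %let &outvar = &&&outvar &var(ref="&ref_level");
    %end;
%mend build_class_refs;

%build_class_refs(data=InputData, classvars=Gender Credit, exposure=Days);
%put NOTE: CLASS specification: &class_spec;

PROC HPGENSELECT data=InputData FCONV=1E-8 MAXITER=100 ITSUMMARY;
     CLASS &class_spec;
     MODEL Loss = Gender Credit / dist=Tweedie(p=1.6) link=log;
     WEIGHT Days;
     ODS OUTPUT ParameterEstimates=PEs;
RUN;
```

With the Gender totals quoted earlier in the thread (810,000 days for F versus 790,560 for M), this sketch would be expected to pick F as the reference for Gender. Adding more predictors only requires extending the classvars= list and the MODEL statement.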

Good luck,

Koen
