%CALC_REFERENCE_CATEGORY displays the "hidden" coefficient in EFFECT encoding for CLASS variables

1 Like

The EFFECT encoding for CLASS variables provides a powerful way to use categorical data in regression analysis. Especially in the case when none of the categories qualifies to be used as the "zero" or the "baseline" category, the EFFECT encoding provides a good option for interpretable regression coefficients. In my article, Display the hidden estimate for the reference category in EFFECT coding for better interpretability, I illustrate an example for the usage of the EFFECT encoding. In this article, I'll lay out a step-by-step explanation of the macro %CALC_REFERENCE_CATEGORY.

A first version of this macro was originally published in chapter 12 of my SAS Press book, "Applying Data Science - Business Case Studies Using SAS," The macro is very helpful to automatically calculate the "hidden" value of the reference category when the EFFECT coding is used. I explain additional features of different CLASS encoding types and compare them by SAS/STAT procedure.

This article presents the enhanced version of the %CALC_REFERENCE_CATEGORY macro. The macro has been extended to work for a larger set of SAS/STAT and SAS Viya Visual Statistics procedures, and has a feature for better usability. This article is related to the these companion pieces:

Display the hidden estimate for the reference category in EFFECT coding for better interpretability
Future article, that gives an overview over different output formats and structures of different SAS regression procedures
Encoding of CLASS Variables in Regression Analysis - Better understand the ORDINAL encoding

Advantage of an automated solution

The coefficient of the REFERENCE category in EFFECT coding can been calculated by summing the coefficients of the other categories and changing the sign. This is also referred to as the negative sum of the coefficients of the other categories. While it is technically possible to perform these calculations by hand, it is more convenient and efficient to use a program to do this automatically.

Prerequisites and limitations for the macro

The macro has the following prerequisites.

The macro %VAREXIST is used by the %CALC_REFERENCE_MACRO. It must therefore be run before the invocation of the %CALC_REFERENCE_MACRO. You can download the %VAREXIST macro in the Appendix of this article. The %VAREXIST macro has been taken from the SAS Communities Post from @Silbad.
In the model, the EFFECT coding has been used for the creation of the dummy variables.
The values of the CLASS variables must not contain blanks. For example, the category value, “Model 1,” is invalid. It needs to be transformed, for example, to “Model1” or “Model_1” before the regression analysis is run.
The dummy variables have been automatically created using the CLASS statement. This is, for example, possible in the GLMSELECT, PHREG, LOGISITC and the SAS Viya Visual Statistics procedures: LOGSELECT, GENSELECT, PHSELECT.
Note that the DMREG procedure in SAS Enterprise Miner also creates dummy variables based on EFFECT coding. However, the macro has not been tested for the output of the DMREG procedure.
The parameter estimates file should contain the p-value for each parameter. This can be requested with the option SHOWPVALUES in the MODEL statement.
The macro assumes that the input table that is used in the macro call has been created using the ODS OUTPUT statement in the respective regression procedure.
The ODS objects PARAMETERESTIMATES and CLASSLEVELINFO can be created with the following ODS OUTPUT statement in the respective regression procedure:

ODS OUTPUT ParameterEstimates= ParameterEstimates
           ClassLevelInfo    = ClassLevelInfo;

Note that the macro ignores interaction terms in the output table. This is because interaction terms are often abbreviated and cannot be reproduced by the macro from the ClassLevelsList.

Macro parameters

The following parameters can be specified with the macro.

ParmEst: The name of the data set that contains the ParameterEstimates, created with the ODS OUTPUT statement. Default = ParameterEstimates.
ClassLevels: The name of the data set that contains the ClassLevelInfo, created with the ODS OUTPUT statement. Default = ClassLevelInfo.
OutputDS: The name of the data set that contains the output data set. Default = _ParmEst_XT_
PROC: The name of the SAS procedures that have been used to perform the analysis. Note that the macro has been tested for the following SAS procedures: GLMSELECT, PHREG, LOGISITC, LOGSELECT, GENSELECT, PHSELECT

Usage example for the macro

See the article, Display the hidden estimate for the reference category in EFFECT coding for better interpretability.

Macro code explained step-by-step

Macro definition

The macro definition contains the available macro parameters and the MINOPERATOR Option.

%macro CALC_REFERENCE_CATEGORY (ParmEst     = ParameterEstimates,
                                ClassLevels = ClassLevelInfo,
                                OutputDS    = _ParmEst_XT_,
                                Proc		= GLMSELECT) 
                                / minoperator;

The MINOPERATOR option controls whether the macro processor recognizes and evaluates the IN logical operator.

Preparing the KEEP variable list

Depending on the procedure that created the parameter estimates dataset, a corresponding KEEP list is generated.

 %if %upcase(&PROC.) IN (LOGSELECT PHSELECT GENSELECT ) %then %let KeepVarStat = StdErr ChiSq ProbChiSq; 
 %else %if %upcase(&PROC.) = LOGISTIC  					%then %let KeepVarStat = StdErr WaldChiSq ProbChiSq; 
 %else %if %upcase(&PROC.) = PHREG     					%then %let KeepVarStat = StdErr ChiSq ProbChiSq HazardRatio; 
 %else %if %upcase(&PROC.) = REGSELECT 					%then %let KeepVarStat = StdErr tValue Probt;
 %ELSE %if %upcase(&PROC.) = GLMSELECT 					%then %let KeepVarStat = StandardizedEst StdErr tValue Probt;

Enumerate the effects in the data

First, a copy of the parameter estimates table is made to keep only the relevant variable and create a new variable, POSVAR, that enumerates the variables. You need this variable to be able to sort the final table in the same way as the original table.

data _ParmEst_;
 length Parameter $32;
 set &ParmEst;
 %if %varexist(&ParmEst, step) AND %upcase(&PROC.) NE PHREG %then %do; if step ne . then delete; %end;
 %if &PROC. = PHSELECT %then %do; Effect = scan(Parameter,1); %end;
 %if %upcase(&PROC.) IN (GLMSELECT PHSELECT REGSELECT LOGSELECT GENSELECT) %then %do;
	 Parameter =  strip(Parameter);
	 Variable  =  scan(parameter,1);
	 ClassLevel = scan(parameter,2);
  keep Effect Parameter Estimate ClassLevel Variable &KeepVarStat. PosVar;
%end; 
 %else %if %upcase(&PROC.) = LOGISTIC %then %do;
 	 rename ClassVal0 = ClassLevel;
	 Effect = Variable;
	 Parameter = catx(" ",effect,ClassVal0);
  keep Effect Parameter Estimate ClassVal0 Variable &KeepVarStat. PosVar;
%end; 
 %else %if %upcase(&PROC.) = PHREG %then %do;
     Variable = Parameter;
	 Effect = Variable;
 	 rename ClassVal0 = ClassLevel;
     Parameter = catx(" ",effect,strip(ClassVal0));     
     keep Effect Parameter Estimate ClassVal0 Variable &KeepVarStat. PosVar;
%end; 
 lag_effect = lag(effect);
 if effect ne lag_effect then PosVar + 1;
run;

Note the following points from the code:

The PosVar+1 expression is used to automatically initialize, increase, and retain variable PosVar. Variable PosVar is only augmented if a new value in the EFFECT variable is found in the row; otherwise the previous value is used.
To determine whether a record contains a new value in the EFFECT variable, the actual value with the lagged value is compared. This can be achieved with the LAG function.
Note however that the LAG does not produce the desired result if it is used in an IF statement. You should store the LAG value in a temporary variable prior to the IF statement and use the temporary variable for the query.
Macro %VAREXIST is used, which checks whether a certain variable is in a data set. In this case, the macro checks to see whether the PARAMETERESTIMATES table contains a STEP variable. This variable is created if the regression procedure is run with variable selection and the details of each step are requested as an option. You'll be interested in this final step for the output only. This step is denoted by STEP=. You can download the %VAREXIST macro in the Appendix of this article.

Split CLASS and INTERVAL variables

In the next step the observations, the PARAMETERESTIMATES data set are split between interval variables and class variables. In the case of a class variable, variable PARAMETER contains both the variable name and the class level. The SCAN function is used to split this value into two parts.

data _ParmEstClass_
     _ParmEstInt_;
 set _ParmEst_;
 if ClassLevel = "" then output _ParmEstInt_;
 else output _ParmEstClass_;
run;

Receiving and scanning the names of all categories

In order to receive the names of the reference categories per variable, you need a full list of all category values. You can find this list in the CLASSLEVELSINFO table.

Note that the output structure for the CLASSLEVELSINFO table differs per procedure.

The PHREG delivers the following output structure.

class phreg 1.png

The GLMSELECT, the LOGISTIC and the SAS Viya Visual Statistics procedures deliver the following output structure. You see that for each class variable, the different levels are concatenated to a string.

Next, you scan through this string and output every single class name into a new line. Note that the blank is used to separate different class levels. Thus, a prerequisite for the macro is that the categories of the CLASS variables must not contain blanks.

A ClassID variable is created that enumerates the class levels in order to be able to sort them in a later step.

Note that the macro considers the different output structures.

%if %upcase(&Proc.) IN (GLMSELECT PHSELECT REGSELECT LOGSELECT GENSELECT) %then %do;
data _ClassLevelsList_(rename=(class=Variable));
 set &ClassLevels;
 drop i levels values;
 do i = 1 to levels;
 	ClassLevel = scan(values,i);
	ClassID + 1;
  	output;
 end;
run;
%end; %* GLMSELECT Mode;
%ELSE %IF %upcase(&Proc.) IN (LOGISTIC PHREG) %then %do;
  data _ClassLevelsList_(rename=(Class=Variable));
   set &ClassLevels;
  keep Class Value ClassID;
  rename Value=ClassLevel;
  format class_tmp $32.;
  retain class_tmp "" ClassID 0;
  if Class ne "" then do; 
				class_tmp = class;
				ClassID+1;   
             end;
  else Class = class_tmp;
 run;
%end; %* LOGISTIC/PHREG Mode;

Remove unused class variables

In the next step, the CLASS statement -- and thus the table CLASSLEVELSLIST -- might contain more variables than estimated in the model and are thus contained in the final PARAMETERESTIMATES table. This might be because not all variables were included in variable selection, or the CLASS statement contained more variables than were actually referenced in the MODEL statement.

You use PROC SQL to delete observations from the CLASSLEVELSLIST. These deleted observations do not appear in the subquery that lists the distinct variable names contained in the PARAMETERESTIMATES table.

proc sql;
 delete from _ClassLevelsList_
 where variable not in (select distinct variable from _ParmEstClass_);
quit;

Creating a master table

The _CLASSLEVELSLIST_ table contains the complete set of class variables with all available categories. It is now joined with a full join with the PARAMETERESTIMATES table that contains the parameter estimates for all categories, except the reference category.

proc sql;
 create table _ParmEstClass_XT_
 as select a.*, b.*
    from _ClassLevelsList_ as a
	full join _ParmEstClass_ as b
	on a.variable = b.variable
	 and a.ClassLevel = b.ClassLevel
    order by variable, classid;
quit;

Now add a Sort to make sure that the reference category is the last record for each variable group. By default, REF = LAST ... so in that case it would work anyhow. However if you specify REF = FIRST or another REF-value, the default assumption is to given and a sorting is needed.

By sorting with the DESCENDING option for ESTIMATE, you've ordered the MISSING Value (= Reference Category) at the end of the group:

proc sort data=_ParmEstClass_XT_; 
 by variable descending Estimate ;
run;

This table now serves as a master table, _ParmEstClass_XT_, which contains all variables and categories with the estimates from the PHREG procedure.

effect coding parmest xt.PNG

Calculating the estimates for the reference categories

Now you only need to calculate the coefficients for these classes, and concatenate them with the parameter estimates for the interval variables that were split away earlier on.

The missing information for the reference categories is filled in the next step. You run the DATA step with a BY variable to treat the observations per variable as a BY group.

data _ParmEstClass_XT_(drop=cum_coeff lag_PosVar ClassID);
 set _ParmEstClass_XT_;
 by variable;
 lag_PosVar = lag(PosVar);
 if first.variable then cum_coeff = estimate;
 else  cum_coeff + estimate;
 if estimate = . then do; 
               estimate = -cum_coeff;
			   PosVar = lag_PosVar;
			end;
run;

To receive the coefficient of the reference category, you just need to sum up the parameters of the dummy variables per variable and change the sign.
For the first observation of every BY group, you initialize the cumulative coefficient with the estimate.
For all other observations in the BY group you just sum up the estimates. This is done with the CUM_COEFF + ESTIMATE statement.
In the last line denoted by a missing value for EFFECT, you insert the cumulated value of all coefficients with an inverted sign.
You also insert the lagged value of PosVar in this line to have nonmissing values in this column. Again you store this lagged value in a separate variable in order not to use the LAG function in an IF Statement.

Creating the final data set

Finally, concatenate this table with estimates of the interval variables that were split away earlier, and fill the values for the EFFECT variable and the PARAMETER variable.

data &OutputDS;
 set _ParmEstClass_XT_ _ParmEstInt_;
 if effect = "" then do; 
              effect = variable;
			  parameter = catx(' ',variable,classlevel);     end;
 drop variable classlevel;
run;

Sort the data to have the variables in the original order and print them.

proc sort data=&OutputDS; by PosVar ;run;
proc print data=&OutputDS; 
 var effect parameter estimate &KeepVarStat. ;
run;

Clean up the temporary data sets that were created.

proc delete data=_ParmEstClass_XT_ _ParmEstClass_ _ParmEstInt_ _ClassLevelsList_ _ParmEst_; 
run;

%mend;

Summary

It's very easy to extend the functionality of SAS/STAT and SAS Visual Statistics procedures with a user-written macro to obtain additional output when the effect coding has been used.

The purpose of this exercise was to obtain interpretable results for the business user. When discussing the analysis results with your business experts, you want to be in a position to make a statement on the influence and the parameterization of the reference category as well. This provides a complete picture and is helpful for the acceptance of the model from a business point of view.

Note that the calculation of the coefficients for the reference group is only performed for interpretational purposes. For the application of the model to score new observations, these estimates must not be used as they are already implicitly considered in the model by the intercept and the other categories.

If you encounter cases with your SAS regression procedures where the %CALC_REFERENCE_CATEGORY macro does not work, contact me. Please provide the SAS code for the regression analysis, the parameter estimates dataset and the class level info dataset.