The EFFECT encoding for CLASS variables provides a powerful way to use categorical data in regression analysis. Especially in the case when none of the categories qualifies to be used as the "zero" or the "baseline" category, the EFFECT encoding provides a good option for interpretable regression coefficients. In my article, Display the hidden estimate for the reference category in EFFECT coding for better interpretability, I illustrate an example for the usage of the EFFECT encoding. In this article, I'll lay out a step-by-step explanation of the macro %CALC_REFERENCE_CATEGORY.
A first version of this macro was originally published in chapter 12 of my SAS Press book, "Applying Data Science - Business Case Studies Using SAS," The macro is very helpful to automatically calculate the "hidden" value of the reference category when the EFFECT coding is used. I explain additional features of different CLASS encoding types and compare them by SAS/STAT procedure.
This article presents the enhanced version of the %CALC_REFERENCE_CATEGORY macro. The macro has been extended to work for a larger set of SAS/STAT and SAS Viya Visual Statistics procedures, and has a feature for better usability. This article is related to the these companion pieces:
Display the hidden estimate for the reference category in EFFECT coding for better interpretability
The coefficient of the REFERENCE category in EFFECT coding can been calculated by summing the coefficients of the other categories and changing the sign. This is also referred to as the negative sum of the coefficients of the other categories. While it is technically possible to perform these calculations by hand, it is more convenient and efficient to use a program to do this automatically.
The macro has the following prerequisites.
ODS OUTPUT ParameterEstimates= ParameterEstimates
ClassLevelInfo = ClassLevelInfo;
The following parameters can be specified with the macro.
See the article, Display the hidden estimate for the reference category in EFFECT coding for better interpretability.
The macro definition contains the available macro parameters and the MINOPERATOR Option.
%macro CALC_REFERENCE_CATEGORY (ParmEst = ParameterEstimates,
ClassLevels = ClassLevelInfo,
OutputDS = _ParmEst_XT_,
Proc = GLMSELECT)
/ minoperator;
The MINOPERATOR option controls whether the macro processor recognizes and evaluates the IN logical operator.
Depending on the procedure that created the parameter estimates dataset, a corresponding KEEP list is generated.
%if %upcase(&PROC.) IN (LOGSELECT PHSELECT GENSELECT ) %then %let KeepVarStat = StdErr ChiSq ProbChiSq;
%else %if %upcase(&PROC.) = LOGISTIC %then %let KeepVarStat = StdErr WaldChiSq ProbChiSq;
%else %if %upcase(&PROC.) = PHREG %then %let KeepVarStat = StdErr ChiSq ProbChiSq HazardRatio;
%else %if %upcase(&PROC.) = REGSELECT %then %let KeepVarStat = StdErr tValue Probt;
%ELSE %if %upcase(&PROC.) = GLMSELECT %then %let KeepVarStat = StandardizedEst StdErr tValue Probt;
First, a copy of the parameter estimates table is made to keep only the relevant variable and create a new variable, POSVAR, that enumerates the variables. You need this variable to be able to sort the final table in the same way as the original table.
data _ParmEst_;
length Parameter $32;
set &ParmEst;
%if %varexist(&ParmEst, step) AND %upcase(&PROC.) NE PHREG %then %do; if step ne . then delete; %end;
%if &PROC. = PHSELECT %then %do; Effect = scan(Parameter,1); %end;
%if %upcase(&PROC.) IN (GLMSELECT PHSELECT REGSELECT LOGSELECT GENSELECT) %then %do;
Parameter = strip(Parameter);
Variable = scan(parameter,1);
ClassLevel = scan(parameter,2);
keep Effect Parameter Estimate ClassLevel Variable &KeepVarStat. PosVar;
%end;
%else %if %upcase(&PROC.) = LOGISTIC %then %do;
rename ClassVal0 = ClassLevel;
Effect = Variable;
Parameter = catx(" ",effect,ClassVal0);
keep Effect Parameter Estimate ClassVal0 Variable &KeepVarStat. PosVar;
%end;
%else %if %upcase(&PROC.) = PHREG %then %do;
Variable = Parameter;
Effect = Variable;
rename ClassVal0 = ClassLevel;
Parameter = catx(" ",effect,strip(ClassVal0));
keep Effect Parameter Estimate ClassVal0 Variable &KeepVarStat. PosVar;
%end;
lag_effect = lag(effect);
if effect ne lag_effect then PosVar + 1;
run;
Note the following points from the code:
In the next step the observations, the PARAMETERESTIMATES data set are split between interval variables and class variables. In the case of a class variable, variable PARAMETER contains both the variable name and the class level. The SCAN function is used to split this value into two parts.
data _ParmEstClass_
_ParmEstInt_;
set _ParmEst_;
if ClassLevel = "" then output _ParmEstInt_;
else output _ParmEstClass_;
run;
In order to receive the names of the reference categories per variable, you need a full list of all category values. You can find this list in the CLASSLEVELSINFO table.
Note that the output structure for the CLASSLEVELSINFO table differs per procedure.
The PHREG delivers the following output structure.
The GLMSELECT, the LOGISTIC and the SAS Viya Visual Statistics procedures deliver the following output structure. You see that for each class variable, the different levels are concatenated to a string.
Next, you scan through this string and output every single class name into a new line. Note that the blank is used to separate different class levels. Thus, a prerequisite for the macro is that the categories of the CLASS variables must not contain blanks.
A ClassID variable is created that enumerates the class levels in order to be able to sort them in a later step.
Note that the macro considers the different output structures.
%if %upcase(&Proc.) IN (GLMSELECT PHSELECT REGSELECT LOGSELECT GENSELECT) %then %do;
data _ClassLevelsList_(rename=(class=Variable));
set &ClassLevels;
drop i levels values;
do i = 1 to levels;
ClassLevel = scan(values,i);
ClassID + 1;
output;
end;
run;
%end; %* GLMSELECT Mode;
%ELSE %IF %upcase(&Proc.) IN (LOGISTIC PHREG) %then %do;
data _ClassLevelsList_(rename=(Class=Variable));
set &ClassLevels;
keep Class Value ClassID;
rename Value=ClassLevel;
format class_tmp $32.;
retain class_tmp "" ClassID 0;
if Class ne "" then do;
class_tmp = class;
ClassID+1;
end;
else Class = class_tmp;
run;
%end; %* LOGISTIC/PHREG Mode;
In the next step, the CLASS statement -- and thus the table CLASSLEVELSLIST -- might contain more variables than estimated in the model and are thus contained in the final PARAMETERESTIMATES table. This might be because not all variables were included in variable selection, or the CLASS statement contained more variables than were actually referenced in the MODEL statement.
You use PROC SQL to delete observations from the CLASSLEVELSLIST. These deleted observations do not appear in the subquery that lists the distinct variable names contained in the PARAMETERESTIMATES table.
proc sql;
delete from _ClassLevelsList_
where variable not in (select distinct variable from _ParmEstClass_);
quit;
The _CLASSLEVELSLIST_ table contains the complete set of class variables with all available categories. It is now joined with a full join with the PARAMETERESTIMATES table that contains the parameter estimates for all categories, except the reference category.
proc sql;
create table _ParmEstClass_XT_
as select a.*, b.*
from _ClassLevelsList_ as a
full join _ParmEstClass_ as b
on a.variable = b.variable
and a.ClassLevel = b.ClassLevel
order by variable, classid;
quit;
proc sort data=_ParmEstClass_XT_;
by variable descending Estimate ;
run;
This table now serves as a master table, _ParmEstClass_XT_, which contains all variables and categories with the estimates from the PHREG procedure.
Now you only need to calculate the coefficients for these classes, and concatenate them with the parameter estimates for the interval variables that were split away earlier on.
The missing information for the reference categories is filled in the next step. You run the DATA step with a BY variable to treat the observations per variable as a BY group.
data _ParmEstClass_XT_(drop=cum_coeff lag_PosVar ClassID);
set _ParmEstClass_XT_;
by variable;
lag_PosVar = lag(PosVar);
if first.variable then cum_coeff = estimate;
else cum_coeff + estimate;
if estimate = . then do;
estimate = -cum_coeff;
PosVar = lag_PosVar;
end;
run;
Finally, concatenate this table with estimates of the interval variables that were split away earlier, and fill the values for the EFFECT variable and the PARAMETER variable.
data &OutputDS;
set _ParmEstClass_XT_ _ParmEstInt_;
if effect = "" then do;
effect = variable;
parameter = catx(' ',variable,classlevel); end;
drop variable classlevel;
run;
Sort the data to have the variables in the original order and print them.
proc sort data=&OutputDS; by PosVar ;run;
proc print data=&OutputDS;
var effect parameter estimate &KeepVarStat. ;
run;
Clean up the temporary data sets that were created.
proc delete data=_ParmEstClass_XT_ _ParmEstClass_ _ParmEstInt_ _ClassLevelsList_ _ParmEst_;
run;
%mend;
It's very easy to extend the functionality of SAS/STAT and SAS Visual Statistics procedures with a user-written macro to obtain additional output when the effect coding has been used.
The purpose of this exercise was to obtain interpretable results for the business user. When discussing the analysis results with your business experts, you want to be in a position to make a statement on the influence and the parameterization of the reference category as well. This provides a complete picture and is helpful for the acceptance of the model from a business point of view.
Note that the calculation of the coefficients for the reference group is only performed for interpretational purposes. For the application of the model to score new observations, these estimates must not be used as they are already implicitly considered in the model by the intercept and the other categories.
If you encounter cases with your SAS regression procedures where the %CALC_REFERENCE_CATEGORY macro does not work, contact me. Please provide the SAS code for the regression analysis, the parameter estimates dataset and the class level info dataset.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.