Encoding of CLASS Variables in Regression Analysis - Better understand the ORDINAL encoding

In my last presentation at the German SAS User Conference (KSFE) I presented tips and tricks for how you can better communicate your machine learning results. One of these ideas deals with the interpretation of the coefficients of a CLASS variable in regression analysis when the EFFECT encoding has been used. I will share this approach and the respective macro in my next blog on SAS Communities.

After the conference, a customer reached out to me and asked me about the interpretation of the ORDINAL encoding of CLASS variables as described in the SAS documentation for SAS9 procedures and for SAS Viya procedures.

She was particularly interested in the interpretation of the monotonic effect. The documentation for the ORDINAL encoding says: "When the parameters have the same sign, the effect is monotonic across the levels."

Based on a simple scenario on the influence of the ordinal variable "user rating" on "book sales numbers," we discussed possible effect scenarios and the respective parameter values. In this blog, I want to share the example and the SAS code with you.

Prepare the Book Sales Demo Data

You create demo data for this example with a SAS DATA step and the OUTPUT statement.

data BookSales;
 call streaminit(21980);
 do i = 1 to 120;
  Rating = 1;
  UnitsSold = round(abs(rand('Normal',50,25)));
  output;  
  Rating = 2;
  UnitsSold = round(abs(rand('Normal',60,30)));
  output;
  Rating = 3;
  UnitsSold = round(abs(rand('Normal',80,35)));
  output;
  Rating = 4;
  UnitsSold = round(abs(rand('Normal',95,30)));
  output;
  Rating = 5;
  UnitsSold = round(abs(rand('Normal',105,25)));
  output;
 end;
 drop i;
run;

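The RAND function with the 'Normal' distribution argument generates random values from a normal distribution with the specified mean and standard deviation. The ABS and ROUND functions turn these values into non-negative whole numbers.

The CALL STREAMINIT routine sets the seed of the random number generator, so the same data are produced on every run.

The resulting dataset contains the variables RATING and UNITSSOLD, with 120 observations for each rating level.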

Display and describe your analysis data

You use the SGPLOT procedure to create a box plot of your analysis data.

proc sgplot data=BookSales;
 title Sold Copies by Rating;
 vbox UnitsSold / group=Rating;
run;

You can see that the number of units sold increases with the user rating.

The MEANS procedure shows the average number of units sold by user rating.

proc means data=BookSales mean maxdec=1;
 title Sold Copies (average) by Rating;
 class Rating;
 var UnitsSold;
run;

Use the GLMSELECT procedure to analyze the relationship between Units Sold and the Rating

In the linear model, you have different options for how to use the rating variable.

You could use it as an interval variable. This means you treat the rating as if it were measured on an interval scale, where equal differences in the value mean equal differences in what is measured. This is, however, not necessarily the case with user ratings. For the user, the difference of 2 rating points between 1 star and 3 stars is not necessarily the same as the difference between 3 stars and 5 stars. If you want to model the rating as an interval variable, you just specify the MODEL statement and omit the CLASS statement.

model UnitsSold = Rating;
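
For completeness, a full procedure call around this MODEL statement might look like the following sketch. The SELECTION=NONE option and the title text are additions made here for illustration; they are not part of the original example.

```sas
/* Sketch: Rating as an interval (continuous) predictor.          */
/* No CLASS statement, so a single slope is estimated for Rating. */
proc glmselect data=BookSales;
 title "Linear Model: UnitsSold = Rating (interval)";
 /* SELECTION=NONE fits the model exactly as specified */
 model UnitsSold = Rating / selection=none;
run;
title;
```

With this specification, the model assumes the same change in units sold for every one-star step.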

Alternatively, you can use it as a categorical variable. In this case, you specify it in the CLASS statement, where you decide whether to treat it as a nominal or as an ordinal variable. You use one of the following statements:

CLASS Rating / param=ORDINAL;    /* ordinal */
CLASS Rating / param=REFERENCE;  /* nominal; other options are EFFECT, GLM, ... */

Using the ORDINAL encoding

Let's now have a look at the result if you treat the rating as an ordinal variable, which reflects the business meaning of the variable. You run the GLMSELECT procedure with the following code.

proc glmselect data=BookSales;
 title Linear Model: UnitsSold = Rating;
 class Rating / param=ordinal;
 model UnitsSold = Rating;
run;

The SAS documentation illustrates the values of the dummy variables for the different encodings. With the ORDINAL encoding, the dummy variables are created as follows (note that this example is not based on the 1-5 star rating data, but on other data with the values 1, 2, 5, and 7):

[Image: ordinal coding.PNG]
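
To see the same coding pattern for the 1-5 star rating data, you can build the dummy variables by hand. The following is a minimal sketch: the dataset names BookSales_Coded and CodingPattern and the variable names Rating2 to Rating5 are chosen here for illustration, and the SELECTION=NONE option is an addition.

```sas
/* Build the ORDINAL ("thermometer") dummy variables by hand. */
data BookSales_Coded;
 set BookSales;
 array R{2:5} Rating2-Rating5;
 do k = 2 to 5;
  R{k} = (Rating >= k);   /* 1 once level k is reached, else 0 */
 end;
 drop k;
run;

/* One row per rating level shows the coding pattern. */
proc sort data=BookSales_Coded out=CodingPattern(drop=UnitsSold) nodupkey;
 by Rating;
run;

proc print data=CodingPattern noobs;
run;

/* Fit the hand-coded dummies as plain numeric regressors. */
proc glmselect data=BookSales_Coded;
 model UnitsSold = Rating2 Rating3 Rating4 Rating5 / selection=none;
run;
```

Each level switches on one more dummy variable than the level below it, so the coefficient of a dummy variable only captures the step from the preceding level. The hand-coded model should reproduce the estimates of the PARAM=ORDINAL model.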

The parameter estimates for the model that you receive from the GLMSELECT procedure look as follows:

[Image: parm 1.PNG]

If you compare the parameter estimates with the output of the means procedure above, you see that you can interpret them as follows:

the INTERCEPT is the average number of copies sold when RATING = 1
the estimates for RATING 2 to RATING 5 are the amounts that are added each time the next higher rating is reached

Finding: the parameter estimates based on the ORDINAL encoding can be interpreted as the marginal effect on the outcome when a certain level on the ordinal scale is reached, that is, as the difference from the level directly below it. In the case of the book sales, the marginal effect of moving from a 2-star rating to a 3-star rating is 21 units.
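
You can check this interpretation directly against the group means. The sketch below computes the step between adjacent rating levels; the dataset names RatingMeans and RatingSteps and the variable names MeanUnits and Step are chosen here for illustration.

```sas
/* Average units sold per rating level, written to a dataset. */
proc means data=BookSales noprint nway;
 class Rating;
 var UnitsSold;
 output out=RatingMeans(drop=_type_ _freq_) mean=MeanUnits;
run;

/* Step from the preceding level; missing for Rating = 1. */
data RatingSteps;
 set RatingMeans;
 Step = dif(MeanUnits);
run;

proc print data=RatingSteps noobs;
run;
```

Because the model contains only the one CLASS variable, its fitted values are the group means, so the Step values should match the estimates for the rating levels 2 to 5, and the mean for Rating = 1 should match the intercept.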

Using the REFERENCE encoding

If you want to use the REFERENCE encoding instead of the ORDINAL encoding, you simply specify the code as follows:

proc glmselect data=BookSales;
 class Rating / param=reference ref=first;
 model UnitsSold = Rating;
run;

Note that in this example, the reference category is specified explicitly with the REF= option. Here the first level (the lowest rating) is used as the reference category. By default, the last level (the highest rating) would be used.

[Image: parm refcoding 1.PNG]

You see that the intercept is identical to the one from the ORDINAL encoding. The coefficients for RATING 2 to RATING 5 are now interpreted as the difference in the outcome between the reference category (RATING = 1) and the respective level on the ordinal scale.

From a business point of view, this encoding makes sense if you are not interested in the marginal difference from one rating to the next, but want to compare every level with one specific value of the CLASS variable. This reference could, for example, be the highest or the lowest value.
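
The reference does not have to be the first or the last level. As a sketch, the following call uses the 3-star rating as the reference category; the choice of level 3, the title text, and the SELECTION=NONE option are assumptions made here for illustration.

```sas
/* Sketch: compare every rating level with the 3-star rating. */
proc glmselect data=BookSales;
 title "Linear Model: UnitsSold = Rating (reference = 3 stars)";
 class Rating(ref='3') / param=reference;
 model UnitsSold = Rating / selection=none;
run;
title;
```

In this case, the intercept is the average number of units sold for a 3-star rating, and the coefficients of the lower ratings are expected to be negative if sales rise with the rating.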

In chapter 12 of my SAS Press book, "Applying Data Science - Business Case Studies Using SAS," I explain additional features of different CLASS encoding types and compare them by SAS/STAT procedure. In this chapter, I also introduce the macro %CALC_REFERENCE_CATEGORY which is very helpful to automatically calculate the "hidden" value of the reference category when the EFFECT coding is used. I will introduce this macro in a future blog contribution here on SAS Communities.

Re-Run the Analysis with a Non-Monotonic Effect

In the conversation with my client, we also looked at a scenario in which the effect is not monotonic, and at the resulting regression coefficients. You use the following statements to create a new dataset with a non-monotonic effect.

data BookSales_NonMonotonic;
 call streaminit(21980);
 do i = 1 to 120;
  Rating = 1;
  UnitsSold = round(abs(rand('Normal',50,25)));
  output;  
  Rating = 2;
  UnitsSold = round(abs(rand('Normal',75,30)));
  output;
  Rating = 3;
  UnitsSold = round(abs(rand('Normal',65,35)));
  output;
  Rating = 4;
  UnitsSold = round(abs(rand('Normal',85,25)));
  output;
  Rating = 5;
  UnitsSold = round(abs(rand('Normal',105,20)));
  output;
 end;
 drop i;
run;

Plotting the data with the SGPLOT procedure ...

proc sgplot data=BookSales_NonMonotonic;
 title Sold Copies by Rating (Non Monotonic Data);
 vbox UnitsSold / group=Rating;
run;

gives the following picture.

You see that the number of book sales for a 3-star rating falls below the average sales for a 2-star rating, and that sales increase again for 4- and 5-star ratings.

To calculate the regression coefficients, you run the linear model with the same statements as before, just for the new dataset.

proc glmselect data=BookSales_NonMonotonic;
 title Linear Model: UnitsSold = Rating (Non Monotonic Data);
 class Rating / param=ordinal;
 model UnitsSold = Rating;
run;

This results in the following parameter estimates:

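You see again that the parameter estimates reflect the steps between adjacent ratings: the increase between "1 and 2", "3 and 4", and "4 and 5", as well as the decrease between "2 and 3". Because the estimate for the 3-star rating has a different sign than the others, the effect is not monotonic across the levels.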
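
If you want to check the signs programmatically rather than by eye, you can capture the parameter estimates in a dataset. The following is a rough sketch that assumes the ODS table is named ParameterEstimates, that it contains a column named Estimate, and that the intercept is its first row; you can verify this with ODS TRACE ON. The dataset names OrdinalEstimates and SignCheck, the variable Direction, and the SELECTION=NONE option are introduced here for illustration.

```sas
/* Capture the parameter estimates table as a dataset.    */
/* Assumption: ODS table name ParameterEstimates with an  */
/* Estimate column (check with ODS TRACE ON).             */
ods output ParameterEstimates=OrdinalEstimates;

proc glmselect data=BookSales_NonMonotonic;
 class Rating / param=ordinal;
 model UnitsSold = Rating / selection=none;
run;

/* Label each step between adjacent ratings by its sign. */
data SignCheck;
 set OrdinalEstimates;
 if _n_ > 1;   /* skip the first row, assumed to be the intercept */
 length Direction $8;
 if Estimate > 0 then Direction = 'increase';
 else if Estimate < 0 then Direction = 'decrease';
 else Direction = 'flat';
run;

proc print data=SignCheck noobs;
run;
```

If all rows show the same direction, the effect is monotonic across the levels; a mix of directions indicates a non-monotonic effect.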

Conclusion

The CLASS statement in SAS provides a large set of coding options for CLASS variables. This allows you to formulate your regression models for the most appropriate business interpretation. Often the REFERENCE coding is used as a default; it is useful if you select a particular reference category and interpret the other categories relative to it. However, there are situations where, from a business point of view, it makes more sense to study the difference between adjacent categories. This is what the ORDINAL coding provides.

These effect coding options are available for SAS9 procedures like CATMOD, GENMOD, GLMSELECT, LOGISTIC, PHREG, and SURVEYPHREG and for many SAS Viya regression procedures.

In my next blog we will take a closer look at the EFFECT coding and a macro that allows you to calculate the "hidden" reference category.