Fitting Regression Models to Formulation Data

In my previous post, I discussed no-intercept regression models: when to use them and how to interpret the results. This post builds on those concepts and applies them to formulation data, a special situation that changes how the models are interpreted.

A formulation is a situation in which the predictor variables, called ingredients or components, sum to a constant value, usually 1 or 100%. For example, three different fruit juices might be blended to create a new, better-tasting juice. Whether you create a glass of the blend or a tank truck full of it does not matter. Because of this, we measure the proportions of the ingredients in the blend rather than their amounts. These situations are common in many industries, including chemistry and pharmaceuticals.

Analyzing formulation data has some inherent challenges. Because the ingredients must add to one, the predictors are perfectly collinear with the intercept, and a typical regression model with an intercept cannot be fit: unique parameter estimates do not exist. To illustrate, this program creates a small set of formulation trials and fits a regression model.

data mix;
     input x1 x2 x3 y;
     datalines;
1 0 0 38.17
1 0 0 37.00
0 1 0 41.10
0 1 0 41.65
0 0 1 71.27
0 0 1 69.50
;
run;
title1 'Regression with Intercept';
proc reg data=mix plots=(fit(nolimits));
     model y=x1 x2 x3;
run;

The results from this regression show the problem with formulation data. A note in the output states that the model is not full rank, so the least-squares solution is not unique. To provide some analysis, SAS sets a parameter to 0. Here it is X3, the last main effect in the model. The remaining estimates are then reported as biased, because they depend on which parameter was set to 0. The output even includes a line showing the linear dependency that was found.
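
Before fitting, it can help to confirm that the data really do satisfy the mixture constraint. The following is a minimal sketch, not part of the original analysis. The data set name check, the variables total and violates, and the 1E-8 tolerance are assumptions chosen for illustration.

```sas
/* Sum the component proportions in each trial of the mix data set */
data check;
     set mix;
     total = sum(x1, x2, x3);
     /* 1 = row breaks the constraint; the tolerance allows for rounding */
     violates = (abs(total - 1) > 1e-8);
run;
title1 'Check of the Mixture Constraint';
proc print data=check;
run;
```

If every row shows violates=0, the components sum to 1 in each trial, which is the linear dependency with the intercept that PROC REG reports.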

So how should the formulation data be analyzed? The models are adjusted using the constraint. The Scheffe canonical form is the most common adjustment. The original regression model for this example is Y=B0+B1*X1+B2*X2+B3*X3. But since X1+X2+X3=1, then we can rewrite this as Y=B0*(X1+X2+X3)+B1*X1+B2*X2+B3*X3. Simplifying and regrouping the terms leads to Y=(B0+B1)*X1+(B0+B2)*X2+(B0+B3)*X3. We can then rewrite the terms in parentheses as B1', B2', and B3' respectively which yields Y=B1'*X1+B2'*X2+B3'*X3. This reparameterization of the model allows the model to be fit despite the constraint. So although the model looks like a no-intercept model, there is an intercept. The intercept is included with the main effect parameter estimates.
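
One consequence of the canonical form can be checked numerically. Setting X1=1 and X2=X3=0 gives Y=B1', so each coefficient is the expected response for the pure component. Because the mix data set contains only pure blends, the least-squares estimates should equal the replicate means for each blend. This is a minimal sketch using PROC MEANS. The MAXDEC=3 setting is only an assumption about how many decimals to display.

```sas
title1 'Mean Response for Each Pure Component';
proc means data=mix mean maxdec=3;
     class x1 x2 x3;   /* each observed combination is one pure blend */
     var y;
run;
```

For these data the means are 37.585 for pure X1, 41.375 for pure X2, and 70.385 for pure X3.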

With this unusual model, one might think that fitting the Scheffe canonical mixture model would be difficult. In fact, it is not. You have three choices. The first is to use the NOINT option in the MODEL statement of PROC REG.

title1 'Regression with NOINT Option';
proc reg data=mix plots=(fit(nolimits));
     model y=x1 x2 x3/noint;
run;

This code would yield these results:

With this output, you will notice that the model is treated as a no-intercept model, as expected. This means that the Analysis of Variance table will not be correct for this revised model. You can see this by looking at the degrees of freedom for the model. The table reports 3, one for each parameter, but the correct value is 3-1=2, since there is still a mean to estimate. However, the parameter estimates are correct. This model would be Y=37.585*X1+41.375*X2+70.385*X3. The tests on those parameter estimates use the null hypothesis that the parameter equals 0. Although each test is performed correctly, it is of limited usefulness because the parameter estimate contains the intercept. We would not expect the parameter to be equal to 0 in this situation.
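
A hypothesis that may be more useful here is that all three components have the same coefficient, B1'=B2'=B3', meaning the blend proportions do not affect the response. This is a minimal sketch using the TEST statement of PROC REG. The label EqualBlend is a hypothetical name chosen for illustration.

```sas
title1 'NOINT Model: Test of Equal Component Coefficients';
proc reg data=mix;
     model y=x1 x2 x3/noint;
     /* joint test that B1'=B2'=B3', written as two linear equations */
     EqualBlend: test x1-x2=0, x2-x3=0;
run;
quit;
```

The test has 2 numerator degrees of freedom, the same count argued above to be the correct model degrees of freedom.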

The second possibility is to use the RESTRICT statement with INTERCEPT=0. The code and the results are here.

title1 'Regression with RESTRICT Option';
proc reg data=mix plots=(fit(nolimits));
     model y=x1 x2 x3;
     restrict intercept=0;
run;

These results are essentially the same as what was seen when using the NOINT option.

The third option may not make much sense at first, but it does work. Instead of fitting a no-intercept model, fit a model with an intercept, but leave off one of the main effects. The intercept line of the Parameter Estimates table will be the parameter estimate for the removed term. This approach has the added benefit of providing the proper ANOVA table.

title1 'Regression with Intercept, but One Term Removed';
proc reg data=mix plots=(fit(nolimits));
     model y=x1 x2;
run;

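Starting with the Parameter Estimates table, you will see that the intercept has the same estimate as X3 from the other analyses, 70.385. The X1 and X2 estimates are not the same as before. Each is now the difference from the X3 estimate: 37.585-70.385=-32.800 for X1 and 41.375-70.385=-29.010 for X2. Adding the intercept to each one gives back the earlier estimates. We also have the added benefit of the ANOVA table now being correct, since there are only 2 degrees of freedom for the model.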
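
To get the canonical-form coefficients back from this fit, add the intercept to each slope. Substituting X3=1-X1-X2 into Y=B1'*X1+B2'*X2+B3'*X3 gives an intercept of B3' and slopes of B1'-B3' and B2'-B3'. This is a minimal sketch using the OUTEST= option of PROC REG. The data set names est and scheffe and the variables b1_prime, b2_prime, and b3_prime are assumptions chosen for illustration.

```sas
/* Save the estimates from the reduced model without printing the fit */
proc reg data=mix outest=est noprint;
     model y=x1 x2;
run;
quit;

/* In the OUTEST= data set, Intercept, x1, and x2 hold the estimates */
data scheffe;
     set est;
     b1_prime = Intercept + x1;
     b2_prime = Intercept + x2;
     b3_prime = Intercept;      /* coefficient of the removed component */
     keep b1_prime b2_prime b3_prime;
run;
title1 'Canonical-Form Coefficients Recovered from the Reduced Model';
proc print data=scheffe;
run;
```

The printed values should agree with the estimates from the NOINT and RESTRICT fits.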

Formulation models are different from other regression models. They appear to be no-intercept models, but in reality there is an intercept that is not expressed directly because of the constraint that the components must add to 1. Although the NOINT option or the RESTRICT statement could be used, the most informative results can be obtained by simply removing one of the effects from the model and leaving the intercept as the placeholder for the missing effect.

Find more articles from SAS Global Enablement and Learning here.
