Modifying your Models with GENMOD

When analyzing data, many analysts find that they rely heavily on Linear Regression and Logistic Regression. Both are excellent and very helpful, but what if your response variable does not meet the assumptions that these approaches require? Do you transform your response, or is there another option that may solve this problem? In this post, we will discuss the GENMOD procedure within SAS software and provide guidance on using it.

PROC GENMOD is a powerful procedure in SAS software used for fitting generalized linear models. These models extend traditional linear models by allowing the mean of a population to depend on a linear predictor through a nonlinear link function. This flexibility makes PROC GENMOD a versatile tool for various types of data analysis.

What is a generalized linear model? It is a generalization of ordinary linear regression that allows the response variable to follow an error distribution other than the normal distribution. Examples include Poisson regression for count data, Gamma regression for positive real values, and even Logistic regression for binary data. Yes, you read that correctly. Logistic regression is a form of generalized linear modeling. It is just so frequently used that it got its own procedure within SAS software.

Linear Regression analysis includes the assumption that the error term is normally distributed. As a result, the response variable, conditional on the predictors, is normally distributed as well.

Logistic Regression analysis applies when the response variable is binary or multinomial. This means that the response is either a yes/no outcome or has a finite number of levels, which may be ordinal or nominal.

Much time and energy in statistics classes is spent on these two approaches because they are applied so frequently. However, imagine that you are modeling calls to a call center. The response variable is the number of calls answered each day. When we look at the support of this response variable, we quickly see that it fits neither Linear Regression nor Logistic Regression. So, what is this analysis going to be?

The support is the non-negative integers. If we look for distributions with this type of support, we find options such as the Poisson and the Negative Binomial. (The Tweedie distribution is a related choice when the response is non-negative and continuous with exact zeros.) But can we perform an analysis that uses these types of distributions? Certainly!

Key Features of PROC GENMOD

The first key feature of PROC GENMOD is the use of distributions from the exponential family. To be a member of the exponential family, a distribution, discrete or continuous, must be expressible in a specific form. Distributions of this type are the ones used in PROC GENMOD; examples include the normal, binomial, Poisson, and gamma distributions. Within PROC GENMOD, you tell SAS which distribution you would like to use with the DIST= option in the MODEL statement.

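/* Poisson regression for a count response (number of satellites) */
proc genmod data=work.crab;
   class color spine;   /* treat color and spine as categorical */
   model satellites = color spine width weight / dist=poisson link=log;
run; quit;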
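
The "specific form" mentioned above is usually written as follows in the generalized linear model literature. This is a sketch in one common notation, where the symbols θ (natural parameter), φ (dispersion parameter), and the functions a, b, and c are introduced here for illustration and are not taken from the post. The second line shows how the Poisson distribution fits the pattern.

```latex
% Exponential-family density in the form commonly used for GLMs
f(y;\theta,\phi) = \exp\!\left\{ \frac{y\,\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\}

% Poisson with mean \mu:
%   \theta = \log\mu, \quad b(\theta) = e^{\theta}, \quad a(\phi) = 1, \quad c(y,\phi) = -\log y!
P(Y = y) = \frac{\mu^{y} e^{-\mu}}{y!} = \exp\{\, y\log\mu - \mu - \log y! \,\}
```

In this notation the mean is b'(θ), which for the Poisson case gives back μ.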

Which possible distribution might your response variable follow? Here is a guide that may assist in making this decision. Depending on the support of your response variable, use this table to determine a suggested distribution.

Second, PROC GENMOD supports various link functions, such as logit, probit, log, and identity. To understand what a link function does, start with a linear regression model. There, the expected value of the response Y is set equal to the linear predictor, that is, the linear combination of the predictors and their parameters. The linear predictor can be positive, negative, or zero, and because linear regression assumes normality, this causes no difficulty: the support of the normal distribution is the entire real line.

Now let's consider a response that follows a Poisson distribution. Recall that the support of a Poisson distribution is the non-negative integers, so the expected value of the response must be positive. In this case, we can take the log of the expected count and set it equal to our linear predictor. The expected count is then the exponential of the linear predictor, which mathematically forces it to be positive.
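
Written out, the log link described above looks like this. The notation (μ for the expected count, η for the linear predictor, and p predictors with coefficients β) is assumed here for illustration.

```latex
\log(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}
\quad\Longrightarrow\quad
\mu_i = \exp(\eta_i) > 0
```

One consequence of this form is that a one-unit increase in a predictor multiplies the expected count by the exponential of its coefficient, rather than adding a fixed amount to it.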

Each distribution in PROC GENMOD has a default link function, which for most distributions is the canonical link, and this is what is typically used for its regression analysis. However, you may use a different link function if you believe that the relationship between the expected value of the response and the linear predictor differs from the one the default implies. You specify the link function with the LINK= option in the MODEL statement.

The following table notes the canonical or default link function that is used for each distribution type.
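
As a reminder of the standard results, the canonical links for four common distributions can be written as below. This is a sketch from general GLM theory with μ denoting the mean of the response; it is not a listing of every distribution that the procedure supports.

```latex
\begin{array}{ll}
\text{Normal:}   & g(\mu) = \mu \quad \text{(identity)} \\
\text{Poisson:}  & g(\mu) = \log(\mu) \quad \text{(log)} \\
\text{Binomial:} & g(\mu) = \log\!\left(\dfrac{\mu}{1-\mu}\right) \quad \text{(logit)} \\
\text{Gamma:}    & g(\mu) = \dfrac{1}{\mu} \quad \text{(inverse, up to sign)}
\end{array}
```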

SAS automatically uses the default link for a distribution unless you explicitly request a different one. The reverse is not true: choosing a link function does not automatically select an associated distribution, so specify DIST= whenever the response is not normal.
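
A minimal sketch of overriding the default link follows. It borrows the data set work.birth and the variables LOW, AGE, and LWT from the example later in this post, and the choice of predictors is for illustration only. The first step relies on the default link for the binomial distribution, and the second requests a probit link instead.

```sas
/* Sketch: same binomial distribution, two different links.          */
/* work.birth, LOW, AGE and LWT are borrowed from the example later  */
/* in this post; substitute the names from your own data.            */
proc genmod data=work.birth descending;
   /* LINK= omitted: the default for dist=binomial is the logit link */
   model LOW = AGE LWT / dist=binomial;
   title 'Binomial model, default (logit) link';
run;

proc genmod data=work.birth descending;
   /* LINK= stated explicitly to override the default */
   model LOW = AGE LWT / dist=binomial link=probit;
   title 'Binomial model, probit link';
run;
title;   /* clear the title so it does not carry over to later steps */
```

Note that DIST= is stated in both steps. If it is left out, the procedure falls back on the normal distribution regardless of the link requested, which is rarely what is wanted for a binary response.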

The third key feature of PROC GENMOD is flexible model specification. Users can specify complex models with multiple predictors, interactions, and nested effects.

proc genmod data=work.crab;
   class color spine;
   /* color|spine expands to color, spine, and the color*spine interaction */
   model satellites = color|spine width weight / dist=poisson link=log;
run; quit;
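
For comparison, here is a sketch of the same model with the bar operator written out in full. The TYPE3 option is an addition that is not in the example above; it requests Type 3 likelihood ratio tests, which is one way to judge whether the interaction is worth keeping.

```sas
/* Sketch: the bar operator written out, plus Type 3 tests.          */
/* Data set and variable names are those of the example above.       */
proc genmod data=work.crab;
   class color spine;
   /* color|spine is shorthand for: color spine color*spine */
   model satellites = color spine color*spine width weight
         / dist=poisson link=log type3;
run;
```

A nested effect uses parentheses in place of the asterisk, in the form spine(color), which reads as spine nested within color.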

The fourth key feature of PROC GENMOD is assessment of model fit. PROC GENMOD provides various statistics and tests to assess the fit of the model, including deviance, likelihood ratio tests, and residual analysis. Statistics such as AIC, AICC, and BIC should be familiar. These are information criteria used to compare candidate models fit to the same data, with smaller values indicating a better trade-off between fit and complexity.
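
One way to put these statistics to work is to fit the same mean model under two candidate distributions and compare the results. The sketch below reuses the crab data from the earlier examples; pairing Poisson with negative binomial is an illustration chosen here, not a recommendation from the post.

```sas
/* Sketch: fit the same mean model under two count distributions,    */
/* then compare the fit statistics reported for each.                */
proc genmod data=work.crab;
   class color spine;
   model satellites = color spine width weight / dist=poisson link=log;
   title 'Poisson fit';
run;

proc genmod data=work.crab;
   class color spine;
   model satellites = color spine width weight / dist=negbin link=log;
   title 'Negative binomial fit';
run;
title;   /* clear the title so it does not carry over to later steps */
```

In each run, look at the "Criteria For Assessing Goodness Of Fit" table. A deviance divided by its degrees of freedom that is well above 1 in the Poisson fit suggests overdispersion, and a lower AIC or BIC for the negative binomial fit points in the same direction.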

Users can also check the functional form of a predictor by using the ASSESS statement, which produces cumulative residual plots. The SEED= option fixes the random number seed so that the simulation can be reproduced across runs of the code. The RESAMPLE= option controls the number of simulated paths used to compute the p-value of the test.

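proc genmod data=work.birth descending;
   /* Reference-cell coding; FTV overrides it with ordinal coding */
   class ETH(ref='3') PTL HT UI FTV(param=ordinal) / param=ref ref=first;
   /* AGE|FTV expands to AGE, FTV, and AGE*FTV */
   model LOW = AGE|FTV LWT PTL HT / dist=binomial link=logit;
   /* Cumulative-residual check of the functional form of AGE */
   assess var=(age) / resample=5000 seed=27513;
   title 'Low Birth Weight Model';
run; quit;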

This is only the beginning of the capabilities of PROC GENMOD. From zero-inflation to GEEs, GENMOD is a very powerful and flexible procedure. But those topics are best left to a different post. If you would like more information about GENMOD, please reference the documentation here.

Find more articles from SAS Global Enablement and Learning here.
