Hill o’ Beans: Prioritize variables using GLMSelect


Most of us find it daunting to be confronted with a huge spreadsheet. Where do you start? What’s important, and what’s not? If your variables look like a large, intractable “Hill o’ Beans,” you can use iterative modeling to pare down and prioritize them.

Fortunately, there are easier ways than muddling through data sets, which is why SAS offers easy-to-use, scalable parameter selection tools. If you need to pare down the number of variables you’re working with, try the GLMSelect procedure. Proc GLMSelect applies if your data set has a continuous dependent (response) variable, such as yield, and independent (explanatory) variables that may be continuous, interval, or categorical.

Case Study: Nicaragua Crop Yields

Let's look at a study that queried the association between Nicaraguan weather, seasons, departments, and crop yields. I am grateful to Gourdji et al. (2015) for uploading the dataset*. We explore several quantitative explanatory factors, including four measures of temperature and three of precipitation, plus relative humidity and solar radiation. And if that’s not enough, the categorical data include eight years, 17 departments, and three seasons. We will use a stepwise Proc GLMSelect model to select the factors most likely to explain bean yield per sown area.

The Code
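
For readers following along, here is a minimal sketch of how the Excel file might be read in. The file path is a placeholder and an assumption, not taken from the study, and the VALIDVARNAME=ANY option is set first so that column names containing spaces can be kept and referenced as name literals such as 'yieldsown area'n.

```sas
/* Assumption: keep column names with spaces so they can be used as name literals */
options validvarname=any;

/* Assumption: hypothetical file location; point this at your own copy of the workbook */
proc import datafile="/path/to/Bean_dataset_new.xlsx"
    out=work.Bean_dataset_new
    dbms=xlsx
    replace;
    getnames=yes;   /* use the first row of the sheet as variable names */
run;

/* Check the variable names and types before modeling */
proc contents data=work.Bean_dataset_new;
run;
```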

After reading in the Excel File, we execute the following model:

/*First Sort the data*/
Proc sort data=work.Bean_dataset_new;
by department season 'year'n drydays dtr heavyrainpct rain rhum srad tavg tmax tmin;
run;
/*GLM select models variable influence in a general linear model. All plots option includes iterative variable significance level, fit criteria, and coefficient progression for each variable*/
Proc glmselect data=work.Bean_dataset_new plots=all;
/*Class categorical variables*/
class department season 'year'n;
/*Model the response variable against categorical and continuous variables. The bar operator between class variables expands to their main effects plus all of their interactions*/
Stepwise: model 'yieldsown area'n = department|season|'year'n drydays dtr heavyrainpct rain rhum srad tavg tmax tmin / selection=stepwise(select=SL slentry=0.01 slstay=0.01) details=steps;
/*We use stepwise selection here.  We could alternatively use selection=forward or selection=backward.*/
Title "Stepwise model selection for yield - significance to enter = 0.01, significance to stay = 0.01";
Run;

How the GLMSelect procedure helped prioritize variables

First, the interactions of department and season, and of season and year, rise to tip-top importance. Beyond those model effects, no other variables made the final model. After all, we set the significance level at 0.01 both for variables to enter and for variables to stay. Relaxing one or both of these settings may bring additional variables into the model.
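
As a rough sketch of what relaxing those settings could look like, the same model can be rerun with looser thresholds. The value 0.05 below is only an illustrative assumption, not a recommendation from the study.

```sas
/* Sketch: same effects as the model above, with looser entry and stay thresholds */
proc glmselect data=work.Bean_dataset_new plots=all;
    class department season 'year'n;
    model 'yieldsown area'n = department|season|'year'n
          drydays dtr heavyrainpct rain rhum srad tavg tmax tmin
          / selection=stepwise(select=SL slentry=0.05 slstay=0.05) /* assumed values */
            details=steps;
    title "Stepwise selection for yield - significance to enter = 0.05, to stay = 0.05";
run;
```

Comparing the selection summary of this run with the stricter one shows which effects, if any, enter only under the looser criteria.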

Second, we notice the information criteria (“AIC,” “AICC,” and “SBC”). These values are worth noting for comparison if we run other models. Say we decide to transform the response variable or change the interaction terms we model. Or perhaps we run a forward selection model, then a backward selection model. Smaller information criteria values indicate better models, provided the models are fit to the same response on the same scale.
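
A minimal sketch of the forward and backward alternatives follows, so their fit statistics can be set side by side with the stepwise run. The macro variable name effects is a hypothetical convenience introduced here, and the 0.01 thresholds simply mirror the stepwise model.

```sas
/* Hypothetical macro variable holding the candidate effects, to avoid retyping them */
%let effects = department|season|'year'n
               drydays dtr heavyrainpct rain rhum srad tavg tmax tmin;

/* Forward selection: effects enter one at a time if significant at slentry */
proc glmselect data=work.Bean_dataset_new;
    class department season 'year'n;
    model 'yieldsown area'n = &effects
          / selection=forward(select=SL slentry=0.01) details=summary;
    title "Forward selection for yield";
run;

/* Backward selection: start from the full model and drop effects that fail slstay */
proc glmselect data=work.Bean_dataset_new;
    class department season 'year'n;
    model 'yieldsown area'n = &effects
          / selection=backward(select=SL slstay=0.01) details=summary;
    title "Backward selection for yield";
run;
```

Each run reports AIC, AICC, and SBC for its selected model, so the three selection methods can be compared directly because the response is unchanged.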

Finally, we notice the adjusted R-Sq (0.5078). The model explains roughly half of the variation in yield, which is on the weaker side. Is this a paradox? On one hand, our model did not bring in weather measures. On the other hand, we know that weather exerts notable influence over crop growth and development. So our next action might be to run a regression analysis of weather variables against yield, to target pesky outliers and isolate influential data points.
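
One way that follow-up regression might be sketched is with Proc REG and its influence diagnostics. This is an assumption about how to proceed rather than part of the original analysis, and the choice of plots is illustrative.

```sas
/* Sketch: regress yield on the weather variables only and request diagnostics */
proc reg data=work.Bean_dataset_new
         plots(only)=(cooksd rstudentbyleverage);
    model 'yieldsown area'n = drydays dtr heavyrainpct rain rhum srad tavg tmax tmin
          / r           /* residual analysis, including studentized residuals */
            influence;  /* leverage, DFFITS, and DFBETAS for each observation */
    title "Weather variables vs. yield - outlier and influence diagnostics";
run;
quit;
```

Observations with large studentized residuals or a high Cook's D are candidates for a closer look before rerunning the selection.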

The take home message

GLMSelect is a handy way to identify useful variables and interactions. Consider taking SAS Statistics 1 for additional context and coding content (SAS offers this e-Learning free!). Also, take a look under the hood to learn more about how GLMSelect works. Or consider using complementary Machine Learning tools. No matter what, if you prioritize variables, your analysis dimensions will amount to more than a ‘hill o’ beans’.

*Gourdji, Sharon; Läderach, Peter; Martinez Valle, Armando; Zelaya Martinez, Carlos; Lobell, David B. 2015. Replication Data for: Historical climate trends, deforestation, and maize and bean yields in Nicaragua. "Bean_dataset_new.xlsx". Replication Data for: Historical climate trends, deforestation, and maize and bean yields in Nicaragua, https://doi.org/10.7910/DVN/29206/UNZ63Q, Harvard Dataverse, V1.