Most of us find it daunting to be confronted with a huge spreadsheet. Where do I start? What’s important, and what’s not? If your variables look like a large, intractable “Hill o’ Beans,” you can use iterative modeling to pare down and prioritize variables.
Fortunately, there are easier ways than muddling through data sets. That's why SAS offers easy-to-use and scalable variable selection tools. If you need to pare down the number of variables you’re plying, try using the GLMSelect Procedure. Proc GLMSelect applies if your data set has a continuous dependent (response) variable, such as yield, and independent (explanatory) variables of continuous, interval, or categorical types.
Let's look at a study that queried the association between Nicaraguan weather, seasons, departments, and crop yields. I am grateful to Gourdji et al. (2015) for uploading the dataset*. We explore several quantitative explanatory factors, including four measures of temperature, three of precipitation, plus relative humidity and solar radiation. And if that’s not enough, the categorical data include eight years, 17 departments, and three seasons. We will use a stepwise Proc GLMSelect model to select the factors most likely to help explain bean yield per sown area.
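Before any modeling, the spreadsheet has to be read into a SAS data set. The step below is a minimal sketch of one way to do that. The file path is a placeholder, and it assumes your SAS installation can read .xlsx files through dbms=xlsx. The validvarname=any option is there because some column names in the code that follows contain spaces and are written as name literals (for example 'year'n).

```sas
/* Allow column names that contain spaces, written later as name literals */
options validvarname=any;

/* Placeholder path (an assumption): point this at your downloaded copy of the file */
proc import datafile="/path/to/Bean_dataset_new.xlsx"
    out=work.Bean_dataset_new
    dbms=xlsx
    replace;
    getnames=yes;   /* take variable names from the first row of the sheet */
run;

/* Check the variable names and types that were actually read in */
proc contents data=work.Bean_dataset_new;
run;
```

Running Proc Contents first is a cheap way to confirm that the names you type in the model statement match what the import produced.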
After reading in the Excel file, we execute the following model:
/*First Sort the data*/
Proc sort data=work.Bean_dataset_new;
by department season 'year'n drydays dtr heavyrainpct rain rhum srad tavg tmax tmin;
run;
/*GLM select models variable influence in a general linear model. All plots option includes iterative variable significance level, fit criteria, and coefficient progression for each variable*/
Proc glmselect data=work.Bean_dataset_new plots=all;
/*Class categorical variables*/
class department season 'year'n;
/*Model the response variable against categorical and continuous variables. The bar between variables expands them into main effects plus all of their interactions. The selection settings go inside the parentheses after selection=stepwise*/
Stepwise: model 'yieldsown area'n = department|season|'year'n drydays dtr heavyrainpct rain rhum srad tavg tmax tmin / selection=stepwise(select=SL slentry=0.01 slstay=0.01) details=steps;
/*We use stepwise selection here. We could alternatively use selection=forward or selection=backward.*/
Title "Stepwise model selection for yield - significance to enter = 0.01, significance to stay = 0.01";
Run;
First, we notice that the interactions of department and season, and of season and year, rise to tip-top importance. But beyond those model effects, other variables didn’t make the final model. After all, we set a significance level of 0.01 both for variables to enter and for variables to stay. So, relaxing one or both of these settings may result in additional model variables.
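As a rough sketch, relaxing the thresholds only means changing the two significance levels and rerunning. The value 0.1 below is an illustrative choice, not a recommendation, and the rest of the step repeats the model above.

```sas
/* Same stepwise model, with looser entry and stay thresholds (0.1 is illustrative) */
proc glmselect data=work.Bean_dataset_new plots=all;
    class department season 'year'n;
    model 'yieldsown area'n = department|season|'year'n
          drydays dtr heavyrainpct rain rhum srad tavg tmax tmin
          / selection=stepwise(select=SL slentry=0.1 slstay=0.1) details=steps;
    title "Stepwise model selection for yield - significance to enter = 0.1, significance to stay = 0.1";
run;
```

Comparing the selection summary from this run with the stricter one shows which effects enter only once the thresholds are loosened.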
Second, we notice the information criteria (“AIC,” “AICC,” and “SBC”). We will take note of these values in case we run other models. Say we decide to transform the response variable or change the interaction terms we model. Or perhaps we run a forward selection model, then a backward selection model. Smaller information criterion values indicate better models.
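For instance, forward and backward runs of the same model could be sketched as follows. The significance levels simply mirror the stepwise run, and the stats= list is an assumption about which fit statistics you want printed at each step for side-by-side comparison.

```sas
/* Forward selection: effects enter one at a time and are never removed */
proc glmselect data=work.Bean_dataset_new;
    class department season 'year'n;
    model 'yieldsown area'n = department|season|'year'n
          drydays dtr heavyrainpct rain rhum srad tavg tmax tmin
          / selection=forward(select=SL slentry=0.01)
            stats=(aic aicc sbc adjrsq);
    title "Forward selection for yield - significance to enter = 0.01";
run;

/* Backward selection: start from the full model and drop effects one at a time */
proc glmselect data=work.Bean_dataset_new;
    class department season 'year'n;
    model 'yieldsown area'n = department|season|'year'n
          drydays dtr heavyrainpct rain rhum srad tavg tmax tmin
          / selection=backward(select=SL slstay=0.01)
            stats=(aic aicc sbc adjrsq);
    title "Backward selection for yield - significance to stay = 0.01";
run;
```

Each run reports the requested criteria for its selected model, so the values can be lined up against those from the stepwise run.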
Finally, we notice the adjusted R-Sq (0.5078). This fit stands on the weaker side: the model explains only about half of the variation in yield. Does this represent a paradox? On one hand, our model did not bring in weather measures. On the other hand, we know that weather exerts notable influence over crop growth and development. So our next action might be to run a regression analysis of weather variables against yield, to target pesky outliers and isolate influential data points.
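One way to sketch that follow-up is a Proc Reg step with residual and influence diagnostics. This is an illustration of the idea rather than an analysis that was run for this post, and the choice of diagnostics and plots is an assumption.

```sas
/* Regress yield on the weather variables only, requesting diagnostics */
proc reg data=work.Bean_dataset_new
         plots(only)=(cooksd rstudentbyleverage);
    model 'yieldsown area'n = drydays dtr heavyrainpct rain rhum srad tavg tmax tmin
          / r influence;   /* r: residual analysis; influence: per-observation influence statistics */
    title "Weather variables against yield - outlier and influence diagnostics";
run;
quit;
```

The Cook's D plot and the plot of studentized residuals against leverage point to observations worth a closer look before deciding how to handle them.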
GLMSelect is a handy way to identify useful variables and interactions. Consider taking SAS Statistics 1 for additional context and coding content (SAS offers this e-Learning free!). Also, take a look under the hood to learn more about how GLMSelect works. Or consider using complementary Machine Learning tools. No matter what, if you prioritize variables, your analysis dimensions will amount to more than a ‘hill o’ beans’.
*Gourdji, Sharon; Läderach, Peter; Martinez Valle, Armando; Zelaya Martinez, Carlos; Lobell, David B. 2015. Replication Data for: Historical climate trends, deforestation, and maize and bean yields in Nicaragua. "Bean_dataset_new.xlsx". Replication Data for: Historical climate trends, deforestation, and maize and bean yields in Nicaragua, https://doi.org/10.7910/DVN/29206/UNZ63Q, Harvard Dataverse, V1.