Old-timer farmers can, seemingly miraculously, connect weather to yield. “This year is exactly like 1958. Our crop will be middling.”
Unfortunately, those focused on crop improvement can’t interview everyone, so we can’t know how weather will affect crops everywhere. Unless, that is, we have weather data and know how to make sense of it.
Suppose you have a weather station that outputs temperature (average, minimum and maximum), moisture (precipitation, heavy rain and dry days), plus relative humidity and solar radiation. Gourdji et al. (2015) published a dataset that includes those things, along with bean yields*.
Where do you start with weather data like this? You could plot each weather variable against yield. But that doesn’t capture the data’s substructure (regions, years), and the correlations will be poor. Alternatively, you could make dozens of graphs of each weather variable versus year, field, and so on. But that’s time consuming.
Regression analyses can incorporate all these variables simultaneously. With regressions, we can loosely connect weather to yields. When paired with anomaly detection tools (RStudent residuals, Cook’s D, DFFITS and DFBETAS), we tighten the connection. Quality regressions are the linchpin for forecasting or optimizing crop yields against weather.
Here, we regress crop yield per sown area against quantitative weather variables, including temperature and moisture. Then, we conduct an outlier check (RStudent) and assess influential data points (Cook’s D, DFFITS and DFBETA). Finally, we merge these into a table where we flag anomalies.
/* First, sort the data */
proc sort data=work.Bean_dataset_new;
   by department season 'year'n drydays dtr heavyrainpct rain rhum srad tavg tmax tmin;
run;

/* Generate and name the output diagnostic datasets */
ods output RSTUDENTBYPREDICTED=Rstud
           COOKSDPLOT=Cook
           DFFITSPLOT=Dffits
           DFBETASPANEL=Dfbs;

/* Conduct a regression, producing the named diagnostic plots */
proc reg data=work.Bean_dataset_new
         plots(only label)=
         (RSTUDENTBYPREDICTED
          COOKSD
          DFFITS
          DFBETAS);
   /* Model quantitative yield per sown area against quantitative weather variables */
   SigLimit: model 'yieldsown area'n = drydays heavyrainpct rain rhum srad tavg tmax tmin;
   title 'SigLimit Model - Plots of Diagnostic Statistics';
run;
quit;
data influential;
   /* Merge the diagnostic datasets from above. The output is keyed by
      observation number, which maps to the sort order in the first step. */
   merge Rstud
         Cook
         Dffits
         Dfbs;
   by observation;

   /* Flag observations that exceed at least one cutoff */
   if (abs(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1;
   array dfbetas{*} _dfbetasout: ;
   do i=2 to dim(dfbetas);
      if dfbetas{i} then flag=1;
   end;
   drop i;
   /* Only potentially problematic data points carry flag=1 */
run;
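As a quick follow-up (a minimal sketch, assuming the merged INFLUENTIAL dataset from the step above), you can list only the rows that tripped at least one diagnostic cutoff:

/* Print just the flagged observations for review */
proc print data=influential;
   where flag=1;
   id observation;
run;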
What does the output reveal? First, the regression was significant (p&lt;0.0001). No surprise that yield is not random relative to weather. Second, the adjusted r-squared equals 0.12. Ouch! That spells little relative explanatory power, so it may be worthwhile to take a closer look at outliers and influential data points. Third, measures of temperature were highly significant (p&lt;0.0005). Other variables are tantalizing, but not equally definitive (solar radiation, p=0.1048). Finally, the diagnostic plots reveal that significant anomalies exist within the dataset.
Let’s take a look at the plots (Figure 1):
Figure 1: Output diagnostic plots. (A) RStudent represents the difference in variation with and without each data point (labeled by observation). (B) Cook’s D indicates the distance between what the regression would look like with and without each observation. (C and D) DFBETAS represents the influence of each observation on the individual parameters.
RStudent (panel A) shows a nice variance scatter, with most data points within the potential-outlier thresholds of plus or minus 2. None of the values exceed 3, which would be a greater concern. Cook’s D (panel B) shows about 20 data points considered “influential,” meaning they notably change the direction of the regression line. Observation 314 bears special notice, especially since it was also an RStudent outlier. Panels C and D display the by-variable, by-observation fit metric (DFBETAS), which shows the influential observations by variable. Notably, point 314 flags for the intercept and other parameters, indicating its high yield is an overall anomaly given the weather.
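For reference, the cutoffs commonly used to label influential points scale with the data, and can be computed directly. This is a hypothetical sketch; n and p below are placeholders for your own observation and parameter counts:

/* Commonly used influence cutoffs (n = number of observations,
   p = number of model parameters, including the intercept).
   The values of n and p here are placeholders only. */
data cutoffs;
   n = 1000;  p = 9;               /* substitute your own counts       */
   rstudent_cut = 2;               /* |RStudent| beyond +/-2 is suspect */
   cooksd_cut   = 4/n;             /* Cook's D                          */
   dffits_cut   = 2*sqrt(p/n);     /* |DFFITS|                          */
   dfbetas_cut  = 2/sqrt(n);       /* |DFBETAS|, per parameter          */
run;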
We trace point 314 back to a 2004 bean harvest in the apante season in Rivas Department. Its yields are extraordinarily high, especially given the low rain and low relative humidity. Maybe the yield value was mis-recorded, or the weather station broke. Or maybe the growers employed curve-busting technology, like drip irrigation. Proc REG provided our point of comparison, and formed the basis for rooting out data quality issues.
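To pull up the raw record behind an observation number (a sketch that assumes the sort order from the first PROC SORT step still holds):

/* Observation numbers map to the sorted order, so row 314 of the
   sorted dataset is the flagged harvest */
proc print data=work.Bean_dataset_new (firstobs=314 obs=314);
run;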
Old timers are irreplaceable as crop forecasters. When they are absent, though, the best forecasts require the best data. Proc REG includes handy diagnostic features that give us a way to correlate weather variables with yield. Higher minimum temperatures and lower maximum temperatures clearly link to higher yields. We can also discern that the effect of rain trends curvilinear, heavy rain negative, and relative humidity positive. But the original adjusted r-squared of 0.12 provides a poor basis for forecasting Nicaraguan bean yields from weather. By plotting and eliminating pesky outliers and influential data points, we can better connect weather to yield.
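One way to close the loop (a sketch, assuming the INFLUENTIAL dataset above, whose row order matches the sorted input) is to drop the flagged rows and refit:

/* One-to-one merge: row order of INFLUENTIAL matches the sorted input */
data bean_clean;
   merge work.Bean_dataset_new influential(keep=flag);
   if flag ne 1;   /* keep only unflagged observations */
run;

proc reg data=bean_clean;
   model 'yieldsown area'n = drydays heavyrainpct rain rhum srad tavg tmax tmin;
run;
quit;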
*Gourdji, Sharon; Läderach, Peter; Martinez Valle, Armando; Zelaya Martinez, Carlos; Lobell, David B. 2015. “Bean_dataset_new.xlsx”, Replication Data for: Historical climate trends, deforestation, and maize and bean yields in Nicaragua, https://doi.org/10.7910/DVN/29206/UNZ63Q, Harvard Dataverse, V1.