Modeling two response variables simultaneously otherwise require sequential steps. One way to do this is to derive a covariance matrix for the first response variable, then use that in a generalized linear model for the second (see this breeding example). A simpler way is to derive a summary value somehow combining two responses, then model that derivative.
Various summary values can include:
Here I present an example to select for crop genotypes with higher protein based on “higher than expected protein based on yield.” These data represent designed field trials where crop protein and yield were measured as responses. The breeding goal is to select high protein lines that are also high yielding (there tends to be a protein versus yield among grain and pulse crops). In this case, genotypes were trialled at multiple locations, and therefore a ‘location’ term shows up as a covariate.
Similar methods for modeling “Grain Protein Deviation, (GPD)” have been used among breeders as described by Hänsel (2001) and Monaghan et al. (2001). The essence of the method is to first derive residuals for protein by yield, then use that residual variable as our response variable in a generalized linear model.
/* The field trial location is an important stratification, so we first sort the table by location, then, for the regression step, include location in the “by” statement. */ proc sort data=WORK.IMPORT out=Work.Sorted; by Location; run; /* Run a linear regression for protein (y) on the basis of yield (x). Each field plot (observation) represents a data point.*/ proc reg data=Work.Sorted plots(only)=(diagnostics residuals fitplot observedbypredicted); model Protein =Yield; output out=work.BestOfBoth p=Predicted student=Surprises; /*The goal of this output is to derive the studentized residual. I called the data point “surprises” since we seek lines of surprisingly high protein and yield.*/ by Location; /*Including the by statement for location is the first step to account for structural variation in the study. If location were not included, the most extreme locations would overly contribute to the residual variation, which is undesirable. One could also include other structural categories like region or block at this step. */ run;
These residuals represent the distance from the fit line of yield and protein. Positive numbers represent “good” genotypes (high protein with respect to yield), negatives represent “bad” genotypes and 0 represents average.
Prior to running a complex mixed model, this would be an opportune time to review leveraged outliers. Full panels are included to facilitate data exploration.
Using residuals as the response variable has the additional advantage of centering the data, making it more likely to abide by ANOVA assumptions. Heteroskedasticity or non-normality problems can me remediated by transforming the input variables or changing the regression strategy (e.g. quadratic or cubic models).
The next step is a standard Linear Mixed Model (LMM). Here we derive estimates for each genotype overall. Random factors of block within location, and its intercept, are included. Field studies often have multiple, hierarchical covariates; an example for spatial covariance is included as a comment.
proc mixed data=Work.BestOfBoth CL=WALD plots=all ; class BLOCK Genotype location; model Surprises = Genotype / ddfm=Satterthwaite; random int block/ subject=location; *Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */ lsmeans Genotype / cl; ods output lsmeans=Residuals; run; /*We derived genotype estimates through LSMeans. The model could also re-tooled for genomic selection. In that case, genotype would be named a random factor and its EBLUPs could be generated in concert with genomic relationship parameters.*/
LSMeans gave us our genotype estimates for the residuals (high protein and yield) given covariates of block and location.
Note the mixed models are the same, except for the response variable and we did not request confidence limits on the LSMeans statement.
proc mixed data=Work.BestOfBoth CL=WALD plots=all ; class BLOCK Genotype location; model Yield = Genotype / ddfm=Satterthwaite; random int block/ subject=location; *Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */ lsmeans Genotype / ; ods output lsmeans=Yieldmodel; run; proc mixed data=Work.BestOfBoth CL=WALD plots=all ; class BLOCK Genotype location; model Protein = Genotype / ddfm=Satterthwaite; random int block/ subject=location; *Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */ lsmeans Genotype / ; ods output lsmeans=Proteinmodel; run; /*Clean up the output data sets*/ data work.residuals; set work.residuals; resid=Estimate; residlow=lower; residup=upper; drop effect stderr df tvalue probt alpha estimate lower upper; run; data work.Yieldmodel; set work.Yieldmodel; Yield=Estimate; drop effect stderr df tvalue probt alpha estimate; run; data work.Proteinmodel; set work.Proteinmodel; Protein=Estimate; drop effect stderr df tvalue probt alpha estimate; run; proc sql; create table joined as select a.*, b.Protein, c.Yield from residuals a left join ProteinModel b on a.Genotype=b.Genotype left join Yieldmodel c on a.Genotype=c.genotype; quit;
We would select the highest values, right? Not quite. Because we’re interested in those with the highest protein given acceptable yield, we’ll set a floor for acceptable yield (say 625 yield units in this example) and a floor of protein. Then, we’ll rank the genotypes descending to shortlist the best.
One choice is whether to rank the residual itself, or the lower confidence interval. Because breeding designs are often unbalanced, I prefer to rank based on the lower confidence interval of the residual. This provides a stiffer litmus test for genotypes tested once or twice versus those tested a lot more.
proc sort data=joined; by descending residlow; run; proc rank data=WORK.joined ties=mean descending out=work.selections; var residlow; /*another strategy is to rank by residual estimate. Ranking by the lower confidence interval is a more conservative approach*/ ranks Selections; run; data selections; set selections; if yield>625 and protein>13 and Selections <21 then SelectLines='Selected'; else SelectLines='Not Selected'; run; proc print data=work.selections(obs=20); title "Lines optimizing protein with acceptable yield"; run; ods graphics / reset width=6.4in height=4.8in imagemap; proc sgplot data=WORK.SELECTIONS; scatter x=Yield y=Protein / group=SelectLines; xaxis grid; yaxis grid; run; ods graphics / reset;
And voila! We have selected lines that best break the protein drag on yield, that are also within an commercially acceptable range of yield and protein.
One thing that jumps out from the graph is that a one selected line is not better in terms of yield or protein than a close counterpart not selected. In some instances, this is because the selections were made on the lower confidence limit of the residual, while the plot shows the estimates.
Another factor is that the residuals were calculated from a regression prior to mixed modeling. The more complex field studies rich in covariates would “bake out” the error that was included in the calculation of the residual from the regression. In that case, reversal of step 2b with Step 1 could lead to more valid selections.
There are many methods in the arsenal to simultaneously deal with multiple response variables. Residuals are a convenient summary value to use as a response to select the “best of both worlds.”
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.