Simultaneous selection for response variables

1 Like

What are some ways you can model multiple response variables?

Models almost always have one ‘target,’ ‘response’ or ‘dependent’ variable. Exceptions include multivariate multiple regression and partial least squares (PLS) regression.

Modeling two response variables simultaneously otherwise require sequential steps. One way to do this is to derive a covariance matrix for the first response variable, then use that in a generalized linear model for the second (see this breeding example). A simpler way is to derive a summary value somehow combining two responses, then model that derivative.

Various summary values can include:

Weighted means
Principal component eigenvalues
Residuals of a regression (the subject of this piece)

Breeding – break the yield drag by selecting two phenotypes with a tradeoff

Here I present an example to select for crop genotypes with higher protein based on “higher than expected protein based on yield.” These data represent designed field trials where crop protein and yield were measured as responses. The breeding goal is to select high protein lines that are also high yielding (there tends to be a protein versus yield among grain and pulse crops). In this case, genotypes were trialled at multiple locations, and therefore a ‘location’ term shows up as a covariate.

Similar methods for modeling “Grain Protein Deviation, (GPD)” have been used among breeders as described by Hänsel (2001) and Monaghan et al. (2001). The essence of the method is to first derive residuals for protein by yield, then use that residual variable as our response variable in a generalized linear model.

Step 1: Derive run a linear regression to derive residuals

/* The field trial location is an important stratification, so we first sort the table by location, then, for the regression step, include location in the “by” statement. */
proc sort data=WORK.IMPORT out=Work.Sorted;
	by Location;
run;
/* Run a linear regression for protein (y) on the basis of yield (x). Each field plot (observation) represents a data point.*/
proc reg data=Work.Sorted plots(only)=(diagnostics residuals fitplot observedbypredicted);
	model Protein =Yield;
	output out=work.BestOfBoth p=Predicted student=Surprises; 
/*The goal of this output is to derive the studentized residual. I called the data point “surprises” since we seek lines of surprisingly high protein and yield.*/
	by Location;
/*Including the by statement for location is the first step to account for structural variation in the study. If location were not included, the most extreme locations would overly contribute to the residual variation, which is undesirable. One could also include other structural categories like region or block at this step. */
run;

These residuals represent the distance from the fit line of yield and protein. Positive numbers represent “good” genotypes (high protein with respect to yield), negatives represent “bad” genotypes and 0 represents average.

Prior to running a complex mixed model, this would be an opportune time to review leveraged outliers. Full panels are included to facilitate data exploration.

Using residuals as the response variable has the additional advantage of centering the data, making it more likely to abide by ANOVA assumptions. Heteroskedasticity or non-normality problems can me remediated by transforming the input variables or changing the regression strategy (e.g. quadratic or cubic models).

Step 2a: Model those residuals

The next step is a standard Linear Mixed Model (LMM). Here we derive estimates for each genotype overall. Random factors of block within location, and its intercept, are included. Field studies often have multiple, hierarchical covariates; an example for spatial covariance is included as a comment.

proc mixed data=Work.BestOfBoth CL=WALD plots=all ;
	class BLOCK Genotype location;
	model Surprises =  Genotype / ddfm=Satterthwaite;
    random int block/ subject=location; 
	*Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */
    lsmeans Genotype / cl;
    ods output lsmeans=Residuals; 
run;
/*We derived genotype estimates through LSMeans. The model could also re-tooled for genomic selection. In that case, genotype would be named a random factor and its EBLUPs could be generated in concert with genomic relationship parameters.*/

LSMeans gave us our genotype estimates for the residuals (high protein and yield) given covariates of block and location.

Step 2b: run mixed models for the two original response variables, and merge those into one table of LSMeans.

Note the mixed models are the same, except for the response variable and we did not request confidence limits on the LSMeans statement.

proc mixed data=Work.BestOfBoth CL=WALD plots=all ;
	class BLOCK Genotype location;
	model Yield =  Genotype / ddfm=Satterthwaite;
    random int block/ subject=location; 
	*Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */
    lsmeans Genotype / ;
    ods output lsmeans=Yieldmodel; 
run;

proc mixed data=Work.BestOfBoth CL=WALD plots=all ;
	class BLOCK Genotype location;
	model Protein =  Genotype / ddfm=Satterthwaite;
    random int block/ subject=location; 
	*Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */
    lsmeans Genotype / ;
    ods output lsmeans=Proteinmodel; 
run;
/*Clean up the output data sets*/
data work.residuals;
	set work.residuals;
	resid=Estimate;
	residlow=lower;
	residup=upper;
	drop effect stderr df tvalue probt alpha estimate lower upper;
run;
data work.Yieldmodel;
	set work.Yieldmodel;
	Yield=Estimate;
	drop effect stderr df tvalue probt alpha estimate;
run;
data work.Proteinmodel;
	set work.Proteinmodel;
	Protein=Estimate;
	drop effect stderr df tvalue probt alpha estimate;
run;
proc sql;
    create table joined as 
    select a.*, b.Protein, c.Yield
    from residuals a 
        left join ProteinModel b 
        on a.Genotype=b.Genotype
        left join  Yieldmodel c
        on a.Genotype=c.genotype;
quit;

Step 3: Grab the best of the best

We would select the highest values, right? Not quite. Because we’re interested in those with the highest protein given acceptable yield, we’ll set a floor for acceptable yield (say 625 yield units in this example) and a floor of protein. Then, we’ll rank the genotypes descending to shortlist the best.

One choice is whether to rank the residual itself, or the lower confidence interval. Because breeding designs are often unbalanced, I prefer to rank based on the lower confidence interval of the residual. This provides a stiffer litmus test for genotypes tested once or twice versus those tested a lot more.

proc sort data=joined;
	by descending residlow;
run;
proc rank data=WORK.joined ties=mean descending out=work.selections;
	var residlow; /*another strategy is to rank by residual estimate. Ranking by the lower confidence interval is a more conservative approach*/
	ranks Selections;
run;
data selections;
	set selections;
	if yield>625 and protein>13 and  Selections <21 then SelectLines='Selected'; else SelectLines='Not Selected';
run;

proc print data=work.selections(obs=20);
	title "Lines optimizing protein with acceptable yield";
run;
ods graphics / reset width=6.4in height=4.8in imagemap;

proc sgplot data=WORK.SELECTIONS;
	scatter x=Yield y=Protein / group=SelectLines;
	xaxis grid;
	yaxis grid;
run;
ods graphics / reset;

And voila! We have selected lines that best break the protein drag on yield, that are also within an commercially acceptable range of yield and protein.

One thing that jumps out from the graph is that a one selected line is not better in terms of yield or protein than a close counterpart not selected. In some instances, this is because the selections were made on the lower confidence limit of the residual, while the plot shows the estimates.

Protein-yield selections.jpg Protein-yield selections table.jpg

Another factor is that the residuals were calculated from a regression prior to mixed modeling. The more complex field studies rich in covariates would “bake out” the error that was included in the calculation of the residual from the regression. In that case, reversal of step 2b with Step 1 could lead to more valid selections.

Conclusion

There are many methods in the arsenal to simultaneously deal with multiple response variables. Residuals are a convenient summary value to use as a response to select the “best of both worlds.”

PaigeMiller · ‎08-21-2021

Is there a paper that explains the underlying concept?

jozgot · ‎08-21-2021

Hi PaigeMiller

Thanks for your engagement! Hänsel (2001) and Monaghan et al. (2001) take it on directly. Michel et al. (2019) reviews the concept alongside other approaches.

Best regards,

John

jackyjoy123 · ‎08-25-2021

thanks for the awesome information.

https://krogerfeedback.nl https://talktosonic.onl https://talktowendys.vip https://whataburgersurvey.onl

PaigeMiller · ‎08-25-2021

Hi, @jozgot , thanks, I read the paper. As I have never worked in biological sciences, I had never come across these methods before. I did work in chemical manufacturing, where often there were two response variables that were negatively correlated, and yet the "goal" was to make both responses as large as possible, which is similar to the problem you discuss. In that case, we relied on an optimization (which is specifically available in JMP) to obtain the "best" result across the variables. Do you have any thoughts on how the method you discuss compares to the optimization in JMP?