Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- SAS Communities Library
- /
- Simultaneous selection for response variables

Options

- RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content

- Article History
- RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content

Views
4,367

Models almost always have one ‘target,’ ‘response’ or ‘dependent’ variable. Exceptions include multivariate multiple regression and partial least squares (PLS) regression.

Modeling two response variables simultaneously otherwise require sequential steps. One way to do this is to derive a covariance matrix for the first response variable, then use that in a generalized linear model for the second (see this breeding example). A simpler way is to derive a summary value somehow combining two responses, then model that derivative.

Various summary values can include:

- Weighted means
- Principal component eigenvalues
- Residuals of a regression (the subject of this piece)

Here I present an example to select for crop genotypes with higher protein based on “higher than expected protein based on yield.” These data represent designed field trials where crop protein and yield were measured as responses. The breeding goal is to select high protein lines that are also high yielding (there tends to be a protein versus yield among grain and pulse crops). In this case, genotypes were trialled at multiple locations, and therefore a ‘location’ term shows up as a covariate.

Similar methods for modeling “Grain Protein Deviation, (GPD)” have been used among breeders as described by Hänsel (2001) and Monaghan et al. (2001). The essence of the method is to first derive residuals for protein by yield, then use that residual variable as our response variable in a generalized linear model.

```
/* The field trial location is an important stratification, so we first sort the table by location, then, for the regression step, include location in the “by” statement. */
proc sort data=WORK.IMPORT out=Work.Sorted;
by Location;
run;
/* Run a linear regression for protein (y) on the basis of yield (x). Each field plot (observation) represents a data point.*/
proc reg data=Work.Sorted plots(only)=(diagnostics residuals fitplot observedbypredicted);
model Protein =Yield;
output out=work.BestOfBoth p=Predicted student=Surprises;
/*The goal of this output is to derive the studentized residual. I called the data point “surprises” since we seek lines of surprisingly high protein and yield.*/
by Location;
/*Including the by statement for location is the first step to account for structural variation in the study. If location were not included, the most extreme locations would overly contribute to the residual variation, which is undesirable. One could also include other structural categories like region or block at this step. */
run;
```

These residuals represent the distance from the fit line of yield and protein. Positive numbers represent “good” genotypes (high protein with respect to yield), negatives represent “bad” genotypes and 0 represents average.

Prior to running a complex mixed model, this would be an opportune time to review leveraged outliers. Full panels are included to facilitate data exploration.

Using residuals as the response variable has the additional advantage of centering the data, making it more likely to abide by ANOVA assumptions. Heteroskedasticity or non-normality problems can me remediated by transforming the input variables or changing the regression strategy (e.g. quadratic or cubic models).

The next step is a standard Linear Mixed Model (LMM). Here we derive estimates for each genotype overall. Random factors of block within location, and its intercept, are included. Field studies often have multiple, hierarchical covariates; an example for spatial covariance is included as a comment.

```
proc mixed data=Work.BestOfBoth CL=WALD plots=all ;
class BLOCK Genotype location;
model Surprises = Genotype / ddfm=Satterthwaite;
random int block/ subject=location;
*Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */
lsmeans Genotype / cl;
ods output lsmeans=Residuals;
run;
/*We derived genotype estimates through LSMeans. The model could also re-tooled for genomic selection. In that case, genotype would be named a random factor and its EBLUPs could be generated in concert with genomic relationship parameters.*/
```

LSMeans gave us our genotype estimates for the residuals (high protein and yield) given covariates of block and location.

Note the mixed models are the same, except for the response variable and we did not request confidence limits on the LSMeans statement.

```
proc mixed data=Work.BestOfBoth CL=WALD plots=all ;
class BLOCK Genotype location;
model Yield = Genotype / ddfm=Satterthwaite;
random int block/ subject=location;
*Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */
lsmeans Genotype / ;
ods output lsmeans=Yieldmodel;
run;
proc mixed data=Work.BestOfBoth CL=WALD plots=all ;
class BLOCK Genotype location;
model Protein = Genotype / ddfm=Satterthwaite;
random int block/ subject=location;
*Repeated / Subject=intercept Type=SP(Exp) (row col); /*for studies with a high genotype to replicate ratio, we would account also want to account for spatial variation – plot row and column placements. */
lsmeans Genotype / ;
ods output lsmeans=Proteinmodel;
run;
/*Clean up the output data sets*/
data work.residuals;
set work.residuals;
resid=Estimate;
residlow=lower;
residup=upper;
drop effect stderr df tvalue probt alpha estimate lower upper;
run;
data work.Yieldmodel;
set work.Yieldmodel;
Yield=Estimate;
drop effect stderr df tvalue probt alpha estimate;
run;
data work.Proteinmodel;
set work.Proteinmodel;
Protein=Estimate;
drop effect stderr df tvalue probt alpha estimate;
run;
proc sql;
create table joined as
select a.*, b.Protein, c.Yield
from residuals a
left join ProteinModel b
on a.Genotype=b.Genotype
left join Yieldmodel c
on a.Genotype=c.genotype;
quit;
```

** **

We would select the highest values, right? Not quite. Because we’re interested in those with the highest protein given acceptable yield, we’ll set a floor for acceptable yield (say 625 yield units in this example) and a floor of protein. Then, we’ll rank the genotypes descending to shortlist the best.

One choice is whether to rank the residual itself, or the lower confidence interval. Because breeding designs are often unbalanced, I prefer to rank based on the lower confidence interval of the residual. This provides a stiffer litmus test for genotypes tested once or twice versus those tested a lot more.

```
proc sort data=joined;
by descending residlow;
run;
proc rank data=WORK.joined ties=mean descending out=work.selections;
var residlow; /*another strategy is to rank by residual estimate. Ranking by the lower confidence interval is a more conservative approach*/
ranks Selections;
run;
data selections;
set selections;
if yield>625 and protein>13 and Selections <21 then SelectLines='Selected'; else SelectLines='Not Selected';
run;
proc print data=work.selections(obs=20);
title "Lines optimizing protein with acceptable yield";
run;
ods graphics / reset width=6.4in height=4.8in imagemap;
proc sgplot data=WORK.SELECTIONS;
scatter x=Yield y=Protein / group=SelectLines;
xaxis grid;
yaxis grid;
run;
ods graphics / reset;
```

And voila! We have selected lines that best break the protein drag on yield, that are also within an commercially acceptable range of yield and protein.

One thing that jumps out from the graph is that a one selected line is not better in terms of yield or protein than a close counterpart not selected. In some instances, this is because the selections were made on the lower confidence limit of the residual, while the plot shows the estimates.

Another factor is that the residuals were calculated from a regression prior to mixed modeling. The more complex field studies rich in covariates would “bake out” the error that was included in the calculation of the residual from the regression. In that case, reversal of step 2b with Step 1 could lead to more valid selections.

There are many methods in the arsenal to simultaneously deal with multiple response variables. Residuals are a convenient summary value to use as a response to select the “best of both worlds.”

Comments

08-21-2021
11:32 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Inappropriate Content

08-21-2021
11:32 AM

Is there a paper that explains the underlying concept?

08-25-2021
07:00 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Inappropriate Content

08-25-2021
07:00 AM

thanks for the awesome information.

https://krogerfeedback.nl https://talktosonic.onl https://talktowendys.vip https://whataburgersurvey.onl

08-25-2021
07:20 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Inappropriate Content

08-25-2021
07:20 AM

Are you ready for the spotlight? We're accepting content ideas for **SAS Innovate 2025** to be held May 6-9 in Orlando, FL. The call is **open **until September 25. Read more here about **why** you should contribute and **what is in it** for you!

Data Literacy is for **all**, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.

Article Labels

Article Tags