Solved: Graphs for PCA and RRR

catch18 · Posted 05-25-2020 08:54 PM

Hi,

Would you know of any graphical ways to present/compare components from principal component analysis and reduced rank regression?

Many thanks

Rick_SAS · Posted 06-02-2020 12:54 PM

The reason for the error is that the blog post analyzes a square correlation matrix, that is, the pairwise correlations among a single set of variables X1, X2, ..., Xp. Because you are using the VAR and WITH statements in PROC CORR, you are instead analyzing the correlations between pairs of variables from two different sets: X1, X2, X3, ... and Y1, Y2, Y3, ....

I've tried to quickly modify the program to work for the WITH statement, but I didn't test it. The data for the example are from the Getting Started example for PROC PLS. Hopefully, you can modify this example to suit your needs.

/* Getting Started Example: Predicting Biological Activity */

data Sample;
   input obsnam $ v1-v27 ls ha dt @@;
ETC ... DOWNLOAD THE EXAMPLE
;

/* Fitting a PLS Model */
%let YVars = ls ha dt;
proc pls data=sample  method=PCR varss details plot=corrload(nfac=3 trace=on);
   model &YVars = v1-v27;
   output out=pattern_morph1 xscore=scorePred yscore=scoreResp;
run;

proc corr data=pattern_morph1; /* pairwise correlation */
   var &YVars;
   with scorePred1-scorePred3;
   ods output PearsonCorr = Corr; /* write correlations, p-values, and sample sizes to data set */
run;

proc iml;
ColNames = {&YVars};
use Corr;
read all var "Variable" into RowNames;
read all var {&YVars} into mCorr; /* matrix of correlations */
ProbNames = "P"+ColNames; /* variables for p-values are named PX, PY, PZ, etc */
read all var (ProbNames) into mProb; /* matrix of p-values */
close Corr;

numCols = ncol(mCorr); /* number of VAR variables (columns) */
numRows = nrow(mCorr); /* number of WITH variables (rows) */
numPairs = numCols*numRows;
length = nleng(ColNames) + nleng(RowNames) + 5; /* max length of new ID variable */
PairNames = j(NumPairs, 1, BlankStr(length));
i = 1;
do row = 1 to numRows; /* construct the pairwise names in row-major order */
   do col = 1 to numCols;
      PairNames[i] = strip(ColNames[col]) + " vs. " + strip(RowNames[row]);
      i = i + 1;
   end;
end;
print PairNames;


Corr = mCorr[ 1:numPairs ];
Prob = mProb[ 1:numPairs ];
Significant = choose(Prob > 0.05, "No ", "Yes"); /* use alpha=0.05 signif level */

create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"};
append;
close;
quit;
 
proc sort data=CorrPairs;  by Corr;  run;

title "Pairwise Correlations";
proc sgplot data=CorrPairs;
hbar PairNames / response=Corr group=Significant;
refline 0 / axis=x;
yaxis discreteorder=data display=(nolabel) 
      labelattrs=(size=6pt) fitpolicy=none 
      offsetmin=0.012 offsetmax=0.012  /* half of 1/k, where k=number of categories */
      colorbands=even colorbandsattrs=(color=gray transparency=0.9);
xaxis grid display=(nolabel);
keylegend / position=topright location=inside across=1;
run;


Ksharp · Posted 05-26-2020 07:13 AM

Calling @Rick_SAS

catch18 · Posted 05-26-2020 02:55 PM

Thanks, @Ksharp.

PaigeMiller · Posted 05-26-2020 07:22 AM

I doubt you can compare Principal Components Analysis (PCA) with Reduced Rank Regression (RRR) as they are very different techniques and not comparable.

You could compare Principal Components Regression (PCR) with Reduced Rank Regression (RRR) on how well they predict the Y variable. Simply examine the fit statistics. Or plot the residuals of both on the same graph. Or you could (if you were using SAS Enterprise Miner) use a Compare Node and let EMiner do the calculations/plots for you.
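
As a rough illustration of the "plot the residuals of both on the same graph" suggestion, here is a minimal sketch. It is not from the original posts: the data set Sim, its variables, and the residual names rPCR1-rPCR2 and rRRR1-rRRR2 are hypothetical stand-ins, and NFAC=1 is an arbitrary choice.

```sas
/* Hypothetical simulated data: 3 correlated predictors, 2 responses */
data Sim;
   call streaminit(12345);
   do id = 1 to 100;
      x1 = rand("Normal");
      x2 = rand("Normal");
      x3 = x1 + 0.5*rand("Normal");
      y1 = x1 - x2 + rand("Normal");
      y2 = x2 + x3 + rand("Normal");
      output;
   end;
run;

/* Fit the same model by PCR and by RRR; save the response residuals */
proc pls data=Sim method=PCR nfac=1;
   model y1 y2 = x1-x3;
   output out=OutPCR yresidual=rPCR1-rPCR2;
run;

proc pls data=Sim method=RRR nfac=1;
   model y1 y2 = x1-x3;
   output out=OutRRR yresidual=rRRR1-rRRR2;
run;

/* Put the residuals for the first response side by side */
data Both;
   merge OutPCR(keep=id rPCR1) OutRRR(keep=id rRRR1);
   by id;
run;

title "Residuals for y1: PCR vs. RRR";
proc sgplot data=Both;
   scatter x=id y=rPCR1 / legendlabel="PCR";
   scatter x=id y=rRRR1 / legendlabel="RRR";
   refline 0 / axis=y;
   yaxis label="Residual";
run;
title;
```

The method whose residuals cluster more tightly around the zero line fits that response better in-sample; repeat with rPCR2 and rRRR2 for the second response.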

--
Paige Miller

catch18 · Posted 05-26-2020 03:07 PM

Thanks for your response, @PaigeMiller.

I completely agree that both methods are different, but I'm comparing their strengths in predicting the response variable/outcome of interest.

PaigeMiller · Posted 05-26-2020 03:25 PM

@catch18 wrote:

Thanks for your response, @PaigeMiller.

I completely agree that both methods are different, but I'm comparing their strengths in predicting the response variable/outcome of interest.

There is no response variable in PCA, so you can't compare PCA to predicting a response variable using other methods. That's why I mentioned PCR, which does have a response variable, and you ought to be considering it (and maybe that's what you meant all along).

As long as we're on the topic, you also ought to compare both of these to PLS. The benefit of PLS over both PCR and RRR is that PCR and RRR find reduced-dimensional spaces of the X matrix without regard to whether or not these are good predictors of the response, while PLS finds reduced-dimensional spaces of the X matrix that are good predictors of the response (if there are any). So if the issue is really predictability, PLS ought to do better than RRR or PCR.
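
One way to put the three methods side by side is to cross-validate each with the same settings and compare the root mean PRESS tables that PROC PLS prints. The sketch below is only an illustration: the macro name cvfit, the data set Sim, and the choice of CV=ONE with NFAC=2 are assumptions, not something from the thread.

```sas
/* Hypothetical simulated data: 4 predictors, 2 responses */
data Sim;
   call streaminit(2020);
   do id = 1 to 100;
      x1 = rand("Normal");
      x2 = rand("Normal");
      x3 = x1 + 0.5*rand("Normal");
      x4 = x2 + 0.5*rand("Normal");
      y1 = x1 - x2 + rand("Normal");
      y2 = x3 + x4 + rand("Normal");
      output;
   end;
run;

/* Hypothetical helper: same model, same cross validation, different method */
%macro cvfit(method);
   title "Leave-one-out cross validation: method=&method";
   proc pls data=Sim method=&method cv=one nfac=2;
      model y1 y2 = x1-x4;
   run;
%mend cvfit;

%cvfit(PCR)
%cvfit(RRR)
%cvfit(PLS)
title;
```

Comparing the root mean PRESS at the same number of factors gives a like-for-like view of predictive ability, which is a fairer comparison than in-sample fit statistics.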

--
Paige Miller

catch18 · Posted 05-26-2020 03:35 PM

Yes, I'm using components derived from PCR via PLS in the comparison.

Thanks.

catch18 · Posted 05-26-2020 03:51 PM

Yes, I have also run proc pls using the pls method. Sorry if my earlier statement was confusing.

Rick_SAS · Posted 05-26-2020 07:24 AM

Here's one place to start: "How to interpret graphs in a principal component analysis."

The Getting Started section of the SAS/STAT documentation for PROC PRINCOMP is another place. The PLS procedure in SAS/STAT supports many graphs. Again, see the Getting Started example in the doc.

catch18 · Posted 05-26-2020 03:14 PM

I now have many options to choose from, thank you, Rick!

I initially ran the RRR without splitting the sample and obtained 4 factors from 4 response variables.

However, after splitting the sample as follows:

proc pls data=morph method=RRR CV=split Cvtest(seed=12345)
         nfac=4 varss details;
   model exclbf PBf exclFf Sol = &xlist;
   output out=pattern_morph1 xscore=scorePred yscore=scoreResp;
run;

The number of factors chosen is now zero, based on the minimum root mean PRESS. Why would that be? My sample size is 350.

Thanks

catch18 · Posted 05-26-2020 05:41 PM

Also, I'm trying to use the correlation loading plot, but it appears my observations are not well spaced out and I can't see the labelled response variables within the plot to assess correlations. I only added plot=corrload(nfac=3) to the PROC PLS statement in the above code.

Many thanks.

catch18 · Posted 06-02-2020 08:47 AM

Alright, so I haven't found any solutions for my correlation loading plot yet. I have a lot of predictor variables (reason for using the dimensionality reduction technique) which mask my response variables in the plot.

I was wondering if I could use the output factors/components derived from each method, i.e. pcr and rrr to plot correlations with the Y response variables as horizontal bars in proc sgplot?

Many thanks.

Emma

Rick_SAS · Posted 06-02-2020 09:05 AM

If you don't have too many variables, you can order the pairwise correlations and use a bar chart to visualize, as shown in the article "Use a bar chart to visualize pairwise correlations."

The same article shows a heat map of the correlation matrix, which is an alternative visualization.
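
For a rectangular (WITH by VAR) correlation matrix, a heat map can be drawn by reshaping the PROC CORR output to long form. This is a minimal sketch rather than the article's code: it uses Sashelp.Cars as a stand-in data set, and the names Long, ColVar, and r are hypothetical.

```sas
/* Rectangular correlation matrix: WITH variables are rows, VAR variables are columns */
ods exclude all;
proc corr data=sashelp.cars;
   var MPG_City MPG_Highway;
   with Weight Horsepower EngineSize;
   ods output PearsonCorr=Corr;
run;
ods exclude none;

/* Reshape to one row per (row variable, column variable) pair */
data Long;
   set Corr;
   length ColVar $32;
   array c[*] MPG_City MPG_Highway;
   do j = 1 to dim(c);
      ColVar = vname(c[j]);   /* name of the VAR variable */
      r = c[j];               /* its correlation with the WITH variable */
      output;
   end;
   keep Variable ColVar r;
run;

title "Correlations: WITH Variables (Rows) by VAR Variables (Columns)";
proc sgplot data=Long;
   /* by default the color ramp spans the observed range of r, not [-1, 1] */
   heatmapparm x=ColVar y=Variable colorresponse=r / colormodel=(blue white red);
   xaxis display=(nolabel);
   yaxis display=(nolabel) reverse;
run;
title;
```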

I don't know exactly what you are doing, but some dimension-reduction plots project the response variable onto the span of the first two dimensions for the reduced space. (For example, the first two principal components.) You state: "I have a lot of predictor variables ... which mask my response variables in the plot." I don't know what you mean by "mask," but one interpretation is that the first two dimensions are not good predictors of the responses. If so, the projection of the responses will be near the origin and hard to see. In other words, the problem you are seeing might not be a visualization problem, it might be a statistical problem caused because your model does not fit the data.
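
To check numerically whether a response sits near the origin, one option is to correlate each response with the first two X scores and compute its distance from the origin. The sketch below assumes that the plotted position of a response is given by its correlations with the X scores; the simulated data, the score prefix t, and the variable Radius are hypothetical.

```sas
/* Hypothetical data: y1 depends on the two latent factors, y2 is pure noise */
data Sim;
   call streaminit(54321);
   do id = 1 to 200;
      f1 = rand("Normal");
      f2 = rand("Normal");
      x1 = f1 + 0.3*rand("Normal");
      x2 = f1 + 0.3*rand("Normal");
      x3 = f2 + 0.3*rand("Normal");
      x4 = f2 + 0.3*rand("Normal");
      y1 = f1 + f2 + 0.5*rand("Normal");
      y2 = rand("Normal");
      output;
   end;
run;

proc pls data=Sim method=PCR nfac=2;
   model y1 y2 = x1-x4;
   output out=Scores xscore=t;   /* X scores t1, t2 */
run;

/* Correlations of each response (rows) with each X score (columns) */
ods exclude all;
proc corr data=Scores;
   var t1 t2;
   with y1 y2;
   ods output PearsonCorr=R;
run;
ods exclude none;

data Radius;
   set R;
   Radius = sqrt(t1**2 + t2**2);   /* distance from the origin */
   keep Variable t1 t2 Radius;
run;

proc print data=Radius noobs;
run;
```

In this setup y1 should land well away from the origin and y2 close to it. A small Radius for a real response suggests that the first two dimensions explain little of its variation, which is the statistical problem described above rather than a plotting problem.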

catch18 · Posted 06-02-2020 10:19 AM

Wow, I think your latter comments could be the explanation. However, before I ask further questions on that, I tried to do the pairwise correlations as per your link.

This was my original model:

ODS graphics on;

proc pls data=work method=RRR varss details plot=corrload(nfac=3 trace=on);
   model exclbf PBf6 exclFf Sol = &xlist;
   output out=pattern_morph1(rename=(scorePred1=Health scorePred2=Junk scorePred3=Mixed)) xscore=scorePred yscore=scoreResp;
run;

ods exclude all;
proc corr data=pattern_morph1; /* pairwise correlation */
   var Health Junk Mixed;
   with scoreResp1 scoreResp2 scoreResp3;
   ods output PearsonCorr = Corr; /* write correlations, p-values, and sample sizes to data set */
run;
ods exclude none;

numCols = ncol(mCorr); /* number of variables */
numPairs = numCols*(numCols-1) / 2;
length = 2*nleng(ColNames) + 5; /* max length of new ID variable */
PairNames = j(NumPairs, 1, BlankStr(length));
i = 1;
do row= 2 to numCols; /* construct the pairwise names */
do col = 1 to row-1;
PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]);
i = i + 1;
end;
lowerIdx = loc(row(mCorr) > col(mCorr)); /* indices of lower-triangular elements */
Corr = mCorr[ lowerIdx ];
Prob = mProb[ lowerIdx ];
Significant = choose(Prob > 0.05, "No ", "Yes"); /* use alpha=0.05 signif level */

Unlike the correlation loading plot, I'm not sure if it's best to use the original response variables or the Y scores from modelling?

create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"};
append;
close;
QUIT;

Rick_SAS · Posted 06-02-2020 10:28 AM

I don't see a question in your latest post, but your program is missing the PROC IML statement and the lines that read the output from PROC CORR.
