Hi, I have a question about reduced rank regression (RRR); could you help me? Thanks a lot!
In PROC PLS, there are three method options: PLS (partial least squares), RRR (reduced rank regression), and PCR (principal components regression). Some manuscripts mention that PLS can be used when the sample size is small. PLS and RRR seem to share the same underlying theory. Is RRR also suitable for small samples? For example, n=150 with 25 independent variables (X) and 5 dependent variables (Y).
I don't know why they made that statement. There is no reference next to the assertion (Jankovic et al., 2014) that RRR requires a large sample size. In that section, they are making excuses for why they didn't get a significant result.
It is possible that they did an internet search for "small sample size RRR" and accidentally based that statement on Relative Risk Reduction, a different statistical quantity that shares the same three-letter acronym.
In all high-dimensional multivariate analyses, a "small sample size" depends on the number of variables. The more variables you have, the more observations you need. So N=1000 is a small sample size if your model includes 100 effects, but it is a large sample size if your model contains five effects.
I am not an expert on RRR and I defer to others who use PLS-based methods more often (such as @PaigeMiller ).
From a purely mathematical perspective, regression methods are ways of projecting response data onto a certain subspace of the regressors. The regression estimates are the values needed to express the projected responses as a linear combination of certain basis elements. As such, regression does not require a large sample size. For OLS, you only need as many observations as there are effects in the model.
Large samples become important when you want to make inferential statements about the precision of the estimates. Standard errors are defined regardless of the sample size, but hypothesis tests, p-values, and some confidence intervals are often known only asymptotically for large samples.
So my answer is that the sample size isn't important if you are using the model for predictions. It would be important if you want to test hypotheses or form confidence intervals for parameters.
These comments are not specific to RRR. They are general comments about how regression works.
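To see the OLS point concretely, here is a minimal sketch (the data are made up for illustration): with three observations and three effects (intercept, x1, x2), OLS fits the data exactly and leaves zero degrees of freedom for error.

data tiny;
   input x1 x2 y;
   datalines;
1 1 3
2 4 7
3 2 6
;

proc reg data=tiny;
   model y = x1 x2;   /* n = 3 observations, 3 effects: residuals are all zero */
run;
quit;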
Agreeing with @Rick_SAS and adding one point he did not mention.
An outlier may have a huge impact on the regression model when the sample size is small; with a much larger sample size, the same outlier has less impact. Similarly, if the X data are highly variable (even without outliers), that variability may have a huge impact on the regression model with a small sample size, but much less impact with a large sample size.
I suppose the above is true for all regression models, not just PLS or RRR.
So checking for outliers among the X variables is critical for small sample sizes. Fortunately, there are built-in methods to do this in PROC PLS: the T-squared statistic finds outliers weighted in the directions of the strong predictor variables, while the STDXSSE statistic finds outliers weighted in the directions of the weaker predictor variables. (And, just to be clear, you need to check for outliers with large sample sizes too, but the outliers will have smaller impact.)
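For reference, a minimal sketch of how those two statistics can be requested from the OUTPUT statement; the data set name (diet) and the variable names are placeholders that match the dimensions in the original question (25 X variables, 5 Y variables):

proc pls data=diet method=pls nfac=3;
   model y1-y5 = x1-x25;
   output out=plsout t2=tsq stdxsse=xsse;   /* T-squared and STDXSSE for each observation */
run;

/* Observations extreme on either statistic are candidate outliers */
proc sgplot data=plsout;
   scatter x=tsq y=xsse;
run;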
Lastly (I know this was not asked, but I mention it anyway): I prefer PLS to RRR for a number of reasons; if you want to talk about that, let me know.
To clarify Paige's statement, "outliers in the X variables" are typically called "high-leverage points" in the literature. Some researchers reserve the word "outliers" for extreme values in the response variable(s).
High-leverage points might or might not affect the regression. It depends on whether the response value at a high-leverage point is close to the prediction of the model without using the point.
For example, consider the simple regression with points
(1,1), (2,2), (3,3), (100, 101)
The observation with x=100 is a high-leverage point, but it is a "good" high-leverage point because y=101 is close to the value predicted when you fit the model to only the first three observations. In contrast, for the points
(1,1), (2,2), (3,3), (100, 200)
the observation with x=100 is a "bad" leverage point because its inclusion dramatically changes the regression line.
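To make this concrete, here is a small sketch that fits both sets of points (the group labels and data set name are mine, for illustration):

data leverage;
   input group $ x y;
   datalines;
bad    1   1
bad    2   2
bad    3   3
bad  100 200
good   1   1
good   2   2
good   3   3
good 100 101
;

proc reg data=leverage;
   by group;          /* data are already in sorted order by group */
   model y = x;       /* slope stays near 1 for 'good'; it roughly doubles for 'bad' */
run;
quit;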
Paige's observation is relevant for both leverage points and for outliers: Models fit to small samples can be heavily influenced by extreme values in Xs and in Ys.
For more information about outliers and leverage points in OLS models, see "Identify influential observations in regression models." The ROBUSTREG procedure can produce a diagnostic plot that identifies outliers and good/bad leverage points.
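The kind of call that produces that diagnostic plot looks like the sketch below; the data set name (mydata) and the variables are placeholders:

proc robustreg data=mydata method=mm plots=rdplot;
   model y = x1-x5 / diagnostics leverage;   /* RDPLOT: robust residuals vs. robust distances */
run;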
Terminology aside, the benefit of using T-squared and STDXSSE from PROC PLS is that they find not only univariate outliers/high-leverage points in X, but also multivariate ones. In other words, these points are not extreme when examined one variable at a time, but are extreme when all variables are considered together.
As far as I know, the methods in PROC REG and PROC ROBUSTREG for finding influential points find only univariate influential points, not multivariate ones.
Nevertheless, they do help to identify whether a point is a "good" leverage point or a "bad" leverage point.
> As far as I know, the methods in PROC REG and PROC ROBUSTREG for finding influential points find only univariate influential points, not multivariate ones.
I don't want to hijack this thread, but I think this is not correct.
Briefly, a high-leverage point is a point in X space that has a large Mahalanobis distance (MD) to the mean. The leverage statistic is closely related to the MD (as is the T^2 statistic). Both ROBUSTREG and REG identify multivariate leverage points, but REG can identify them only when including the influential observation does not mask its presence. You can use the RStudentByLeverage plot in PROC REG to visualize high-leverage points and outliers.
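For example (data set and variable names are placeholders):

proc reg data=mydata plots(only)=rstudentbyleverage;
   model y = x1-x5;   /* studentized residuals vs. leverage, with cutoff reference lines */
run;
quit;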
We should probably discuss these tangential issues at another time.
@KKAY wrote:
@PaigeMiller
Thank you very much! I want to use RRR in PROC PLS to build a dietary model. Would RRR be the better choice if I want to get better scores for the dependent variables? Am I right?
Also, I know that SAS gives an example of cross-validation with METHOD=PLS. Does it work when I use METHOD=RRR?
"Better" is not really a word I would use. You will get different scores from RRR and PLS; from some points of view RRR might be the one you want, while from other points of view PLS might be the one you want, but it's not clear what you are looking for.
If your goal is to determine Y scores ... I'll have to think about whether I prefer RRR or PLS. If your goal is to predict the response variable(s), I still prefer PLS. What is your goal when you create the model: to predict the Y-values, to rank the people based on Y-scores, or something else?
I believe cross-validation works for RRR, but I have never tried it. So I give you the task of seeing whether it works for RRR.
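As a starting point, here is a sketch that applies the cross-validation options from the documentation's PLS example with METHOD=RRR; the data set and variable names are placeholders:

proc pls data=diet method=rrr cv=split cvtest(stat=press seed=12345);
   model y1-y5 = x1-x25;   /* CVTEST compares models with different numbers of factors */
run;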
@KKAY wrote:
First, I want to use previous dietary data to build a dietary model. For example, the X variables are the intakes of 20 food groups, and the Y variables are the target nutrient intakes, energy density, and so on. After getting the X-loadings, I would apply them to new dietary data (with foods divided into the same 20 groups); you know the process: combine the X-loadings with the food-group intakes in the new data to get dietary factor scores. Then I would divide the dietary factor scores into quintiles and use the quintile groups as independent variables in regression models, with disease status as the outcome variable. We want to explore the relationship between the dietary pattern (the quintile group of the dietary factor scores) and the disease. (In some situations the relationship could also be used for prediction.)
The sentence I highlighted in red seems to be the answer to my question, and so I prefer PLS over RRR. I'll explain that in a minute, but in any modeling situation I would also say that there's no way to know which method is going to work better on a given problem unless you actually try both and compare the results.
PLS finds linear combinations of the X-variables that are predictive of linear combinations of the Y-variables. (That's a layman's explanation; the actual math is a little different, but I'm trying to put forth ideas, not math, right now.) RRR, as I understand it, finds those linear combinations of the X-variables based on the correlations in the Y-matrix, and so it doesn't address the problem of multicollinearity among the X-variables, which can result in misleading interpretations of the X-variables. For example, multicollinearity among the X-variables can cause the coefficient of a particular predictor to have the wrong sign (it is supposed to be positive, but because of multicollinearity it comes out negative). Even if the sign is correct, the coefficient can be badly inflated or deflated (making a good predictor appear statistically nonsignificant, and vice versa), because of the high variability caused by the multicollinearity.
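Here is a hypothetical simulation of that sign-flip/inflation problem (all names and numbers are made up):

data collinear;
   call streaminit(271828);
   do i = 1 to 30;
      z  = rand('normal');
      x1 = z + 0.05*rand('normal');   /* x1 and x2 are nearly identical */
      x2 = z + 0.05*rand('normal');
      y  = x1 + x2 + rand('normal');  /* both true coefficients are +1 */
      output;
   end;
run;

proc reg data=collinear;
   model y = x1 x2 / vif;   /* VIFs are huge; the two estimates are unstable,
                               and one can even come out negative */
run;
quit;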
@Rick_SAS wrote:
I don't know why they made that statement. There is no reference next to the assertion (Jankovic et al., 2014) that RRR requires a large sample size. In that section, they are making excuses for why they didn't get a significant result.
It is possible that they did an internet search for "small sample size RRR" and accidentally based that statement on Relative Risk Reduction, a different statistical quantity that shares the same three-letter acronym.
In all high-dimensional multivariate analyses, a "small sample size" depends on the number of variables. The more variables you have, the more observations you need. So N=1000 is a small sample size if your model includes 100 effects, but it is a large sample size if your model contains five effects.
Yes, in the manuscript (Jankovic et al., 2014) they do not give a reference for the statement that "RRR requires a reasonable sample size for an appropriate examination".
I saw some researchers discussing the topic "Small sample size in PLS - any thoughts?" in a post (https://www.researchgate.net/post/Small_sample_size_in_PLS-any_thoughts), which is why I have this question.
Both PLS and RRR can explain the relationship between X (X1 to Xn) and Y (Y1 to Yn'). For example, in RRR, X1 to X20 are set as continuous variables rather than categorical variables, and Y1 to Y6 are also continuous. So X1 to X20 will be explained by the dietary patterns (of which there are at most 6).
According to your suggestion, is the number of effects 20 or 26? And is it OK to follow Mansour Zarra-Nezhad, who gives a formula in the post above: "10 multiplied by the maximum number of indicators of a latent variable"?
When I try to follow your link, I get a "post unaccessible" error, so I can't read the discussion.
If Y1-Yk are the response variables and X1-Xp are the predictor variables, then there are k+p variables in the model, so the geometry is in (k+p)-dimensional space. It's a little complicated, however, because these methods use "dimension reduction techniques" to project the variables onto lower-dimensional linear subspaces. But if someone asked me "how many [original] effects," I would say 20+6=26.