Hi! I have a data set with more variables than observations. I would like to compute "Mahalanobis" distance between all pairs of points in the data, and between arbitrary new points from the "same population" (not just between the individual points and the location center estimator). Can the MCD call do this? Is there another way?
When n > p it is easy to do this for standard Mahalanobis distance in SAS/STAT using PROC PRINCOMP.
Thanks!
http://blogs.sas.com/content/iml/2012/02/22/how-to-compute-mahalanobis-distance-in-sas.html
It seems like you can't do it with the standard functions, which means rolling your own. The post above shows how to do it for the classical Mahalanobis distance, so it may be helpful in designing your own function; it explicitly covers your situation regarding distance between points, but not the case of p > N.
PS. I don't see why this wouldn't be mathematically possible, but I question whether it's a good idea when most procedures refuse to do it for a specific reason.
SAS/IML supports the MAHALANOBIS function, which computes the Mahalanobis distance (MD) for a set of multivariate data.
The problem you will encounter is that when p > N, the sample covariance matrix is going to be singular because the data span at most an N-dimensional subspace of p-dimensional space. You can see this by running the following program, which simulates N=5 observations in a 10-dimensional space and prints the Cholesky decomposition:
proc iml;
call randseed(12345);
p = 10;                         /* number of variables */
N = 5;                          /* number of observations, N < p */
mu = j(1, p, 0);                /* population mean */
Sigma = I(p);                   /* population covariance */
X = randnormal(N, mu, Sigma);   /* simulate MVN with p > N */
/* show that the sample covariance matrix is singular */
S = cov(X);                     /* note: naming a matrix COV would hide the COV function */
G = root(S);                    /* Cholesky decomposition */
reset fuzz;                     /* print tiny values as zero */
print G;
IF YOU KNOW THE POPULATION PARAMETERS, then you can solve the problem, because then you don't need to estimate the covariance from the sample. The following statement uses the population mean and covariance to compute the Mahalanobis distance of each observation in the sample from the population mean:
md = mahalanobis(X, mu, Sigma);
print md;
In simulation studies, you know the population parameters, but in real life you usually don't. It doesn't matter whether you use a classical estimator or a robust estimator (like MCD); the estimate of a general covariance matrix with p > N will be singular.
Hi Rick_SAS! Thank you for responding. Can MCD do anything meaningful for p > n? All determinants are going to be zero anyway. I thought maybe it would do some shrinkage with a small diagonal. I suppose I could do that manually, perhaps create a "Cauchy sequence" of smaller and smaller diagonals, compute Mahalanobis and / or robust distances at each value and see whether they converge or blow up.
If I have a full-rank covariance matrix, with eigenvalues t[j] and normalized eigenvectors u[j], 1 <= j <= p, then for any vectors x and y in my data set, I can express (x-y) as a linear combination of the eigenvectors, x-y = c[1]*u[1] + c[2]*u[2] + ... + c[p]*u[p]. Then the square of the Euclidean distance between x and y is (c[1]**2 + c[2]**2 + ... + c[p]**2), and the square of the Mahalanobis distance is (c[1]**2/t[1] + c[2]**2/t[2] + ... + c[p]**2/t[p]).

For p > n, the covariance matrix is not full-rank, but suppose it has maximum rank n. Then the n data vectors lie in an n-dimensional subspace, so I should be able to produce something like a Mahalanobis distance in this reduced subspace; after all, I can always compute Euclidean distance, and Mahalanobis just rescales the Euclidean axes. In particular, if the n-dimensional subspace is spanned by the covariance eigenvectors with positive eigenvalues, I could just use the Mahalanobis formula with the non-zero eigenvalues. (It doesn't seem trivial to prove or disprove this conjecture about the eigenvectors, although I would think that the question was settled long ago.)
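For instance, here is a quick IML sketch of that formula for two points (hypothetical and untested; the 1e-12 cutoff for deciding which eigenvalues are "nonzero" is an arbitrary tolerance I picked):

proc iml;
call randseed(12345);
p = 10;  n = 5;
X = randnormal(n, j(1,p,0), I(p));   /* simulated data with p > n */
S = cov(X);
call eigen(t, U, S);                 /* eigenvalues t (descending), eigenvectors in columns of U */
idx = loc(t > 1e-12);                /* indices of the nonzero eigenvalues */
c = (X[1,] - X[2,]) * U[, idx];      /* coefficients c[j] in the eigenbasis */
md2 = sum( c##2 / rowvec(t[idx]) );  /* squared "Mahalanobis" distance, nonzero eigenvalues only */
print md2;
quit;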
You seem to have a solid understanding of linear algebra. You recognize that if p > n then there are at most n-1 eigenvectors with nonzero eigenvalues, which I will call the eigenbasis. Here's what I would suggest:
1. Change coordinates. Represent the data in terms of the projection onto the span of the eigenbasis. In this coordinate system, each observation has n-1 coordinates. [Statisticians call these coordinates the "principal coordinate scores."]
2. Do a regular Mahalanobis distance analysis in this subspace. Or, if there are outliers in this subspace, use MCD to estimate a robust covariance matrix within the subspace.
See "Rediscovering SAS/IML Software" (Wicklin, 2010, p. 7-10) for SAS/IML programs that show how to conduct robust outlier detection and robust PCA. Your situation is similar, but you'll first project onto the eigenspace and then do the subsequent analysis in that coordinate system.
@Rick_SAS wrote:
You seem to have a solid understanding of linear algebra. You recognize that if p > n then there are at most n-1 eigenvectors with nonzero eigenvalues, which I will call the eigenbasis. Here's what I would suggest:
1. Change coordinates. Represent the data in terms of the projection onto the span of the eigenbasis. In this coordinate system, each observation has n-1 coordinates. [Statisticians call these coordinates the "principal coordinate scores."]
2. Do a regular Mahalanobis distance analysis in this subspace. Or, if there are outliers in this subspace, use MCD to estimate a robust covariance matrix within the subspace.
See "Rediscovering SAS/IML Software" (Wicklin, 2010, p. 7-10) for SAS/IML programs that show how to conduct robust outlier detection and robust PCA. Your situation is similar, but you'll first project onto the eigenspace and then do the subsequent analysis in that coordinate system.
With all due respect, @Rick_SAS, I think my solution above is a lot easier.
Thanks for the quick response, Rick_SAS! Ha ha, I'm not so sure how great my command of linear algebra is; I forgot about the degree of freedom eaten by the sample mean, which limits the maximum number of linearly independent covariance eigenvectors to n-1.
It would be nice if the Mahalanobis function had an option to do the projection coordinate change automatically.
Instead of changing the coordinates, could the same effect be achieved by using a Moore-Penrose generalized inverse of the covariance matrix in the Mahalanobis formula? It would be easy to do in IML, but I am not quite confident enough to say that this would be equivalent to the approach you are describing.
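Something like this sketch is what I have in mind (untested; it uses the GINV function on simulated data like your example, and I haven't verified that it matches the subspace approach):

proc iml;
call randseed(12345);
p = 10;  n = 5;
X = randnormal(n, j(1,p,0), I(p));
S = cov(X);
Sinv = ginv(S);                      /* Moore-Penrose generalized inverse */
d = X[1,] - X[2,];
md2 = d * Sinv * d`;                 /* squared generalized MD between points 1 and 2 */
print md2;
quit;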
It makes sense to me that the Moore-Penrose generalized inverse (or indeed any generalized inverse) ought to work in this situation ... however ... full disclaimer ... I have not tried to prove this is the case.
In essence, the PROC PLS solution I presented above is computing a generalized inverse.
I'm not sure why you say PROC PRINCOMP can't do this when p>n; it should work unless there's some restriction that I don't know about (mathematically it is possible). But I have never tried.
I do know that you can compute the T-squared (Mahalanobis distance) from the data in the case where p>n using PROC PLS, and the OUTPUT statement of PROC PLS gives you the option to request TSQUARE and have it computed and stored in the output data set.
To obtain a principal components analysis and the T-squared values from PROC PLS, you need to make the X variables identical to the Y variables, as in the example below. The NFAC= option should be set to the value of n; so if you have 50 data points and 100 variables, it would look like this (naturally, you would make the obvious change if n were some other number). I have typed only 5 variables in the MODEL statement to save myself some typing; naturally, you would list all p of your variables there.
proc pls data=mydata nfac=50;
    model x1 x2 x3 x4 x5 = x1 x2 x3 x4 x5;
    output out=pls_out tsquare=tsq;
run;
This seems a whole lot simpler than using PROC IML or MCD to do this.
Back to the original problem, which maybe I (and others?) have misunderstood.
@Top_Katz wrote:
Hi! I have a data set with more variables than observations. I would like to compute "Mahalanobis" distance between all pairs of points in the data, and between arbitrary new points from the "same population" (not just between the individual points and the location center estimator). Can the MCD call do this? Is there another way?
When n > p it is easy to do this for standard Mahalanobis distance in SAS/STAT using PROC PRINCOMP.
Thanks!
The idea of computing a Mahalanobis distance between pairs of points is different from the ordinary Mahalanobis distance, which is the weighted distance of a data point from the center of the data (the weights are the inverses of the square roots of the eigenvalues). PROC PRINCOMP will compute the eigenvalues and eigenvectors even in the case where p>n.
So if you have the eigenvalues and eigenvectors, you can certainly write your own DATA step or IML function to compute these distances between pairs of data points (or between an individual data point and the center of the data). Even in the case of new values that were not in the original data set, the same formula applies; you don't have to re-write the formula in this case.
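For example, a sketch of such a function in IML might look like this (hypothetical and untested; the 1e-12 eigenvalue cutoff is an arbitrary tolerance):

proc iml;
/* squared distances between all pairs of rows of X,
   using only the nonzero eigenvalues of the covariance matrix */
start PairwiseMD2(X);
   S = cov(X);
   call eigen(t, U, S);
   idx = loc(t > 1e-12);             /* nonzero eigenvalues */
   V = U[, idx];
   tk = rowvec( t[idx] );
   n = nrow(X);
   D = j(n, n, 0);
   do i = 1 to n-1;
      do j = i+1 to n;
         c = (X[i,] - X[j,]) * V;    /* coefficients in the eigenbasis */
         D[i,j] = sum( c##2 / tk );
         D[j,i] = D[i,j];
      end;
   end;
   return(D);
finish;

/* example usage on simulated data */
call randseed(12345);
X = randnormal(5, j(1,10,0), I(10));
D = PairwiseMD2(X);
print D;
quit;

For a new point from the same population, you would apply the same eigenvectors and eigenvalues (computed from the original data) to the difference between the new point and each data point, rather than re-estimating the covariance.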
I think this is a good time to ask what you are trying to accomplish, @Top_Katz. What do you want to do with the MD between points? Are you trying to detect outliers? And what do you hope to learn by computing the MD between points in one data set and points in another?
IMHO, the MD is undefined when the covariance matrix is singular. If you restrict your attention to coordinates in the (n-1)-dimensional subspace spanned by the eigenvectors with nonzero eigenvalues, then the MD does make sense in that subspace. The distance you get should be equivalent (or nearly so) to using a generalized inverse, but it is expensive to compute a generalized inverse in p dimensions when you can use the SOLVE function in (n-1) dimensions.
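For instance, a sketch of the SOLVE-based computation on the subspace coordinates (reusing the simulated data and the projection from earlier; the 1e-12 cutoff is an arbitrary tolerance):

proc iml;
call randseed(12345);
X = randnormal(5, j(1,10,0), I(10));
S = cov(X);
call eigen(t, U, S);
k = ncol( loc(t > 1e-12) );
Z = ( X - repeat(mean(X), 5, 1) ) * U[, 1:k];   /* coordinates in the (n-1)-dim subspace */
d = Z[1,] - Z[2,];
covZ = cov(Z);
md2 = d * solve(covZ, d`);           /* solve a k x k linear system; no p x p inverse */
print md2;
quit;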
Regarding the suitability of PLS, I will wait to see what the OP is trying to accomplish. However, I will mention that the T-square statistic is the MD from each point to the MEAN (center) of the data, not the MD between points. Also, I think scoring the PLS model on new data will be problematic when the model is
MODEL x1-x5 = x1-x5;
because the usual way to score a PLS model on new data is to use "the missing value trick for responses," which is not feasible for this model.
Let's hear what the OP wants to accomplish.
@Rick_SAS wrote:
Also, I think scoring the PLS model on new data will be problematic when the model is
MODEL x1-x5 = x1-x5;
because the usual way to score a PLS model on new data is to use "the missing value trick for responses," which is not feasible for this model.
Not true, @Rick_SAS.
If you want to use this missing value trick on new observations, there is a way to do it. For the existing observations on which you build the PLS model, create a second set of variables on the Y side of the equation, x1a x2a x3a x4a x5a, which are identical to the variables on the X side of the equation, x1 x2 x3 x4 x5. New observations have the Y variables x1a x2a x3a x4a x5a set to missing, while x1 x2 x3 x4 x5 of course take their actual values, and the scoring of the new observations is done as expected.
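For example, the setup might look something like this (a sketch; MYDATA holds the model-building observations as in my earlier code, and NEWDATA is a hypothetical data set of new observations to score):

data build;
   set mydata;
   x1a = x1; x2a = x2; x3a = x3; x4a = x4; x5a = x5;  /* copy the X's to the Y side */
run;

data both;
   set build newdata;   /* new observations have x1a-x5a missing */
run;

proc pls data=both nfac=50;
   model x1a x2a x3a x4a x5a = x1 x2 x3 x4 x5;
   output out=pls_out tsquare=tsq;
run;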
Hi! Yes, I definitely want pairwise distances between data points, not just distances from each point to the center, so if that's all the PROC PLS TSQUARE statistic yields, it won't be adequate for my purposes. It may be that I can use PROC PRINCOMP the usual way, even with the STD option; I just thought the STD option might not work in this case because of the zero eigenvalues. I don't have the data yet, so I haven't had the opportunity to test things out.

This is for a matched-pair experimental design, where the number of possible test / control units to choose from is somewhat modest, so the designers wanted to use a large variety of traits and match pairs by minimizing the distances between units based on these (standardized) traits. It occurred to me that the traits might be very highly correlated and yield a somewhat biased distance metric, so I proposed testing an alternate metric. (Of course, we understand that another possibility is to do some sort of variable selection / reduction winnowing process, so there's no need to point that out, unless you have specific suggestions.) Thanks!