Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 05-26-2017 04:33 PM
(2083 views)

Hi! I have a data set with more variables than observations. I would like to compute "Mahalanobis" distance between all pairs of points in the data, and between arbitrary new points from the "same population" (not just between the individual points and the location center estimator). Can the MCD call do this? Is there another way?

When n > p it is easy to do this for standard Mahalanobis distance in SAS/STAT using PROC PRINCOMP.

Thanks!

14 REPLIES 14

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

http://blogs.sas.com/content/iml/2012/02/22/how-to-compute-mahalanobis-distance-in-sas.html

It seems like you can't using the standard functions, which means rolling your own. The post above shows how to do that for the old one so may be helpful in designing your own function and explicitly covers your situtation regarding distance between points, but not the case of N>p.

PS. I don't see why this isn't mathematically possible, but I question if it's a good idea when most procedures can't do it for a specific reason.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

SAS/IML supports the MAHALANOBIS function, which computes the Mahalanobis distance (MD) for a set of multivariate data.

The problem you will encounter is that when p > N, the sample covariance matrix is going to be singular because the data spans at most an N-dimensional subspace of p-dimensional space. You can see this by running the following program, which simulates N=5 observations in a 10-dimensonal space and prints out the Cholesky decomposition:

```
proc iml;
call randseed(12345);
p = 10;
N = 5;
mean = j(1,p,0);
cov = I(p);
X = randnormal(N, mean, cov); /* simulate MVN with p > N */
/* show that covariance matrix is singular */
S = cov(X);
G = root(S); /* Cholesky decomposition */
reset fuzz;
print G;
```

IF YOU KNOW THE POPULATION PARAMETERS, then you can solve the problem because then you don't need to estimate the covariance from the sample. The following uses the population mean and covariance to compute the Mahalanobis distances between observations in the sample:

```
md = mahalanobis(X, mean, cov);
print md;
```

In simulation studies, you know the population parameters, but in real life you usually don't. It doesn't matter whether you use a classical estimator or a robust estimator (like MCD), the estimates of a general covariance matrix with p > N will be singular.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi Rick_SAS! Thank you for responding. Can MCD do anything meaningful for p > n? All determinants are going to be zero anyway. I thought maybe it would do some shrinkage with a small diagonal. I suppose I could do that manually, perhaps create a "Cauchy sequence" of smaller and smaller diagonals, compute Mahalanobis and / or robust distances at each value and see whether they converge or blow up.

If I have a full-rank covariance matrix, with eigenvalues t[j] and normalized eigenvectors u[j], 1 <= j <= p, then for any vectors x and y in my data set, I can express (x-y) as a linear combination of the eigenvectors, x-y = c[1]*u[1] + c[2]*u[2] + ... + c[p]*u[p]. Then the square of the Euclidean distance between x and y is (c[1]**2 + c[2]**2 + ... + c[p]**2), and the square of the Mahalanobis distance is (c[1]**2/t[1] + c[2]**2/t[2] + ... + c[p]**2/t[p]). For p > n, the covariance matrix is not full-rank, but suppose it has maximum rank n. Then the n data vectors are part of an n-dimensional subspace, I should be able to produce something like a Mahalanobis distance in this reduced subspace; after all, I can always compute Euclidean distance, and Mahalanobis just rescales the Euclidean axes. In particular, if the n-dimensional subspace is spanned by the covariance eigenvectors with positive eigenvalues, I could just use the Mahalanobis formula with the non-zero eigenvalues. (It doesn't seem trivial to prove or disprove this conjecture about the eigenvectors, although I would think that the question was settled long ago.)

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You seem to have a solid understanding of linear algebra. You recognize that if p > n then there are at most n-1 linearly independent eigenvectors, which I will call the eigenbasis. Here's what I would suggest:

1. Change coordinates. Represent the data in terms of the projection onto the span of the eigenbasis. In this coordinate system, each observation has n-1 coordinates. [Statisticians call these coordinate the "principal coordinate scores."]

2. Do a regular Mahalanobis distance analysis in this subspace. Or, if there are outliers in this subspace, use MCD to estimate a robust covariance matrix within the subspace.

See "Rediscovering SAS/IML Software" (Wicklin, 2010, p. 7-10) for SAS/IML programs that show how to conduct robust outlier detection and robust PCA. Your situation is similar, but you'll first project onto the eigenspace and then do the subsequent analysis in that coordinate system.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Rick_SAS wrote:

You seem to have a solid understanding of linear algebra. You recognize that if p > n then there are at most n-1 linearly independent eigenvectors, which I will call the eigenbasis. Here's what I would suggest:

1. Change coordinates. Represent the data in terms of the projection onto the span of the eigenbasis. In this coordinate system, each observation has n-1 coordinates. [Statisticians call these coordinate the "principal coordinate scores."]

2. Do a regular Mahalanobis distance analysis in this subspace. Or, if there are outliers in this subspace, use MCD to estimate a robust covariance matrix within the subspace.

See "Rediscovering SAS/IML Software" (Wicklin, 2010, p. 7-10) for SAS/IML programs that show how to conduct robust outlier detection and robust PCA. Your situation is similar, but you'll first project onto the eigenspace and then do the subsequent analysis in that coordinate system.

With all due respect, @Rick_SAS, I think my solution above is a lot easier.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thaks for the quick response, Rick_SAS! Ha ha, I'm not so sure how great my command of linear algebra is, I forgot about the degree of freedom eaten by the sample mean that limits the maximum number of linearly independent covariance eigenvectors to n-1.

It would be nice if the Mahalanobis function had an option to do the projection coordinate change automatically.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

It makes sense to me that the Moore-Penrose generalized inverse (or indeed any generalized inverse) ought to work in this situation ... however ... full disclaimer ... I have not tried to prove this is the case.

In essence, the PROC PLS solution I presented above is computing a generalized inverse.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I'm not sure why you say PROC PRINCOMP can't do this when p>n, it should work unless there's some restriction that I don't know about (mathematically it is possible). But I have never tried.

I do know that you can compute the T-squared (Mahalanobis Distance) from thie data in the case where p>n using PROC PLS, and in the OUTPUT statement from PROC PLS gives you the option to request TSQUARE and have it computed and stored in the output data set.

To obtain Principal Components Analysis and the T-squared value from PROC PLS, you need to make the X variables identical to the Y variables, for example like this, where the option nfac= should really use the number value of n, so if you have 50 data points and 100 variables, it would look like this (and naturally you would make the obvious change if n were some other number); and I have only typed in 5 variables in the model just to save myself some typing, naturally you would type all of your p variables there.

proc pls data=mydata nfac=50; model x1 x2 x3 x4 x5 = x1 x2 x3 x4 x5; output out=pls_out tsquare=tsq; run;

This seems a whole lot simpler than using PROC IML or MCD to do this.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Wow, thanks a lot, Paige Miller, this looks like a great idea!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Back to the original problem, which maybe I (and others?) have misunderstood.

@Top_Katz wrote:

Hi! I have a data set with more variables than observations. I would like to compute "Mahalanobis" distance between all pairs of points in the data, and between arbitrary new points from the "same population" (not just between the individual points and the location center estimator). Can the MCD call do this? Is there another way?

When n > p it is easy to do this for standard Mahalanobis distance in SAS/STAT using PROC PRINCOMP.

Thanks!

The idea of computing a Mahalanobis distance between pairs of variables is different than the ordinary Mahalanobis distance, which is the weighted distance of the data point from the center of the data (the weights are the inverse of the square roots of the eigenvalues). PROC PRINCOMP will compute the eigenvalues and eigenvectors even in the case where p>n.

So if you have the eigenvalues and eigenvectors, you can certainly write your own DATA step or IML function to compute these distances between pairs of data points (or between an individual data point and the center of the data) yourself; even in the case of a new values that were not in the original data set, the same formula applies, you don't have to re-write the formula in this case.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I think this is a good time to ask what you are trying to accomplish, @Top_Katz. What do you want to do with the MD between points? Are you trying to detect outliers? What are you trying to accomplish by computing the MD between points in the first set and the points in the other data set?

IMHO, the MD is undefined when the covariance matrix is singular. If you restrict your attention to coordinate in the (n-1)-dimensional subspace of the nonzero eigenvalues, then the MD does make sense *in that subspace*. The distance you get should be equivalent (or nearly) to using a generalized inverse, but it is expensive to compute a generalized inverse in p dimensions when you can use the SOLVE function in (n-1) dimensions.

Regarding the suitability of PLS, I will wait to see what the OP is trying to accomplish. However, I will mention that the T-square statistic is the MD from each point to the MEAN (center) of the data, not the MD between points. Also, I think scoring the PLS model on a new data will be problematic when the model is

MODEL x1-x5 = x1-x5;

because the usual way to score a PLS model on new data is to use "the missing value trick for responses," which is not feasible for this model.

Let's hear what the OP wants to accomplish.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Rick_SAS wrote:

Also, I think scoring the PLS model on a new data will be problematic when the model is

MODEL x1-x5 = x1-x5;

because the usual way to score a PLS model on new data is to use "the missing value trick for responses," which is not feasible for this model.

Not true, @Rick_SAS.

If you want to use this missing value trick on new responses, there is a way to do this. For existing observations on which you want to build the PLS model, create a second set of variables on the Y side of the equation, x1a x2a x3a x4a x5a, which are identical to the variables on the X side of the equation x1 x2 x3 x4 x5. New observations would have these Y variables x1a x2a x3a x4a x5a values missing, while x1 x2 x3 x4 x5 of course obtain the actual values, and the scoring of the new observations is done as expected.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.