12-12-2014 06:08 AM

Suppose I have data sets that look like this:

data A;
input x y;
datalines;
23 65
34 13
32 54
43 32
65 21
64 34
;
run;

data B;
input x y;
datalines;
23 43
17 84
;
run;

Data set A has 6 observations and data set B has 2 observations. X and Y are variables in both data sets. I would like to:

I) form the subsets of A that contain 2 observations each, that is, all combinations of 6 things taken 2 at a time (6!/(4!2!) = 15 distinct subsets).

II) calculate the Mahalanobis distance between the 2 observations in B and each of the 15 subsets of A in terms of X and Y.

III) select the subset of A with the smallest distance; call this subset MINDIST.
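Step I can be sketched in SAS/IML with the ALLCOMB function, which returns the row numbers of every combination. This is only a minimal sketch: it assumes the data sets A and B above already exist in the WORK library, and the names Z, k, idx, and subset are chosen here for illustration.

```sas
proc iml;
use A; read all var {x y} into Z; close A;

k = 2;                          /* subset size = number of observations in B */
idx = allcomb(nrow(Z), k);      /* one row per subset; each row holds k row numbers of A */
print (nrow(idx))[label="Number of subsets"];

/* list the observations that belong to each subset */
do i = 1 to nrow(idx);
   subset = Z[idx[i,], ];       /* k x 2 matrix: the observations in the i-th subset */
   print i (idx[i,])[label="Rows of A"] subset;
end;
quit;
```

Each row of idx identifies one candidate subset, so later steps can loop over the rows of idx rather than storing 15 separate data sets.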


Posted in reply to SWEETSAS

12-12-2014 06:43 AM

The Mahalanobis distance assumes a center and covariance matrix for the computation. Please show how you want to compute those parameters.

Since this process is reminiscent of robust regression methods, you might want to see whether SAS/IML (or PROC ROBUSTREG) already contains the algorithm that you want. See the description of the MCD and MVE algorithms.

If you decide to proceed "by hand," the relevant functions you will need are


Posted in reply to Rick_SAS

12-12-2014 07:52 AM

Thanks for the prompt reply Rick.

Given two covariate vectors **xi** and **xj**, the desired formula for the Mahalanobis distance is md = {[**xi**-**xj**]^T S^(-1) [**xi**-**xj**]}^(1/2), where S is the covariance matrix of **X** and S^(1/2) is its Cholesky factor. In the above example, xi=x and xj=y.
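For reference, the formula above can be written out as follows. The second line is a standard identity rather than something stated in the post: it assumes the Cholesky factorization S = U^T U with U upper triangular, which is the convention used by the ROOT function in SAS/IML.

```latex
\[
  d(\mathbf{x}_i, \mathbf{x}_j)
    = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^{T} \, S^{-1} \, (\mathbf{x}_i - \mathbf{x}_j)}
\]
\[
  S = U^{T} U
  \quad\Longrightarrow\quad
  d(\mathbf{x}_i, \mathbf{x}_j)
    = \bigl\lVert U^{-T} (\mathbf{x}_i - \mathbf{x}_j) \bigr\rVert_2
\]
```

The second form shows why the Cholesky factor appears: the Mahalanobis distance is the ordinary Euclidean distance after the data are transformed by U^{-T}.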

Does PROC ROBUSTREG allow computing the distance between variables that are not continuous, since that requires scoring? I would appreciate a demonstration of how to output the subset with the minimum distance.

Thanks,

Jack


Posted in reply to SWEETSAS

12-12-2014 09:15 AM

Can you explain more about Step 3? Each subset of A contains two observations. For concreteness, let the first subset be the points

23 65
34 13

I compute the Mahalanobis distances (MD) to each point in B. I get

MD to B[1,]:
1.34
1.54

MD to B[2,]:
0.99
3.80

Then what? What do you consider to be "the distance between the two observations in B and the subset of A"?

Here's some code:

proc iml;
use A; read all var _num_ into Z; close A;
cov = cov(Z);   /* covariance matrix estimated from all of A */

/* let m be first subset = first 2 obs */
m = Z[1:2,];

use B;
do data;
   read next var _num_ into center;   /* current observation of B */
   md = mahalanobis(m, center, cov);  /* distance from each pt of the subset to the current pt of B */
   print md;
end;
close B;
quit;


Posted in reply to Rick_SAS

12-12-2014 10:25 AM

Thanks Rick!

In step 2, I will compute the Mahalanobis distance between each subset of A and the data set B.

In step 3, I want to identify which of the 15 distinct subsets of A has the minimum Mahalanobis distance to B. The subset of A with the minimum distance is the closest match to B. In other words, I am looking to select the subset of A that best matches B.

The aim of the whole exercise is to enumerate all distinct subsets of A whose size equals the number of observations in B, and then to determine which of those subsets is most similar (closest) to B based on the Mahalanobis distance. In the example given, data set A has 6 observations and data set B has 2 observations, so there are 6 choose 2 = 15 distinct subsets of A. Among those 15 subsets, I want to identify and select the one that is most similar to B, that is, the best match for B.

*Alternatively, I can take each observation from B and find the observation in A that most closely matches it, using the Mahalanobis distance.*
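The observation-by-observation alternative might look like the following SAS/IML sketch. It rests on assumptions not settled in the thread: the covariance matrix is estimated from all of A (as in the earlier code), and the names ZA, ZB, S, nearest, minMD, and match are illustrative.

```sas
proc iml;
use A; read all var {x y} into ZA; close A;
use B; read all var {x y} into ZB; close B;

S = cov(ZA);                         /* assumption: covariance estimated from A */
nearest = j(nrow(ZB), 1, .);         /* row number in A of the closest observation */
minMD   = j(nrow(ZB), 1, .);         /* the corresponding Mahalanobis distance */

do i = 1 to nrow(ZB);
   md = mahalanobis(ZA, ZB[i,], S);  /* MD from every observation of A to B[i,] */
   nearest[i] = md[>:<];             /* index of the smallest distance */
   minMD[i]   = min(md);
end;

match = ZA[nearest, ];               /* matched observations from A, one per row of B */
print nearest minMD match[colname={"x" "y"}];
quit;
```

Note that nothing in this sketch prevents two observations of B from being matched to the same observation of A; matching without replacement would need an extra step.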

Thanks


Posted in reply to SWEETSAS

12-12-2014 10:34 AM

Yes, but the phrase "I will compute the mahalanobis distance between all the subsets in A with the data set in B" does not make sense unless you define the distance between two subsets. I only know how to compute the distance between points. In the example I provided, what is the distance between the first subset of A and the data set B?


Posted in reply to Rick_SAS

12-14-2014 10:20 AM

Thanks Rick. Your question was very helpful. It made me rethink the problem. The reference point is the midpoint of each subset. The distance between two sets, each consisting of a pair of points, may be thought of as the distance between their midpoints. In the case of a circle, the diameter between the pair of points will be the distance.
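Under that midpoint definition, the three steps could be sketched in SAS/IML as below. This is one possible reading of the idea, not a confirmed solution: it assumes the covariance matrix is estimated from all of A, takes the midpoint of a set to be its vector of column means, and uses illustrative names (ZA, ZB, S, midA, midB, dist, best, bestSubset).

```sas
proc iml;
use A; read all var {x y} into ZA; close A;
use B; read all var {x y} into ZB; close B;

S    = cov(ZA);                        /* assumption: covariance estimated from all of A */
midB = mean(ZB);                       /* 1 x 2 midpoint (column means) of B */
idx  = allcomb(nrow(ZA), nrow(ZB));    /* row numbers of every candidate subset of A */

dist = j(nrow(idx), 1, .);
do i = 1 to nrow(idx);
   midA = mean(ZA[idx[i,], ]);         /* midpoint of the i-th subset */
   dist[i] = mahalanobis(midA, midB, S);
end;

best = dist[>:<];                      /* position of the smallest distance */
bestSubset = ZA[idx[best,], ];         /* observations of the closest subset */
print (idx[best,])[label="Rows of A"] (dist[best])[label="MD between midpoints"];

/* write the selected subset to a data set named MINDIST */
create MINDIST from bestSubset[colname={"x" "y"}];
append from bestSubset;
close MINDIST;
quit;
```

A midpoint criterion ignores how spread out each pair is, so two quite different subsets can share the same midpoint and tie; if that matters, a different set-to-set distance would be needed.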