## Selecting subset with smallest distance

Suppose I have two data sets that look like this:

    data A;
    input x y;
    datalines;
    23 65
    34 13
    32 54
    43 32
    65 21
    64 34
    ;
    run;

    data B;
    input x y;
    datalines;
    23 43
    17 84
    ;
    run;

Data set A has 6 observations and data set B has 2 observations. X and Y are the variables in both data sets. I would like to:

I) Form the SUB subsets of A, which are the combinations of 6 observations taken 2 at a time (6!/(4! 2!) = 15 distinct subsets).

II) Calculate the Mahalanobis distance between the 2 observations in B and each of the 15 subsets of A in terms of X and Y.

III) Select the subset of A with the smallest distance and call this subset MINDIST.


## Re: Selecting subset with smallest distance

The Mahalanobis distance assumes a center and covariance matrix for the computation.  Please show how you want to compute those parameters.

Since this process is reminiscent of robust regression methods, you might want to see whether SAS/IML (or PROC ROBUSTREG) already contains the algorithm that you want. See the description of the MCD and MVE algorithms.

If you decide to proceed "by hand," the relevant functions you will need are

The ALLCOMB function

The MAHALANOBIS function
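As a minimal sketch of the enumeration step, the following SAS/IML program uses ALLCOMB to list every pair of row numbers of A. The matrix `Z` is typed in from the data set A in the question so that the example is self-contained; the names `Z`, `idx`, and `subset` are illustrative only.

```sas
proc iml;
/* observations of data set A, copied from the question */
Z = {23 65, 34 13, 32 54, 43 32, 65 21, 64 34};

/* each row of idx holds the row numbers of one 2-observation subset */
idx = allcomb(nrow(Z), 2);
print (nrow(idx))[label="Number of subsets"];

do i = 1 to nrow(idx);
   subset = Z[idx[i,], ];     /* 2 x 2 matrix: the i_th pair of observations */
   /* a distance for this subset would be computed here */
end;
quit;
```

The loop body is deliberately left empty because the distance between a subset of A and the data set B has not been defined at this point in the thread.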

## Re: Selecting subset with smallest distance

Thanks for the prompt reply Rick.

Given two covariate vectors $x_i$ and $x_j$, the desired formula for the Mahalanobis distance is $md = \{(x_i - x_j)^T S^{-1} (x_i - x_j)\}^{1/2}$, where $S$ is the covariance matrix of X and $S^{1/2}$ is its Cholesky factor. In the above example, $x_i = x$ and $x_j = y$.
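The formula can be checked against the built-in MAHALANOBIS function with a short SAS/IML sketch. It rests on two assumptions that the post does not confirm: that the two vectors are taken to be two observations (rows) of A, and that S is the sample covariance matrix of A. The names `mdHand` and `mdFunc` are illustrative.

```sas
proc iml;
/* observations of data set A, copied from the question */
Z = {23 65, 34 13, 32 54, 43 32, 65 21, 64 34};

S  = cov(Z);                  /* assumption: S is the sample covariance of A */
xi = Z[1,];                   /* assumption: xi is the first observation     */
xj = Z[2,];                   /* assumption: xj is the second observation    */

d      = xi - xj;                      /* 1 x 2 difference vector            */
mdHand = sqrt( d * inv(S) * d` );      /* the formula, written out directly  */
mdFunc = mahalanobis(xi, xj, S);       /* built-in function, same S          */
print mdHand mdFunc;
quit;
```

If both assumptions hold, the two printed values should agree up to rounding.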

Does PROC ROBUSTREG allow computing the distance for variables that are not continuous, since that requires scoring? I would appreciate a demonstration of how to output the subset with the minimum distance.

Thanks,

Jack

## Re: Selecting subset with smallest distance

Can you explain more about Step 3? Each subset of A contains two observations. For concreteness, let the first subset be the points

23 65

34 13

I compute the Mahalanobis distances (MD) to each point in B.  I get

MD to B[1,]:

1.34

1.54

MD to B[2,]:

0.99

3.80

Then what? What do you consider to be "the distance between the two observations in B and the subset of A"?

Here's some code:

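    proc iml;
    /* read the numeric variables (x and y) of A into a 6 x 2 matrix */
    use A;  read all var _num_ into Z;  close A;

    cov = cov(Z);   /* covariance matrix estimated from all of A */
    /* let m be first subset = first 2 obs */
    m = Z[1:2,];

    /* read B one observation at a time and use it as the center */
    use B;
    do data;
       read next var _num_ into center;
       md = mahalanobis(m, center, cov); /* distance from each pt of the subset to this pt of B */
       print md;
    end;
    close B;
    quit;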

## Re: Selecting subset with smallest distance

Thanks, Rick!

In step 2, I will compute the Mahalanobis distance between each subset of A and the data set B.

In step 3, I want to identify which of the 15 distinct subsets of A has the minimum Mahalanobis distance. Essentially, the subset of A that has the minimum Mahalanobis distance to B is the closest match to B. In other words, I am looking to select the subset of A that best matches B.

The aim of the whole exercise is to enumerate all distinct subsets of A whose size equals the number of observations in B, and then to determine which of those subsets is most similar (closest) to B based on the Mahalanobis distance. In the example given, data set A has 6 observations and data set B has 2 observations, so there are 6 choose 2 = 15 distinct subsets. This is similar to identifying the subset of A that is the best match for B.

Alternatively, I can take each observation from B and find the observation in A that most closely matches it using the Mahalanobis distance.

Thanks
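The alternative mentioned above, matching each observation of B to its nearest observation in A, only needs point-to-point distances. Here is a minimal SAS/IML sketch. The matrices are typed in from the question, the covariance is assumed to be estimated from A, and `match` is an illustrative name.

```sas
proc iml;
/* data sets A and B, copied from the question */
Z = {23 65, 34 13, 32 54, 43 32, 65 21, 64 34};
B = {23 43, 17 84};

S = cov(Z);                        /* assumption: covariance estimated from A */
match = j(nrow(B), 1, .);          /* row of A matched to each row of B       */
do k = 1 to nrow(B);
   md = mahalanobis(Z, B[k,], S);  /* 6 x 1 distances from rows of A to B[k,] */
   match[k] = md[>:<];             /* index of the smallest distance          */
end;
print match;
quit;
```

Nothing in this sketch prevents two observations of B from being matched to the same observation of A. If the matches must be distinct, an extra rule is needed.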

## Re: Selecting subset with smallest distance

Yes, but the phrase "I will compute the mahalanobis distance between all the subsets in A with the data set in B" does not make sense unless you define the distance between two subsets.  I only know how to compute the distance between points. In the example I provided, what is the distance between the first subset of A and the data set B?

## Re: Selecting subset with smallest distance

Thanks, Rick. Your question was very helpful. It made me rethink the problem. The reference point is the midpoint of each subset. The distance between two sets, each consisting of a pair of points, may be thought of as the distance between their midpoints. In the case of a circle, the diameter between the pair of points would be the distance.
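One way to read the midpoint idea is sketched below in SAS/IML: take the mean of each 2-observation subset of A, measure its Mahalanobis distance to the mean of B, and keep the subset with the smallest distance. This is a sketch under stated assumptions, not a confirmed solution from the thread: the covariance is estimated from all of A, "midpoint" is taken to be the column means, and the names `cA`, `cB`, `best`, and `MinDist` are illustrative.

```sas
proc iml;
/* data sets A and B, copied from the question */
Z = {23 65, 34 13, 32 54, 43 32, 65 21, 64 34};
B = {23 43, 17 84};

S   = cov(Z);                      /* assumption: covariance estimated from A */
cB  = mean(B);                     /* 1 x 2 midpoint of B                     */
idx = allcomb(nrow(Z), 2);         /* 15 x 2 matrix of row-number pairs       */
md  = j(nrow(idx), 1, .);
do i = 1 to nrow(idx);
   cA    = mean( Z[idx[i,], ] );   /* midpoint of the i_th subset of A        */
   md[i] = mahalanobis(cA, cB, S); /* midpoint-to-midpoint distance           */
end;

best    = md[>:<];                 /* index of the smallest distance          */
MinDist = Z[idx[best,], ];         /* the two observations of that subset     */
print (idx[best,])[label="Rows of A"] (md[best])[label="Distance"], MinDist;

/* write the selected subset to a data set named MINDIST */
create MINDIST from MinDist[colname={"x" "y"}];
append from MinDist;
close MINDIST;
quit;
```

A different definition of the set-to-set distance would change only the two lines that compute `cA` and `md[i]`.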
