Calculating the nearest neighbors of an observation on data in a different set

A_SAS_Man · Posted 12-05-2019 12:45 PM

I am attempting to find, for every point in one data set, its nearest neighbors in a second data set. I have found one potential workaround here: https://communities.sas.com/t5/Statistical-Procedures/Nearest-neighbour-between-two-datasets/td-p/12... from @PGStats that almost does what I want. The problem is that it leaves some observations with fewer than the desired number of neighbors, because some of the calculated nearest neighbors come from the wrong data set. In keeping with the example linked above, here is some example data:

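/* Simulate two data sets with five standard normal variables each */
data A B;
array v var1-var5;
call streaminit(56645);   /* fixed seed so the data are reproducible */
/* A: the 200 points whose neighbors we want */
do custId = 1 to 200;
     do _n_ = 1 to 5;
          v{_n_} = rand("NORMAL");
          end;
     output A;
     end;
/* B: the 2000 candidate neighbors */
do custId = 1 to 2000;
     do _n_ = 1 to 5;
          v{_n_} = rand("NORMAL");
          end;
     output B;
     end;
run;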

In @PGStats's solution, he merges the two data sets, calculates the nearest neighbors, and then removes the nearest neighbors that don't come from the correct data set. What I am looking for is a method that takes a row in A and finds its k nearest neighbors in B. I would strongly prefer a solution using proc modeclus, if there is a workaround that does what I want, because my organization doesn't have access to some of the other procs people normally use to calculate kNN (proc iml, for example), and I know that I have proc modeclus. Please let me know if any clarifications are needed.

ballardw · Posted 12-05-2019 01:46 PM

How do you intend to indicate K?

Will K vary for any of the potential pairs?

How many actual records/points do you have to work with in each set?

What definition of "distance" are you using to determine 'nearest'?

And have you looked at Proc SPP?

A_SAS_Man · Posted 12-05-2019 01:53 PM

1. I'm not sure what "indicate K" means. Is that how many nearest neighbors I want? Edit: If so, I'm hoping to have 100+ nearest neighbors, but I'd like to do some analysis to find an ideal number, so I'm open to starting anywhere.
2. Do you mean will the number of neighbors vary for a given point? No, I would like the same number of neighbors for every point.
3. I'm going to have several hundred thousand or more observations (maybe up to a million, depending on computational capabilities) in data set B, and whatever the algorithm can handle in A.
4. I'm open to any definition of nearest. I'm currently using the default in modeclus, which I believe is Euclidean, but I would be open to cosine, Manhattan, etc. if it offered some benefit.

I have not looked at Proc SPP yet; I will see if I have access to it.

Edit 2: As far as I can tell, SPP doesn't seem to have kNN or k-means capabilities. It is definitely going to require more reading on my part to see if it will work, but if you have a specific idea for how it would apply to my problem, I would be interested in hearing it.
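
For reference, the "take a row in A and find its k nearest neighbors in B" idea can be sketched by brute force with PROC SQL and a DATA step, which needs neither proc iml nor proc modeclus. This is only a minimal sketch under stated assumptions, not @PGStats's solution: the macro variable k and the names pairs, knn, idA, idB, dist and nbr_rank are made up for illustration, and Euclidean distance is assumed.

```sas
/* Hypothetical macro variable: number of neighbors to keep per row of A */
%let k = 5;

/* Same simulated data as in the original post */
data A B;
  array v var1-var5;
  call streaminit(56645);
  do custId = 1 to 200;
    do _n_ = 1 to 5;
      v{_n_} = rand("NORMAL");
    end;
    output A;
  end;
  do custId = 1 to 2000;
    do _n_ = 1 to 5;
      v{_n_} = rand("NORMAL");
    end;
    output B;
  end;
run;

/* Every (A row, B row) pair with its Euclidean distance, sorted so that
   the closest B rows come first within each A row */
proc sql;
  create table pairs as
  select a.custId as idA,
         b.custId as idB,
         sqrt( (a.var1 - b.var1)**2 + (a.var2 - b.var2)**2 +
               (a.var3 - b.var3)**2 + (a.var4 - b.var4)**2 +
               (a.var5 - b.var5)**2 ) as dist
  from A as a, B as b
  order by idA, dist;
quit;

/* Keep only the first &k pairs (the k smallest distances) for each A row */
data knn;
  set pairs;
  by idA;
  if first.idA then nbr_rank = 0;
  nbr_rank + 1;              /* sum statement: nbr_rank is retained across rows */
  if nbr_rank <= &k;
run;
```

Because every neighbor is drawn from B by construction, each row of A ends up with exactly k neighbors as long as B has at least k rows. The cost is the full cross join: 200 x 2000 = 400,000 pairs here, and the pair count grows as the product of the two row counts, so with several hundred thousand rows in B it would probably be necessary to process A in batches. A different metric only changes the dist expression, for example a sum of abs() terms for Manhattan distance.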
