A_SAS_Man
Pyrite | Level 9

I am attempting to find, for every point in one data set, its nearest neighbors in a second data set. I have found one potential workaround here: https://communities.sas.com/t5/Statistical-Procedures/Nearest-neighbour-between-two-datasets/td-p/12... from @PGStats that almost does what I want. The problem is that it leaves some observations with fewer than the desired number of neighbors, because some of the calculated nearest neighbors come from the wrong data set. Keeping with the example linked above, here is some example data:

 

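/* Simulate two data sets of 5 standard-normal variables: */
/* A has 200 rows, B has 2000 rows.                        */
data A B;
    array v var1-var5;
    call streaminit(56645);   /* fixed seed so the data are reproducible */
    do custId = 1 to 200;
        do _n_ = 1 to 5;
            v{_n_} = rand("NORMAL");
        end;
        output A;
    end;
    do custId = 1 to 2000;
        do _n_ = 1 to 5;
            v{_n_} = rand("NORMAL");
        end;
        output B;
    end;
run;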

In @PGStats's solution, he merges the two data sets, calculates the nearest neighbors, and then removes those nearest neighbors that don't come from the correct data set. I am looking instead for a method that takes a row in A and finds its k nearest neighbors in B. I would strongly prefer a solution using proc modeclus, if there is a workaround that does what I want, because my organization doesn't have access to some of the other procs people normally use to calculate knn (proc iml, for example), and I know that I have proc modeclus. Please let me know if any clarification is needed.
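For reference, the "take a row in A, find its k nearest rows in B" idea can be sketched in Base SAS alone, without proc modeclus or proc iml. This is a brute-force illustration under stated assumptions, not a modeclus workaround: it uses Euclidean distance on var1-var5 from the example data, and the names pairs, knn, idA, idB, dist, nbr and the macro variable k are hypothetical.

```sas
%let k = 5;   /* hypothetical number of neighbors to keep per row of A */

/* Pair every row of A with every row of B and compute the   */
/* Euclidean distance over var1-var5, sorted within each idA. */
proc sql;
    create table pairs as
    select a.custId as idA,
           b.custId as idB,
           sqrt( (a.var1 - b.var1)**2 + (a.var2 - b.var2)**2
               + (a.var3 - b.var3)**2 + (a.var4 - b.var4)**2
               + (a.var5 - b.var5)**2 ) as dist
    from A as a, B as b
    order by idA, dist;
quit;

/* Keep only the k closest B rows for each row of A. */
data knn;
    set pairs;
    by idA;
    if first.idA then nbr = 0;   /* restart the counter for each A row */
    nbr + 1;                     /* sum statement: nbr is retained     */
    if nbr <= &k;
run;
```

Every row of A ends up with exactly k neighbors, all drawn from B; ties in distance are broken arbitrarily by the sort. The pairs table holds one row per A-B combination (200 x 2000 = 400,000 here), so this approach is only practical while that product stays manageable.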

2 REPLIES 2
ballardw
Super User

How do you intend to indicate K?

Will K vary for any of the potential pairs?

How many actual records/points do you have to work with in each set?

What definition of "distance" are you using to determine 'nearest'?

 

And have you looked at Proc SPP?

A_SAS_Man
Pyrite | Level 9

1. I'm not sure what "indicate K" means. Is that how many nearest neighbors I want? Edit: If so, I'm hoping to have 100+ nearest neighbors, but I'd like to do some analysis to find an ideal number, so I'm open to starting anywhere.
2. Do you mean will the number of neighbors vary for a given point? No, I would like the same number calculated for each point.
3. I'm going to have several hundred thousand or more rows (maybe up to a million, depending on computational capabilities) in data set B, and whatever the algorithm can handle in A.
4. I'm open to any definition of nearest. I'm currently using the default in modeclus, which I believe is Euclidean, but I would be open to cosine, Manhattan, etc. if it offered some benefit.
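To make the three candidate definitions of "nearest" concrete, here is a small DATA step sketch that computes each of them for one pair of points. The two vectors are made-up toy values, and the data set and variable names (dist_demo, euclid, manhattan, cosine_dist) are hypothetical; cosine distance is taken here as 1 minus the cosine similarity.

```sas
data dist_demo;
    /* Two toy points; the values are illustrative only */
    array x{3} _temporary_ (1 2 3);
    array y{3} _temporary_ (2 0 3);

    sumsq = 0; manhattan = 0; dot = 0; nx = 0; ny = 0;
    do i = 1 to dim(x);
        sumsq     = sumsq + (x{i} - y{i})**2;        /* squared differences  */
        manhattan = manhattan + abs(x{i} - y{i});    /* absolute differences */
        dot       = dot + x{i} * y{i};               /* inner product        */
        nx        = nx + x{i}**2;                    /* squared norm of x    */
        ny        = ny + y{i}**2;                    /* squared norm of y    */
    end;

    euclid      = sqrt(sumsq);
    cosine_dist = 1 - dot / (sqrt(nx) * sqrt(ny));

    keep euclid manhattan cosine_dist;
run;

proc print data=dist_demo;
run;
```

Euclidean and Manhattan distance depend on the scale of each variable, while cosine distance depends only on the angle between the two points, so the choice can change which neighbors count as nearest.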

I have not looked at Proc SPP yet; I will see if I have access to it.

 

Edit 2: As far as I can tell, SPP doesn't seem to have KNN or k-means capabilities. It will definitely require more reading on my part to see if it will work, but if you have a specific idea for how it would apply to my problem, I would be interested in hearing it.
