Are your VARS really integers from 0 to 5 as in your sample?
If so, then instead of an algorithm that selects points and determines which have the closest locations to a given point, you could systematically step through the nearest possible locations to see which are occupied by a data point.
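To make the idea concrete, here is a minimal sketch in Python (not SAS), assuming the data are integer vectors on a 0-5 grid and that Euclidean distance is the metric. The function name `probe_neighbors`, the `occupied` dictionary (grid location mapped to a list of observation IDs), and the `max_changed` parameter are hypothetical names introduced only for this illustration.

```python
from itertools import combinations, product

def probe_neighbors(query, occupied, max_changed=2, lo=0, hi=5):
    """Yield (squared_distance, location, ids) for occupied grid cells near
    `query`, nearest shell first.

    Only steps of +/-1 in at most `max_changed` coordinates are tried.
    A cell reached by n such steps has squared Euclidean distance n, and any
    cell with a step of size 2 has squared distance >= 4, so the ordering is
    exact as long as max_changed <= 3.
    """
    dim = len(query)
    for n_changed in range(max_changed + 1):
        for idxs in combinations(range(dim), n_changed):
            for signs in product((-1, 1), repeat=n_changed):
                cand = list(query)
                for i, s in zip(idxs, signs):
                    cand[i] += s
                # Skip candidates that fall off the 0..5 grid.
                if all(lo <= c <= hi for c in cand):
                    cand = tuple(cand)
                    if cand in occupied:
                        yield n_changed, cand, occupied[cand]

# Hypothetical toy data in 3 dimensions: location -> observation IDs.
occupied = {(0, 0, 0): [1], (1, 0, 0): [2], (1, 1, 0): [3], (5, 5, 5): [4]}
hits = list(probe_neighbors((0, 0, 0), occupied))
```

The first shell (squared distance 0) returns the query's own cell, so any other IDs stored there are exact duplicates. In 30 dimensions the number of candidate cells grows quickly with `max_changed`, so this is only attractive when neighbors are expected to be very close.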
What is the business problem that you are trying to solve? Why do you think you need to compute nearest neighbors of 1M points? Will an approximate solution suffice?
I think you are asking for a very lengthy computation. This computation requires computing (1E6)**2 = 1E12 distances.
I suggest that you read my articles on computing nearest-neighbor distances. In one of them I show how to compute the k nearest neighbors by using PROC MODECLUS. For your data example, the syntax is
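ods select none;                 /* suppress displayed output */
ods output neighbor=Neighbor;    /* save the nearest-neighbor table to a data set */
proc modeclus data=Have method=1 k=4 Neighbor;
   var Var:;                     /* all variables whose names begin with "Var" */
   id S_No;
run;
ods select all;                  /* restore displayed output */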
I ran some tests on your 30-dimensional data. I estimate that you can compute the three nearest neighbors for
30,000 obs in 1 minute
50,000 obs in 4.5 minutes
75,000 obs in 11.3 minutes.
From these kinds of experiments and the fact that the computation is quadratic in the number of observations, you can predict that 1M observations would require about 40 hours to run in PROC MODECLUS, assuming adequate resources such as RAM.
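As a rough check on that extrapolation, the following Python sketch scales each reported timing to 1M observations under the assumption that run time grows exactly as n**2. The function name `extrapolate_hours` is hypothetical; the timings are the three figures listed above.

```python
# Timings reported above: (observations, minutes).
timings = [(30_000, 1.0), (50_000, 4.5), (75_000, 11.3)]

def extrapolate_hours(n_obs, minutes, target=1_000_000):
    """Scale one timing to `target` observations, assuming time ~ n**2."""
    return minutes * (target / n_obs) ** 2 / 60.0

estimates = {n: extrapolate_hours(n, m) for n, m in timings}
for n, hours in estimates.items():
    print(f"from {n:>6} obs: about {hours:.1f} hours for 1M obs")
```

Pure quadratic scaling gives roughly 18 to 33 hours, depending on which run is used as the baseline. The implied cost per pair of observations rises across the three runs, so a somewhat larger figure, such as the one quoted above, is plausible.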
Personally, I would ask whether it is possible to reformulate the problem. Work smarter, not harder.
Actually, for my business problem, Jaccard distance (JD) suits best.
How practical is it to use something like bubble sort while calculating the Jaccard distance, as described below?
Suppose there are 1M observations and I need the three nearest neighbors from among them. I calculate the JD for obs 1 and 2, compare them, keep the shortest, and discard the other. Then I compare the shortest of (1,2) with the JD of 3 and again keep the shortest, and so on.
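The single-pass idea described above can be sketched in Python (not SAS) as follows. This assumes each observation can be represented as a set of items, and it keeps a running "best three" instead of a single shortest distance. The names `jaccard_distance`, `k_nearest`, and the toy `data` dictionary are hypothetical.

```python
import heapq

def jaccard_distance(a, b):
    """Jaccard distance 1 - |A & B| / |A | B|; two empty sets count as distance 0."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def k_nearest(query_id, data, k=3):
    """Scan all other observations once and keep the k smallest distances."""
    query = data[query_id]
    candidates = (
        (jaccard_distance(query, items), other_id)
        for other_id, items in data.items()
        if other_id != query_id
    )
    # nsmallest keeps only k candidates in memory while it scans.
    return heapq.nsmallest(k, candidates)

# Hypothetical toy data: observation ID -> set of items.
data = {1: {"a", "b", "c"}, 2: {"a", "b", "c"}, 3: {"a", "b"}, 4: {"a"}, 5: {"x"}}
nearest = k_nearest(1, data)
```

Keeping only the running best avoids storing or sorting all the distances, but each query still computes a distance to every other observation, so the total over 1M observations remains on the order of 1E12 distance computations.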