Re: shortest distance

mkeintz · Posted 01-25-2017 10:58 PM

Are your VARS really integers from 0 to 5 as in your sample?

If so, then instead of using an algorithm that selects points and determines which are closest to a given point, you could systematically step through the nearest possible locations to see which are occupied by a data point.

deega · Posted 01-25-2017 11:49 PM

@mkeintz

No, the variables are not integers; they are real numbers such as 1.23 and 0.28. They were originally integers in different ranges: some variables were binary, some ranged from 1 to 10, some from 30 to 70, and so on. To create clusters and find distances I standardized the data, which turned the values into decimals.

Rick_SAS · Posted 01-26-2017 09:41 AM

What is the business problem that you are trying to solve? Why do you think you need to compute nearest neighbors of 1M points? Will an approximate solution suffice?

I think you are asking for a very lengthy computation. This computation requires computing (1E6)**2 = 1E12 distances.

I suggest that you read my articles on computing nearest-neighbor distances. In one I show how to compute the k nearest neighbors by using PROC MODECLUS. For your data example, the syntax is

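ods select none;                  /* suppress displayed output */
ods output neighbor=Neighbor;     /* save the Neighbor table as a data set */
/* K= counts the point itself, so k=4 requests three nearest neighbors */
proc modeclus data=Have method=1 k=4 Neighbor;
   var Var:;
   ID S_No;
run;
ods select all;                   /* restore displayed output */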

I ran some tests on your 30-dimensional data. I estimate that you can compute the three nearest neighbors for

30,000 obs in 1 minute

50,000 obs in 4.5 minutes

75,000 obs in 11.3 minutes.

From these kinds of experiments and the fact that the computation is quadratic in the number of observations, you can predict that 1M observations would require about 40 hours to run in PROC MODECLUS, assuming adequate resources such as RAM.
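The extrapolation can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes run time scales exactly with the square of the number of observations and rescales each of the three timings above to 1M rows. The variable names (n_obs, mins, target, base, hours) are made up for the example.

```sas
/* Sketch: scale each reported timing to 1M observations,
   assuming run time grows with the square of the row count. */
data _null_;
   array n_obs[3] _temporary_ (30000 50000 75000);  /* rows timed above        */
   array mins[3]  _temporary_ (1 4.5 11.3);         /* minutes reported above  */
   target = 1e6;                                    /* rows in the real problem */
   do i = 1 to 3;
      base  = n_obs[i];
      hours = mins[i] * (target / base)**2 / 60;    /* quadratic scaling, in hours */
      put base= comma8. hours= 6.1;                 /* write the estimate to the log */
   end;
run;
```

Under that assumption the three timings project to roughly 18.5, 30.0, and 33.5 hours. The projection climbs as the base sample gets larger, so the measured growth is somewhat faster than quadratic over this range, which is in line with an estimate of about 40 hours for the full data.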

Personally, I would ask whether it is possible to reformulate the problem. Work smarter, not harder.

deega · Posted 02-06-2017 03:52 AM

Actually, for my business problem Jaccard distance (JD) suits best.

How practical would it be to use something like bubble sort while calculating the Jaccard distance, as follows?

Suppose there are 1M observations and I need the three nearest neighbors (NN). I calculate the JD for obs 1 and obs 2, compare them, keep the shortest, and discard the other. Then I compare that shortest of (1, 2) with the JD of obs 3 and again keep the shortest, and so on.
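The procedure described here is closer to a running "keep the best three so far" scan than to a bubble sort. A minimal DATA step sketch of that scan for a single query row is shown below. It rests on several assumptions that are not stated in the thread: the data set Have holds a unique ID S_No and 30 variables named Var1-Var30, those variables are binary (0/1) so that Jaccard distance is defined as 1 minus the number of positions where both rows are 1 divided by the number of positions where either row is 1, and the query row is simply the first observation. The output data set NN3 and the names query_id, nn_id, and nn_d are hypothetical.

```sas
/* Sketch only: three nearest neighbors of ONE query row under Jaccard
   distance. Assumes Have contains S_No and binary (0/1) Var1-Var30;
   the query row is taken to be the first observation. */
data NN3(keep=query_id nn_id1-nn_id3 nn_d1-nn_d3);
   /* read the query row once; variables read by SET are retained */
   if _n_ = 1 then
      set Have(obs=1 keep=S_No Var1-Var30
               rename=(S_No=query_id Var1-Var30=q1-q30));
   set Have(keep=S_No Var1-Var30) end=last;

   array q[30]    q1-q30;        /* query row                               */
   array v[30]    Var1-Var30;    /* current candidate row                   */
   array nn_d[3]  (2 2 2);       /* best distances so far, ascending;       */
                                 /* 2 is larger than any Jaccard distance   */
   array nn_id[3];               /* S_No of the rows behind those distances */
   retain nn_id1-nn_id3;

   if S_No ne query_id then do;  /* skip the query row itself */
      both = 0;
      either = 0;
      do j = 1 to 30;
         both   = both   + (q[j] = 1 and v[j] = 1);
         either = either + (q[j] = 1 or  v[j] = 1);
      end;
      /* convention assumed here: two all-zero rows have distance 0 */
      if either > 0 then d = 1 - both / either;
      else d = 0;

      /* keep d only if it beats the current third-best distance */
      if d < nn_d[3] then do;
         k = 3;
         do while (k > 1);
            if d >= nn_d[k-1] then leave;
            nn_d[k]  = nn_d[k-1];   /* shift the worse entry down */
            nn_id[k] = nn_id[k-1];
            k = k - 1;
         end;
         nn_d[k]  = d;
         nn_id[k] = S_No;
      end;
   end;

   if last then output;          /* one output row: the three best matches */
run;
```

One pass like this is linear in the number of rows, so it is cheap for a single query. Repeating it for every one of the 1M observations still evaluates on the order of 1E12 distances, which is the same quadratic cost discussed earlier in the thread. If fewer than three other rows exist, the unused nn_d slots keep the placeholder value 2 and the matching nn_id slots stay missing.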