mkeintz
PROC Star

@deega

 

Are your VARS really integers from 0 to 5 as in your sample? 

 

If so, then instead of an algorithm that selects points and determines which are closest to a given point, you could systematically step through the nearest possible locations to see which are occupied by a data point.
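A minimal sketch of that idea, written in Python rather than SAS purely for illustration. It assumes the data points are stored as integer tuples in a set; the function name `nearest_occupied`, the `max_radius` cutoff, and the 2-dimensional sample points are all hypothetical. The number of candidate locations grows very quickly with the number of variables, so the brute-force enumeration below is only meant for a handful of dimensions.

```python
from itertools import product

def nearest_occupied(query, occupied, k=1, max_radius=5):
    """Walk lattice locations around `query` in order of increasing distance
    and return up to k occupied ones as (squared_distance, point) pairs.

    query    -- tuple of ints (hypothetical layout: one int per variable)
    occupied -- set of int tuples, one per data point
    """
    dim = len(query)
    limit = max_radius * max_radius
    # Every non-zero integer offset whose Euclidean length is <= max_radius,
    # ordered from shortest to longest.
    offsets = sorted(
        (off for off in product(range(-max_radius, max_radius + 1), repeat=dim)
         if 0 < sum(d * d for d in off) <= limit),
        key=lambda off: sum(d * d for d in off),
    )
    hits = []
    for off in offsets:
        candidate = tuple(q + d for q, d in zip(query, off))
        if candidate in occupied:  # set lookup, no distance matrix needed
            hits.append((sum(d * d for d in off), candidate))
            if len(hits) == k:
                break
    return hits

points = {(0, 0), (1, 1), (3, 1), (5, 5)}
neighbors = nearest_occupied((0, 0), points, k=2)
```

Here the walk stops as soon as k occupied locations are found, so no distance to any other data point is ever computed. A second data point sitting at exactly the query location is skipped by this sketch, because the zero offset is excluded.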

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
deega
Quartz | Level 8
@mkeintz

No, the variables are not integers; they are real numbers such as 1.23, 0.28, etc. Originally they were integers in different ranges: some variables were binary, some ranged from 1 to 10, some from 30 to 70, and so on. To create clusters and compute distances I standardized the data, and as a result the values became decimals.
Rick_SAS
SAS Super FREQ

What is the business problem that you are trying to solve?  Why do you think you need to compute nearest neighbors of 1M points?  Will an approximate solution suffice?

 

I think you are asking for a very lengthy computation. This computation requires computing (1E6)**2 = 1E12 distances.  

 

I suggest that you read my articles on computing nearest-neighbor distances. In one I show how to compute the k nearest neighbors by using PROC MODECLUS. For your data example, the syntax is

 

ods select none;
ods output neighbor=Neighbor;
proc modeclus data=Have method=1 k=4 Neighbor;
   var Var:;
   ID S_No;
run;
ods select all;

I ran some tests on your 30-dimensional data. I estimate that you can compute the three nearest neighbors for

30,000 obs in 1 minute
50,000 obs in 4.5 minutes
75,000 obs in 11.3 minutes

From these kinds of experiments and the fact that the computation is quadratic in the number of observations, you can predict that 1M observations would require about 40 hours to run in PROC MODECLUS, assuming adequate resources such as RAM.
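The extrapolation can be checked with a few lines of arithmetic. This is only a sketch, in Python for illustration, under the assumption that run time scales exactly as c * n**2; the function name `extrapolate_hours` is made up, and the timings are the three reported above.

```python
# Timings reported above: (observations, minutes)
timings = [(30_000, 1.0), (50_000, 4.5), (75_000, 11.3)]
target_n = 1_000_000

def extrapolate_hours(n_obs, minutes, target=target_n):
    """Scale one measured run to `target` observations, assuming time = c * n**2."""
    return minutes * (target / n_obs) ** 2 / 60.0

# One prediction per measured run
predictions = {n: extrapolate_hours(n, m) for n, m in timings}
for n, hours in predictions.items():
    print(f"from {n:,} obs: about {hours:.1f} hours for {target_n:,} obs")
```

The three runs give roughly 18.5, 30.0, and 33.5 hours. The estimate rises with the size of the run it is based on, so over this range the measured cost grows a little faster than a pure quadratic, which is consistent with a prediction somewhat above 33 hours.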

 

Personally, I would ask whether it is possible to reformulate the problem. Work smarter, not harder. 

 

 

deega
Quartz | Level 8

Actually, for my business problem Jaccard distance (JD) suits best.

How practical is it to use a bubble-sort-style approach while calculating the Jaccard distance, something like the following?

Suppose there are 1M observations and I need the three nearest neighbors (NN) from these 1M observations. I calculate the JD for obs 1 and obs 2, compare them, keep the shorter, and discard the other. Then I compare that shorter distance with the JD for obs 3, again keep the shorter, and so on.
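A minimal sketch of that single-pass idea, in Python for illustration only. It assumes each observation has been reduced to a set of items (for example, the names of the binary variables equal to 1); the record layout, the function names, and the sample data are hypothetical. Instead of keeping a single shortest distance, it keeps the three shortest seen so far in a small heap.

```python
import heapq

def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here: two empty sets count as identical
    return 1.0 - len(a & b) / union

def nearest_by_jaccard(query_id, records, k=3):
    """One pass over `records` (id -> set), retaining only the k smallest distances."""
    query = records[query_id]
    best = []  # max-heap via negated distance; never holds more than k entries
    for other_id, items in records.items():
        if other_id == query_id:
            continue
        d = jaccard_distance(query, items)
        if len(best) < k:
            heapq.heappush(best, (-d, other_id))
        elif -d > best[0][0]:  # closer than the worst distance currently kept
            heapq.heapreplace(best, (-d, other_id))
    return sorted((-neg_d, other_id) for neg_d, other_id in best)

records = {
    1: {"a", "b", "c"},
    2: {"a", "b", "c", "d"},
    3: {"a", "b"},
    4: {"x", "y"},
    5: {"a", "x"},
}
top3 = nearest_by_jaccard(1, records)
```

Keeping a running top three means no full list of distances is ever stored, but the pass still evaluates one distance per other observation. Repeating it for every one of the 1M observations is therefore still on the order of 1E12 distance computations.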
