mkeintz
PROC Star

@deega

 

Are your VARS really integers from 0 to 5 as in your sample? 

 

If so, then instead of an algorithm that selects points and determines which are closest to a given point, you could systematically step through the nearest possible locations to see which are occupied by a data point.
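
A minimal sketch of that idea, written in Python rather than SAS purely for illustration and shown in 2-D only. It assumes integer coordinates; the names `build_cell_index`, `grid_neighbors`, and `max_radius` are hypothetical. The cells around a point are probed in order of increasing distance, and the search stops once k occupied cells have been found. Note that the number of cells to probe grows very quickly with the number of variables, so this may not be practical for 30 variables.

```python
from collections import defaultdict
from itertools import product

def build_cell_index(points):
    """Map each integer coordinate tuple to the ids of the points located there."""
    index = defaultdict(list)
    for pid, coords in points.items():
        index[tuple(coords)].append(pid)
    return index

def grid_neighbors(points, index, query_id, k=3, max_radius=5):
    """Probe lattice cells around the query in order of increasing distance.

    Returns up to k (squared_distance, id) pairs; ties are broken arbitrarily.
    """
    query = points[query_id]
    # Every integer offset within max_radius per axis, nearest offsets first.
    offsets = sorted(
        product(range(-max_radius, max_radius + 1), repeat=len(query)),
        key=lambda off: sum(d * d for d in off),
    )
    found = []
    for off in offsets:
        cell = tuple(q + d for q, d in zip(query, off))
        dist2 = sum(d * d for d in off)
        for pid in index.get(cell, ()):
            if pid != query_id:          # do not report the point itself
                found.append((dist2, pid))
        if len(found) >= k:              # cells are visited nearest-first
            break
    return found[:k]

# Made-up example data: id -> integer coordinates.
points = {1: (0, 0), 2: (1, 0), 3: (3, 4), 4: (0, 2), 5: (5, 5)}
index = build_cell_index(points)
print(grid_neighbors(points, index, query_id=1))
```

No distances between pairs of data points are computed here; the cost depends on how many empty cells must be probed before k occupied ones turn up.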

deega
Quartz | Level 8
@mkeintz

No, the variables are not integers; they are real numbers such as 1.23, 0.28, etc. Originally they were integers in different ranges: some variables were binary, some ranged from 1 to 10, some from 30 to 70, and so on. To create clusters and compute distances I standardized the data, which turned the values into decimals.
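
To illustrate what I mean, here is a small sketch in Python (not SAS, just to show the arithmetic). It assumes z-score standardization, that is, subtracting the mean and dividing by the sample standard deviation; the values and the name `standardize` are made up for the example.

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

binary = [0, 1, 1, 0, 1]        # a made-up binary variable
score = [30, 45, 70, 52, 38]    # a made-up variable in the 30 to 70 range

# After standardizing, both columns are on the same scale but no longer integers.
print([round(z, 2) for z in standardize(binary)])
print([round(z, 2) for z in standardize(score)])
```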
Rick_SAS
SAS Super FREQ

What is the business problem that you are trying to solve?  Why do you think you need to compute nearest neighbors of 1M points?  Will an approximate solution suffice?

 

I think you are asking for a very lengthy computation. This computation requires computing (1E6)**2 = 1E12 distances.  

 

I suggest that you read my articles on computing nearest-neighbor distances. In one, I show how to compute the k nearest neighbors by using PROC MODECLUS. For your example data, the syntax is

 

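ods select none;                 /* suppress printed output */
ods output neighbor=Neighbor;    /* save the nearest-neighbor table to a data set */
proc modeclus data=Have method=1 k=4 Neighbor;
   var Var:;                     /* all variables whose names begin with Var */
   ID S_No;
run;
ods select all;                  /* restore normal ODS output */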

I ran some tests on your 30-dimensional data. I estimate that you can compute the three nearest neighbors for

30,000 obs in 1 minute
50,000 obs in 4.5 minutes
75,000 obs in 11.3 minutes

From these kinds of experiments and the fact that the computation is quadratic in the number of observations, you can predict that 1M observations would require about 40 hours to run in PROC MODECLUS, assuming adequate resources such as RAM.
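
The extrapolation can be sketched as follows, in Python rather than SAS and only as back-of-the-envelope arithmetic. It assumes the run time scales exactly with the square of the number of observations; the function name `extrapolate_quadratic` is hypothetical.

```python
def extrapolate_quadratic(n_base, minutes_base, n_target):
    """Scale a measured run time by the squared ratio of observation counts."""
    return minutes_base * (n_target / n_base) ** 2

# (observations, minutes) from the tests reported above.
timings = [(30_000, 1.0), (50_000, 4.5), (75_000, 11.3)]

for n, minutes in timings:
    hours = extrapolate_quadratic(n, minutes, 1_000_000) / 60
    print(f"from {n:>6,} obs: about {hours:.1f} hours for 1M obs")
```

Scaling from the largest test gives roughly 33 hours, and the smaller tests give lower figures. The measured times therefore grow somewhat faster than a pure quadratic, so a rounded-up figure of about 40 hours seems a reasonable estimate.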

 

Personally, I would ask whether it is possible to reformulate the problem. Work smarter, not harder. 

 

 

deega
Quartz | Level 8

Actually, for my business problem the Jaccard distance (JD) suits best.

How practical is it to use something like bubble sort while calculating the Jaccard distance, as described below?

Suppose there are 1M observations and I need the three nearest neighbors from these 1M observations. I calculate the JD for obs 1 and obs 2, compare them, keep the shortest, and discard the other. Then I compare the shortest of (1, 2) with the JD of obs 3 and again keep the shortest, and so on.
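
A rough sketch of the scan described above, in Python rather than SAS and only to make the idea concrete. It assumes each observation can be represented as a set of features (for example, the names of the binary variables that equal 1), since the Jaccard distance is defined on sets; the data and the names `jaccard_distance` and `k_nearest` are made up. Strictly speaking this is a running selection of the k smallest distances rather than a bubble sort.

```python
import heapq

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; two empty sets count as distance 0."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def k_nearest(query_id, items, k=3):
    """One pass over all observations, keeping only the k smallest distances."""
    best = []  # heap of (-distance, id); the root is the worst neighbor kept so far
    for other_id, features in items.items():
        if other_id == query_id:
            continue
        d = jaccard_distance(items[query_id], features)
        if len(best) < k:
            heapq.heappush(best, (-d, other_id))
        elif d < -best[0][0]:
            # Closer than the worst kept neighbor: discard that one, keep this one.
            heapq.heapreplace(best, (-d, other_id))
    return sorted((-neg_d, oid) for neg_d, oid in best)

# Made-up example data: id -> set of features.
items = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "b", "c", "d"},
    4: {"x", "y"},
    5: {"a", "c", "e"},
}
print(k_nearest(1, items))
```

Each call makes a single pass and stores only k distances, so memory is not the issue. The cost is still one distance calculation per pair of observations, so doing this for every one of the 1M observations would need on the order of 1E12 distance calculations.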
