hernan7800
Fluorite | Level 6

Hello everybody!

 

I have a table with 850k obs and 20 variables and I need to do this:

 

  1. For each observation, calculate the distance (a measure I have defined) between that observation and all the others.
  2. Keep the subset of the 1K closest neighbors.
  3. Do some calculations.
  4. Repeat this for each observation, that is, 850K times.

 

To do this I wrote a program using a macro:

%macro d;

%do n=1 %to 850000;

Data _null_;
Set table850k;
if _n_=&n. THEN CALL SYMPUT(values for the "n" observation);
run;


data calc;
set table850k;
dist=
(
(geo1-&geo1.)**2+
(geo2-&geo2.)**2+
(geo3-&geo3.)**2+
(geo4-&geo4.)**2+
(geo5-&geo5.)**2+
(geo6-&geo6.)**2+
(geo7-&geo7.)**2+
(geo8-&geo8.)**2+
(geo9-&geo9.)**2+
(geo10-&geo10.)**2+
(geo11-&geo11.)**2+
(geo12-&geo12.)**2
)**0.5
;
run;

proc sort data=calc;
by dist;
run;

 

data calc2;
set calc;
if _n_ le 1000;
run;

 

PROC means DATA=calc2 NOPRINT ;
VAR raroc;
weight ce;
OUTPUT OUT=t3 N=N MEAN(raroc)=MEANr STD(raroc)=STDr p10(raroc)=p10r p25(raroc)=p25r p50(raroc)=p50r p75(raroc)=p75r p90(raroc)=p90r;
RUN;


DATA T3;
SET T3;
caso=&caso.;
raroc=&raroc.;
RUN;


DATA p.tacum3;
SET p.tacum3 T3;
RUN;

%end;

%mend;

 

 

This works perfectly, but it takes too long, basically because it is reading the 850k table in each iteration, and I have to do 850k iterations.

If the entire calculation took 1 second per observation, I would need 10 days to complete the job. And it actually takes 3 seconds… so I need help.

 

So the question is: is there any way to do this faster? I guess I have to rewrite the code to do this in memory, without physically reading the big table… but how?

 

Thanks!

 

 

 

11 REPLIES
Astounding
PROC Star

Without knowing the calculations, just one part stands out.  The first DATA step could be revised:

 

data _null_;
set table850k (firstobs=&n obs=&n);
call symput(values for this observation);
run;
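
To make the SYMPUT placeholder concrete, here is a hedged sketch of that step. It assumes caso and raroc are variables on table850k, as the later steps of the original macro imply; CALL SYMPUTX strips blanks, so the macro variables can be used directly in the distance formula.

data _null_;
set table850k (firstobs=&n obs=&n);
array geo{12} geo1-geo12;
/* create &geo1 ... &geo12 from the reference observation */
do j = 1 to 12;
   call symputx(cats('geo', j), geo{j});
end;
call symputx('caso', caso);
call symputx('raroc', raroc);
run;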

 

But that won't be enough savings.  You might have to share some of the details of the code.

 

Good luck.

hernan7800
Fluorite | Level 6

Thanks!

I've just added more detail to the post.

Thanks again!

Reeza
Super User
I would use a temporary array. Or a hash table. It's hard to suggest optimization beyond that without the code and/or sample data.
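
To illustrate the temporary-array idea, here is a minimal sketch (not Reeza's code). Only table850k and geo1-geo12 come from the original post; the output name nearest1 and the other variable names are made up, and it is simplified to report just the single nearest neighbor of each observation (keeping the 1000 nearest would need a running top-K list). The coordinates are loaded into RAM once (roughly 80 MB for 850K x 12 doubles), so the quadratic distance work no longer re-reads the table from disk.

data nearest1;
array gx{850000,12} _temporary_;   /* every coordinate, held in memory */
array geo{12} geo1-geo12;

/* load pass: executes only on the first iteration of the data step */
if _n_ = 1 then do i = 1 to n;
   set table850k point=i nobs=n;
   do j = 1 to 12;
      gx{i,j} = geo{j};
   end;
end;

set table850k;                     /* current reference observation */
refobs  = _n_;
mindist = .;
nearid  = .;

/* distance from the reference row to every in-memory row */
do i = 1 to n;
   if i = refobs then continue;    /* skip the self-comparison */
   d = 0;
   do j = 1 to 12;
      d + (gx{i,j} - geo{j})**2;
   end;
   d = sqrt(d);
   if mindist = . or d < mindist then do;
      mindist = d;
      nearid  = i;
   end;
end;

keep refobs nearid mindist;
run;
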
hernan7800
Fluorite | Level 6
Thanks!
I've just added more detail to the post.
Many thanks
ballardw
Super User

Does the point you are using for the reference location change 850K times? If not, then you are doing way too many iterations.

JoshB
Quartz | Level 8

This looks like a standard Euclidean distance? Have you checked out proc distance? You can do Euclidean distance with or without standardization.

 

That being said, if you wanted to write something yourself, I would also look at the hash table method suggested above to do this in memory, or IML if you have access to that. If you think about these distances being calculated as entries in an 850000 x 850000 matrix, it will be symmetric (the distance between 1 and 2 is the same as the distance between 2 and 1), so you may be doing many more comparisons than are actually necessary. You should have sufficient distance information in one half of that hypothetical distance matrix.
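
For reference, a minimal PROC DISTANCE call for this kind of data might look like the sketch below. The obsid variable is a hypothetical unique row identifier (the original table does not show one), and at 850K observations the full distance matrix would be far too large to materialize, so in practice this is only realistic on subsets.

proc distance data=table850k method=euclid out=distmat;
var interval(geo1-geo12);   /* geo1-geo12 treated as interval-scaled */
id obsid;                   /* hypothetical unique row identifier */
run;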

hernan7800
Fluorite | Level 6

Ok, I will investigate hash tables. I don't even know what they are.

Thanks!
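
For orientation, the hash object the other replies refer to is an in-memory lookup table. A minimal sketch of the load-and-iterate pattern, assuming a hypothetical unique row identifier obsid on table850k (the distance logic itself is only indicated by a comment):

data _null_;
if 0 then set table850k;              /* make the table's variables known to the step */
declare hash h(dataset:'table850k');  /* load the whole table into memory */
h.defineKey('obsid');                 /* hypothetical unique row identifier */
h.defineData(all:'yes');              /* keep every variable as hash data */
h.defineDone();
declare hiter it('h');

/* walk every in-memory row; the distance calculation would go here */
rc = it.first();
do while (rc = 0);
   /* geo1-geo12 now hold the values of the current in-memory row */
   rc = it.next();
end;
stop;
run;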

PGStats
Opal | Level 21

In theory, you can get the 1000 closest neighbors of every observation efficiently with proc modeclus:

 

ods _all_ close;
proc modeclus data=table850k dk=1001 neighbor;
var geo1-geo12;
id myObsId;
ods output Neighbor=myNeighbors;
run;
ods listing;

 

...in practice, it might run into memory space problems. I used it successfully for finding the 5 nearest neighbors among 200,000 positions defined by two variables.

PG
hernan7800
Fluorite | Level 6
Thanks, this is better code, but it'd still take too long to iterate 850K times...
Astounding
PROC Star

Given the detail you added, this would be a significant improvement for the final step:

 

proc append data=T3 base=p.tacum3;

run;

 

The savings become larger and larger as you iterate through the loop 850K times, so you may not notice as much in the way of savings if you test with a small number of iterations.

 

There may be significant value in some of the other approaches mentioned by other posters ... I'm more focused on your original code.  Good luck.

hernan7800
Fluorite | Level 6
Thanks, that will help.
