Hello everybody! I have a table with 850k obs and 20 variables and I need to do this: For each observation, calculate the distance (one I have defined) between the observation and the rest of the observations. Keep the subset of 1K closest neighbors. Do some calculations. Repeat this for each observation, that is 850K times. To do this I wrote a program using a macro: %macro d; %do n=1 %to 850000; Data _null_: Set table850k; if _n_=&n. THEN CALL SYMPUT(values for the “n” observation); run; data calc; set table850k; dist= ( (geo1-&geo1.)**2+ (geo2-&geo2.)**2+ (geo3-&geo3.)**2+ (geo4-&geo4.)**2+ (geo5-&geo5.)**2+ (geo6-&geo6.)**2+ (geo7-&geo7.)**2+ (geo8-&geo8.)**2+ (geo9-&geo9.)**2+ (geo10-&geo10.)**2+ (geo11-&geo11.)**2+ (geo12-&geo12.)**2 )**0.5 ; run; proc sort; by dist; run; data calc2; set calc; if _n_ le 1000; run; PROC means DATA=calc2 NOPRINT ; VAR raroc; weight ce; OUTPUT OUT=t3 N=N MEAN(raroc)=MEANr STD(raroc)=STDr p10(raroc)=p10r p25(raroc)=p25r p50(raroc)=p50r p75(raroc)=p75r p90(raroc)=p90r; RUN; DATA T3; SET T3; caso=&caso.; raroc=&raroc.; RUN; DATA p.tacum3; SET p.tacum3 T3; RUN; %end; %mend; This works perfectly, but it takes too long, basically because it is reading the 850k table in each iteration, and I have to do 850k iterations. If the entire calculation would take 1 second for each obs, I would need 10 days to complete the job. And It takes 3 seconds… so I need help. So the questions is: Is there any way to do this faster? I guess I have to rewrite the code doing this in memory with not physical reading of the big table…but how! Thanks!
... View more