topic Re: Faster way to do this in data step? in SAS Procedures

Faster way to do this in data step?

deleted_user — Wed, 12 Aug 2009 13:59:59 GMT

Sorry for the noobish question. I'm a sql background trying to get a handle on SAS...

So I've been adding a variable based on matching observations to a second table...

proc sql;
create table matching as
select a.*, b.id_final_new as match_id
from people a left join responders b
on a.id_one=b.final_id_one and a.id_two=b.final_id_two;
quit;

This takes forever! Can I do the same in a quick data step?

Re: Faster way to do this in data step?

LAP — Wed, 12 Aug 2009 15:34:52 GMT

I have found similiar problems with SQL joins of large data sets. While the data step is usually faster, merging requires that the joined data be presorted. I have have began experimenting using Hash tables in the data step to join records. While I'm not comfortable fully explaining them, they have provided the performance enhancements that I wanted. There are many good SAS Global forum papers that explain and give examples. Just search hash tables in tech support search.

Here's one paper....

http://www2.sas.com/proceedings/forum2008/029-2008.pdf

Re: Faster way to do this in data step?

deleted_user — Wed, 12 Aug 2009 20:03:05 GMT

Thanks, looks like I've got a lot of reading to do...

Re: Faster way to do this in data step?

Oleg_L — Thu, 13 Aug 2009 04:42:06 GMT

You can test this code. If dataset people is not very large then should work very fast.

data responders(keep=id_one id_two id_final index=(cr1=(id_one id_two)));
set responders(rename=(final_id_one=id_one final_id_two=id_two));
run;

data matching;
set people;
set responders key=cr1/unique;
select(_iorc_);
when(%sysrc(_sok)) do; end;
when(%sysrc(_dsenom)) do; id_final= . /* if id_final is char then should be id_final=''*/; _error_ = 0; end;
otherwise do;
put 'ERROR_: Unexpected value for _IORC_= ' _iorc_ ' Program terminating. Data set accessed is responders';
put _all_; _error_ = 0; stop; end;
end;
run;

Re: Faster way to do this in data step?

ChrisNZ — Thu, 13 Aug 2009 05:31:00 GMT

You could investigate using a hash table in a data step.
While the syntax is unusual at first glance, it is quite easy to grasp.
Hash tables are fast as they are loaded in memory, and don't require any prior sorting.
The pseudo code looks like:
data matching ;
if _N_=1 then load hash table;
set base table;
find match;
run;