DATA Step, Macro, Functions and more

Fastest method to find some records in a dataset

Reply
N/A
Posts: 0

Fastest method to find some records in a dataset

Hi all,
can anybody tell me what is the fastest method to find some records from a dataset in another dataset?
I explain better: I have two dataset, the first one has 6000 records and differents columns; the second one has 500 records and just one column with the ID.
I need to select in the first dataset the records with the ID from the second dataset. I have tried with merge and inner join (the where in (,,,) woluld be crazy), but the program runs to slowly, around 4 minutes, then i'd want to know if there is another faster method to do this.

Thanks in advance,
Elena
Super User
Posts: 19,876

Re: Fastest method to find some records in a dataset

Posted in reply to deleted_user
try something like

proc sql;
create table want as
select * from have1
where id in (select id from have2);
quit;

I have no idea if that will be faster....but unless you have hundreds of text columns I'm not sure why it would take 4 minutes.
Valued Guide
Posts: 634

Re: Fastest method to find some records in a dataset

Posted in reply to deleted_user
For data sets of your size (assuming a reasonable number of columns) the SQL step ought to do the trick. If the SQL look-up is too slow, there are other alternatives that generally require more coding, but can usually process quicker. The DATA Step Components (HASH) Objects are about the fastest. An overview of look-up techniques can be found at:
http://caloxy.com/papers/43-i_how_table_lookups_from_ift.pdf
Super User
Posts: 10,046

Re: Fastest method to find some records in a dataset

Posted in reply to deleted_user
Hi.
I think you should first index these two dataset by using proc sort.then that wil be faster.
And if you do not want sort.;
there are some code i copy from somewhere.Wish will help you .

[pre]
proc sort data=small(keep=id) nodupkey force;*small is your second dataset;
by id;
run;

data fmt(rename=(id=start));
retain fmtname 'key'
type 'N' *'N' means numerical type format,if id is character type then use 'C';
label 'Y';
set small end=eof;
output;
if eof then do;
HLO='o';
label='N';
output;
end;

proc format cntlin=fmt;
run;

data matched;
set test; *test is your first dataset;
where put(id,key.)='Y';
run;
[/pre]

Ksharp

Message was edited by: Ksharp
Ask a Question
Discussion stats
  • 3 replies
  • 131 views
  • 0 likes
  • 4 in conversation