How to do efficient lookup to check if one observation in one table is in another table?

Reply
New Contributor
Posts: 2

How to do efficient lookup to check if one observation in one table is in another table?

I have a few very very large tables (10M rows+) and I want an efficient way to perform lookup and check if every observation in one table can correspond to another table.

 

Examples:

Table 1: ID1 ID2 A B C

Table 2: ID1 D E F

Table 3: ID2 G H K

 

I.e. I want to see if for every ID1 in Table 2 is in Table 1, and for every ID2 in Table 3 is in Table 1

 

I tried using SQL method of lookup but it is not doable in my computer. Anyone have any ideas of how to do this very efficiently?

Valued Guide
Posts: 518

Re: How to do efficient lookup to check if one observation in one table is in another table?

Have you checked out the data step HASH object? Smiley Happy

Esteemed Advisor
Posts: 6,698

Re: How to do efficient lookup to check if one observation in one table is in another table?

proc sort data=table1 (keep=id1) out=lu_1_1;
by id1;
run;

proc sort data=table1 (keep=id2) out=lu_1_2;
by id2;
run;

proc sort data=table2 (keep=id1) out=lu2;
by id1;
run;

proc sort data=table3 (keep=id2) out=lu3;
by id2;
run;

data miss1;
merge
  lu2 (in=a)
  lu_1_1 (in=b)
;
by id1;
if a and not b;
run;

data miss2;
merge
  lu3 (in=a)
  lu_1_2 (in=b)
;
by id2;
if a and not b;
run;

By using only the id variables in the sorts, you improve the performance of those steps.

 

As long as the tables fit into memory, using a hash object is of course another nice method.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Respected Advisor
Posts: 4,998

Re: How to do efficient lookup to check if one observation in one table is in another table?

I suspect that a solution that sorts and merges will be somewhat slow.  Hash tables would work, but so would formats.  And it's easier to program (in my opinion).  However, there are a few issues to consider before the programming begins.

 

Does Table 1 contain any duplicate values for either ID1 or ID2? 

 

Does it contain any missing values for either ID1 or ID2?  (If so, would that be considered a match if missing values appear in the other tables?)

New Contributor
Posts: 2

Re: How to do efficient lookup to check if one observation in one table is in another table?

ID1 should be unique in Table1, but ID2 is not. There are missing values, and it should not appear as match..

Respected Advisor
Posts: 3,840

Re: How to do efficient lookup to check if one observation in one table is in another table?

So what you basically want to achieve is that the larger table gets processed without any sorting and that the look-up to the small table happens ideally in-memory (hashing).

 

One way to go is a data step with a hash table. I believe you also get hashing if you code your SQL accordingly (there are some papers around this).

 

Are the lookup tables small enough to fit into memory?

 

What do you want to do if we find a record in the large table without a match to the lookup table? Do we have to write back something to the large table (update)?

If some sort of update is required then consider to do this in-place (using the modify or update statement) so we don't have to re-write the whole large base table.

 

 

 

Grand Advisor
Posts: 9,593

Re: How to do efficient lookup to check if one observation in one table is in another table?

1) hash table

 

2) sql

 

proc sql;

create table intersect  as

 select id1 from table1

intersect

 select id1 from table2;

quit;

 

 

Table intersect contains all the ID are in both table1 and table2 .

Grand Advisor
Posts: 10,239

Re: How to do efficient lookup to check if one observation in one table is in another table?

If the IDS repeat then look at DISTINCT to reduce the size of the problem.

If you really want to know which ones don't match then use EXCEPT;

 

Something like

proc sql;
   create table In1NotIn2 as
   select distinct Id from Table1
   except 
   select distinct id from Table2;
   Create table In2NotIn1 as
   elect distinct id from Table2
   except
   select distinct Id from Table1;
quit;
Ask a Question
Discussion stats
  • 7 replies
  • 359 views
  • 1 like
  • 7 in conversation