10-12-2016 07:18 AM
I have a few very very large tables (10M rows+) and I want an efficient way to perform lookup and check if every observation in one table can correspond to another table.
Table 1: ID1 ID2 A B C
Table 2: ID1 D E F
Table 3: ID2 G H K
I.e. I want to see if for every ID1 in Table 2 is in Table 1, and for every ID2 in Table 3 is in Table 1
I tried using SQL method of lookup but it is not doable in my computer. Anyone have any ideas of how to do this very efficiently?
10-12-2016 08:42 AM
proc sort data=table1 (keep=id1) out=lu_1_1; by id1; run; proc sort data=table1 (keep=id2) out=lu_1_2; by id2; run; proc sort data=table2 (keep=id1) out=lu2; by id1; run; proc sort data=table3 (keep=id2) out=lu3; by id2; run; data miss1; merge lu2 (in=a) lu_1_1 (in=b) ; by id1; if a and not b; run; data miss2; merge lu3 (in=a) lu_1_2 (in=b) ; by id2; if a and not b; run;
By using only the id variables in the sorts, you improve the performance of those steps.
As long as the tables fit into memory, using a hash object is of course another nice method.
10-12-2016 09:25 AM
I suspect that a solution that sorts and merges will be somewhat slow. Hash tables would work, but so would formats. And it's easier to program (in my opinion). However, there are a few issues to consider before the programming begins.
Does Table 1 contain any duplicate values for either ID1 or ID2?
Does it contain any missing values for either ID1 or ID2? (If so, would that be considered a match if missing values appear in the other tables?)
10-12-2016 09:44 AM
So what you basically want to achieve is that the larger table gets processed without any sorting and that the look-up to the small table happens ideally in-memory (hashing).
One way to go is a data step with a hash table. I believe you also get hashing if you code your SQL accordingly (there are some papers around this).
Are the lookup tables small enough to fit into memory?
What do you want to do if we find a record in the large table without a match to the lookup table? Do we have to write back something to the large table (update)?
If some sort of update is required then consider to do this in-place (using the modify or update statement) so we don't have to re-write the whole large base table.
10-12-2016 09:48 AM
1) hash table
create table intersect as
select id1 from table1
select id1 from table2;
Table intersect contains all the ID are in both table1 and table2 .
10-12-2016 10:28 AM
If the IDS repeat then look at DISTINCT to reduce the size of the problem.
If you really want to know which ones don't match then use EXCEPT;
proc sql; create table In1NotIn2 as select distinct Id from Table1 except select distinct id from Table2; Create table In2NotIn1 as elect distinct id from Table2 except select distinct Id from Table1; quit;