Matching payments to people

Marilyn · Posted 11-15-2011 05:50 PM

Does anyone have a good idea on how to match an identifying field from one table to another table where the number may match any of 3 - 5 different identifying fields? I tried using Proc SQL with several OR filters, but it is running forever. I am already having to filter on a location field and date field as well and the table is very large They may also have multiple payments from that second table

art297 · Posted 11-15-2011 05:54 PM

Your question will be easier to answer if you indicate:

1. How large (# records) is each file

2. Are there multiple records, in each file, for given identifying fields?

3. Sample data sets for both files

4. The file you want as a result of merging the two files and

5. The code you've already tried

Marilyn · Posted 06-08-2012 06:40 PM

I am using a pass thru SQL query. The one file has 5 different identifying fields for individual, spouse, children, etc. The payment file I am trying to match to has only the one identifying field but it can match to any of the 5 identifying fields in the other table. I don't have the luxury of trying to get any of the data tables changed.

The payment file has at least 20 million records and the other file has about 500k records...

Peter_C · Posted 06-08-2012 06:59 PM

try an SQL join with a where clause like

where find( '/'!! catx('/', b.name1, b.name2, b.name3, b.name4, b.name5)!!'/', '/'!! trim(a.name)!!'/', 'i' )

the syntax seeks to find NAME in file A anywhere in the concatenation of the 5 NAMEs in file B

The / delimiter is designed to protect against A.NAME matching a combination or substring of the B.names

The 'i' modifier is intended to make the search insensitive to case

I haven't tested this so be cautious

peterC

LinusH · Posted 11-16-2011 05:05 AM

In what format is your data stored?

You could consider moving it to SPDE, since it can better optimize OR filters together with indexes (as an addition to Arts wish list, how/are your tables indexed?).

It seems likely that your data needs restructuring.

/Linus

Data never sleeps

Peter_C · Posted 11-16-2011 06:31 AM

weight each "join" and use the total of these to indicate the likelihood of the match

ballardw · Posted 11-16-2011 02:03 PM

Without example code it is hard to optimize.

First thing is too make sure you are using things like LEFT JOIN instead of simple JOIN.

Ksharp · Posted 06-10-2012 09:59 PM

Without sample data and the output you want, it is hard to give you some advice.

Alpay · Posted 06-11-2012 03:52 AM

You may try collecting different id variables under one variable. 500K obs would go up to 2500K obs (500 x 5).

proc sql;

connect to odbc(dsn=<yourdsnname>);

create table x as

select *

from connection to odbc(

select a.*

from largetable a

inner join

(select id1 id from smalltable

UNION ALL

select id2 id from smalltable

UNION ALL

select id3 id from smalltable

UNION ALL

select id4 id from smalltable

UNION ALL

select id5 id from smalltable

) b on a.id = b.id

);

disconnect from odbc;

quit;