Does anyone have a good idea on how to match an identifying field from one table to another table where the number may match any of 3 - 5 different identifying fields? I tried using Proc SQL with several OR filters, but it is running forever. I am already having to filter on a location field and date field as well and the table is very large They may also have multiple payments from that second table
Your question will be easier to answer if you indicate:
1. How large (# records) is each file
2. Are there multiple records, in each file, for given identifying fields?
3. Sample data sets for both files
4. The file you want as a result of merging the two files and
5. The code you've already tried
I am using a pass thru SQL query. The one file has 5 different identifying fields for individual, spouse, children, etc. The payment file I am trying to match to has only the one identifying field but it can match to any of the 5 identifying fields in the other table. I don't have the luxury of trying to get any of the data tables changed.
The payment file has at least 20 million records and the other file has about 500k records...
try an SQL join with a where clause like
where find( '/'!! catx('/', b.name1, b.name2, b.name3, b.name4, b.name5)!!'/', '/'!! trim(a.name)!!'/', 'i' )
the syntax seeks to find NAME in file A anywhere in the concatenation of the 5 NAMEs in file B
The / delimiter is designed to protect against A.NAME matching a combination or substring of the B.names
The 'i' modifier is intended to make the search insensitive to case
I haven't tested this so be cautious
peterC
In what format is your data stored?
You could consider moving it to SPDE, since it can better optimize OR filters together with indexes (as an addition to Arts wish list, how/are your tables indexed?).
It seems likely that your data needs restructuring.
/Linus
weight each "join" and use the total of these to indicate the likelihood of the match
Without example code it is hard to optimize.
First thing is too make sure you are using things like LEFT JOIN instead of simple JOIN.
Without sample data and the output you want, it is hard to give you some advice.
You may try collecting different id variables under one variable. 500K obs would go up to 2500K obs (500 x 5).
proc sql;
connect to odbc(dsn=<yourdsnname>);
create table x as
select *
from connection to odbc(
select a.*
from largetable a
inner join
(select id1 id from smalltable
UNION ALL
select id2 id from smalltable
UNION ALL
select id3 id from smalltable
UNION ALL
select id4 id from smalltable
UNION ALL
select id5 id from smalltable
) b on a.id = b.id
);
disconnect from odbc;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.