BookmarkSubscribeRSS Feed
Marilyn
Calcite | Level 5

Does anyone have a good idea on how to match an identifying field from one table to another table where the number may match any of 3 - 5 different identifying fields?  I tried using Proc SQL with several OR filters, but it is running forever.  I am already having to filter on a location field and date field as well and the table is very large Smiley Sad  They may also have multiple payments from that second table Smiley Sad

8 REPLIES 8
art297
Opal | Level 21

Your question will be easier to answer if you indicate:

1. How large (# records) is each file

2. Are there multiple records, in each file, for given identifying fields?

3. Sample data sets for both files

4. The file you want as a result of merging the two files and

5. The code you've already tried

Marilyn
Calcite | Level 5


I am using a pass thru SQL query. The one file has 5 different identifying fields for individual, spouse, children, etc.  The payment file I am trying to match to has only the one identifying field but it can match to any of the 5 identifying fields in the other table.  I don't have the luxury of trying to get any of the data tables changed.

The payment file has at least 20 million records and the other file has about 500k records...

Peter_C
Rhodochrosite | Level 12

try an SQL join with a where clause like

where find( '/'!! catx('/', b.name1, b.name2, b.name3, b.name4, b.name5)!!'/', '/'!! trim(a.name)!!'/', 'i' )

the syntax seeks to find NAME in file A anywhere in the concatenation of the 5 NAMEs in file B

The / delimiter is designed to protect against A.NAME matching a combination or substring of the B.names

The 'i' modifier is intended to make the search insensitive to case

I haven't tested this so be cautious

peterC

LinusH
Tourmaline | Level 20

In what format is your data stored?

You could consider moving it to SPDE, since it can better optimize OR filters together with indexes (as an addition to Arts wish list, how/are your tables indexed?).

It seems likely that your data needs restructuring.

/Linus

Data never sleeps
Peter_C
Rhodochrosite | Level 12

weight each "join" and use the total of these to indicate the likelihood of the match

ballardw
Super User

Without example code it is hard to optimize.

First thing is too make sure you are using things like LEFT JOIN instead of simple JOIN.

Ksharp
Super User

Without sample data and the output you want, it is hard to give you some advice.

Alpay
Fluorite | Level 6

You may try collecting different id variables under one variable. 500K obs would go up to 2500K obs (500 x 5).

proc sql;

    connect to odbc(dsn=<yourdsnname>);

    create table x as

    select *

    from connection to odbc(

        select a.*

        from largetable a

        inner join

        (select id1 id from smalltable

         UNION ALL

         select id2 id from smalltable

         UNION ALL

         select id3 id from smalltable

         UNION ALL

         select id4 id from smalltable

         UNION ALL

         select id5 id from smalltable

        ) b on a.id = b.id

    );

    disconnect from odbc;

quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 976 views
  • 0 likes
  • 7 in conversation