DATA Step, Macro, Functions and more

How to join the 2 very large datasets with duplicate id keys in both the datasets

Reply
Contributor
Posts: 46

How to join the 2 very large datasets with duplicate id keys in both the datasets

How to join the 2 very large datasets with duplicate id keys in both the datasets

 

For example, I have the following datasets

 

table1 :

 

acct_no    a                  b                  c                  d

a1                4                  7                  2                  8

a3                5                  32                8                  12

a5                42                12                54                65

a1                5                  2                  17                23

 

acct_no is the key with duplicates because it is a transaction dataset.

 

table 2:

acct_no    card_no

a4                54

a3                31

a4                43

a1                24

a8                12

a7                23

a8                45

 

acct_no in table 2 also has duplicates because the logic is that a single customer can have more than one card, one a primary card and the other being secondary card

 

Can anybody help me with the logic for the join?

 

The datasets contain 400 million records and are pretty wide and long, however i need to fetch only the card_no from table2, so table 1 would be my left table. Sorting will not help either

 

Super User
Posts: 19,780

Re: How to join the 2 very large datasets with duplicate id keys in both the datasets

Posted in reply to Allaluiah

You should post sample output to illustrate your logic. For example how do you want to deal with duplicate card no in table 2? Bring both/all in? 

Super User
Posts: 5,426

Re: How to join the 2 very large datasets with duplicate id keys in both the datasets

Posted in reply to Allaluiah

First, it seems that your database is not "properly" modelled. You should have a separate account table, and a separate card table.

Then you are talking about the primary card, but there isn't such attribute in your example.

To be able to join those tables in a proper way, you need a business rule, which should come from you (or your business stake holders).

My suggestion with the information at hand would be to ignore card no, since it seems you can't connect any transactions to an individual card. This leads to do a select distinct account no (proc sort nodupkey if you have other attributes not shared with us), prior to the join with transaction table.

Presuming that there are other attributes on the account/card table, otherwise a join would seem meaningless.

Data never sleeps
Contributor
Posts: 46

Re: How to join the 2 very large datasets with duplicate id keys in both the datasets

@LinusH@Ksharp@Tom@Reeza First off, I apologise for the mistake of not getting the question right. Actually all of you were right and Linus hit the nail on the head. The approach or the design of the business stakeholders wasn't right and my correspondence with them also has been at fault, which led to poor understanding. 

 

I do understand even if the data is as is, the acct-no key will not get the correct card_no unless I have another flag to verify if it's primary or secondary to get the right match.

 

Anyways, leaving aside the business. In general, whenever i tried datastep merge with very large datasets that has duplicates in "both tables", it seems to run for ages with the sort requirement as pre requisite makes it even worse. So, is there a better alternative or just proc sql? I would like to close this thread with something learnt than nothing. Sorry for wasting your time earlier, my bad!

Super User
Posts: 19,780

Re: How to join the 2 very large datasets with duplicate id keys in both the datasets

Posted in reply to Allaluiah
If you have multiples in both files and the files are large it's going to be a big table.
Say table 2 has 8 records for one ID and Table 1 has 4, then the output will have 8*4=32 just for that one ID. So if you have a lot of duplicates your table size will be large and it will take a long time. Check your analysis process and see if you can strip the duplicates before merging somehow. This may mean transforming the data somehow.
Super User
Posts: 5,426

Re: How to join the 2 very large datasets with duplicate id keys in both the datasets

When working with normalized key structures, I'll favour SQL, since it's easier to predict what's gonna happen. And it was adopted to query these kind of structures.

But, it seems that you don' have a truly normalized structure at this point, which leads to the question - where do you get your data from? Hopefully, there is a DW at you site that stores data in a more structured way.

 

It's also good to know it this is an ad-hoc query, or something that will done regularly? If thet latter, consider building a robust table structure (a.k.a. data model) that can handle future requests without reinventing the wheel. In such permanent structure (dimensional model perhaps?) you index your join keys (acc_no in this case) to speed up join performance. You can also speed up thins by storing data in a proper engine. If you use SAS, SPDE is faster than the Base engine in both update (index maintenance) and query.

Data never sleeps
Super User
Super User
Posts: 7,039

Re: How to join the 2 very large datasets with duplicate id keys in both the datasets

Posted in reply to Allaluiah

Can you explain more how you want to use the data?

If there is a variable that indicates if the record is primary or secondary then perhaps you can model the merge as 1-to-1-to-many.

data want ;
   merge table2(where=(type='PRIMARY') keep=acct_no card_no type rename=(card_no=primary_card))
         table2(where=(type='SECONDARY') keep=acct_no card_no type rename=(card_no=secondary_card))
        table1 
  ;
  by acct_no;
run;

Or do you need all of the "transactions" or just the end result of applying them?  if so then perhaps you can apply them and reduce table1 to one record per account.

Super User
Posts: 10,023

Re: How to join the 2 very large datasets with duplicate id keys in both the datasets

Posted in reply to Allaluiah

The point is what kind of output table you would like to see ? You didn't post the output yet .

I noticed in table1 there are  two acct_no =a1 , if there are also  two acct_no =a1 in table2 ?

 

My reckoning is to use Hash Table, since you only want card_no from table2 .

Ask a Question
Discussion stats
  • 7 replies
  • 285 views
  • 0 likes
  • 5 in conversation