12-27-2015 03:04 PM
How to join two very large datasets with duplicate id keys in both datasets
For example, I have the following dataset (table 1):

acct_no   a    b    c    d
a1        4    7    2    8
a3        5   32    8   12
a5       42   12   54   65
a1        5    2   17   23
acct_no is the key, with duplicates, because it is a transaction dataset.
acct_no in table 2 also has duplicates because a single customer can have more than one card: one primary card and one or more secondary cards.
Can anybody help me with the logic for the join?
The datasets contain 400 million records and are pretty wide and long; however, I need to fetch only the card_no from table2, so table 1 would be my left table. Sorting will not help either.
12-27-2015 04:58 PM
You should post sample output to illustrate your logic. For example, how do you want to deal with duplicate card numbers in table 2? Bring both/all in?
12-27-2015 05:05 PM
First, it seems that your database is not "properly" modelled. You should have a separate account table, and a separate card table.
Then you are talking about a primary card, but there is no such attribute in your example.
To be able to join those tables in a proper way, you need a business rule, which should come from you (or your business stake holders).
My suggestion with the information at hand would be to ignore card no, since it seems you can't connect any transactions to an individual card. This leads to a select distinct on account no (or proc sort nodupkey if you have other attributes not shared with us), prior to the join with the transaction table.
This presumes that there are other attributes on the account/card table; otherwise the join would seem meaningless.
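The deduplicate-then-join idea above can be sketched roughly like this (a minimal sketch, assuming table2 contains acct_no and card_no; note that with nodupkey, which card row survives per account is arbitrary unless the data are first sorted by some meaningful flag):

```sas
/* Keep one row per account; which card survives is arbitrary here */
proc sort data=table2(keep=acct_no card_no) out=accts nodupkey;
   by acct_no;
run;

/* Left join the deduplicated accounts onto the transaction table */
proc sql;
   create table want as
   select t1.*, a.card_no
   from table1 t1
   left join accts a
     on t1.acct_no = a.acct_no;
quit;
```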
12-28-2015 11:42 AM
@LinusH, @Ksharp, @Tom, @Reeza First off, I apologise for not getting the question right. Actually, all of you were right, and Linus hit the nail on the head. The design from the business stakeholders wasn't right, and my correspondence with them has also been at fault, which led to poor understanding.
I do understand that even if the data stays as is, the acct_no key will not fetch the correct card_no unless I have another flag to verify whether it's primary or secondary to get the right match.
Anyway, leaving aside the business logic: in general, whenever I have tried a data step merge with very large datasets that have duplicates in both tables, it seems to run for ages, and the sort prerequisite makes it even worse. So, is there a better alternative, or just proc sql? I would like to close this thread having learnt something rather than nothing. Sorry for wasting your time earlier, my bad!
12-28-2015 12:36 PM
When working with normalized key structures, I'll favour SQL, since it's easier to predict what's gonna happen. And it was designed to query these kinds of structures.
But it seems that you don't have a truly normalized structure at this point, which leads to the question - where do you get your data from? Hopefully there is a DW at your site that stores data in a more structured way.
It's also good to know if this is an ad-hoc query, or something that will be done regularly. If the latter, consider building a robust table structure (a.k.a. data model) that can handle future requests without reinventing the wheel. In such a permanent structure (dimensional model perhaps?) you index your join keys (acct_no in this case) to speed up join performance. You can also speed things up by storing data in a proper engine. If you use SAS, SPDE is faster than the Base engine in both update (index maintenance) and query.
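As a rough illustration of the indexing/SPDE idea (the library path and dataset names are placeholders, not from the thread):

```sas
/* SPDE engine library; the path is a placeholder */
libname perm spde '/data/perm';

/* Store the card table in SPDE and index the join key */
data perm.table2(index=(acct_no));
   set work.table2;
run;

/* Subsequent joins on acct_no can now exploit the index */
proc sql;
   create table want as
   select t1.*, t2.card_no
   from work.table1 t1
   left join perm.table2 t2
     on t1.acct_no = t2.acct_no;
quit;
```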
12-27-2015 06:56 PM
Can you explain more how you want to use the data?
If there is a variable that indicates if the record is primary or secondary then perhaps you can model the merge as 1-to-1-to-many.
data want;
   merge table2(where=(type='PRIMARY')
                keep=acct_no card_no type
                rename=(card_no=primary_card))
         table2(where=(type='SECONDARY')
                keep=acct_no card_no type
                rename=(card_no=secondary_card))
         table1;
   by acct_no;
run;
Or do you need all of the "transactions", or just the end result of applying them? If the latter, perhaps you can apply them and reduce table1 to one record per account.
12-27-2015 07:59 PM
The point is what kind of output table you would like to see? You didn't post the output yet.
I noticed that in table1 there are two records with acct_no=a1; are there also two acct_no=a1 records in table2?
My reckoning is to use a hash table, since you only want card_no from table2.
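A minimal hash-object sketch of that idea (assuming table2 has acct_no and card_no; multidata:'y' is needed because acct_no is duplicated in table2). Neither input needs to be sorted, but table2's key/data pairs must fit in memory:

```sas
data want;
   if _n_ = 1 then do;
      /* Define acct_no and card_no in the PDV without reading any rows */
      if 0 then set table2(keep=acct_no card_no);
      /* Load table2's key/data pairs into memory once */
      declare hash h(dataset:'table2(keep=acct_no card_no)', multidata:'y');
      h.defineKey('acct_no');
      h.defineData('card_no');
      h.defineDone();
   end;
   set table1;   /* no sort required on either table */
   if h.find() ne 0 then call missing(card_no);
   /* To output every matching card per account, loop with h.find_next() */
run;
```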