BookmarkSubscribeRSS Feed
Allaluiah
Quartz | Level 8

How to join the 2 very large datasets with duplicate id keys in both the datasets

 

For example, I have the following datasets

 

table1 :

 

acct_no    a                  b                  c                  d

a1                4                  7                  2                  8

a3                5                  32                8                  12

a5                42                12                54                65

a1                5                  2                  17                23

 

acct_no is the key with duplicates because it is a transaction dataset.

 

table 2:

acct_no    card_no

a4                54

a3                31

a4                43

a1                24

a8                12

a7                23

a8                45

 

acct_no in table 2 also has duplicates because the logic is that a single customer can have more than one card, one a primary card and the other being secondary card

 

Can anybody help me with the logic for the join?

 

The datasets contain 400 million records and are pretty wide and long, however i need to fetch only the card_no from table2, so table 1 would be my left table. Sorting will not help either

 

7 REPLIES 7
Reeza
Super User

You should post sample output to illustrate your logic. For example how do you want to deal with duplicate card no in table 2? Bring both/all in? 

LinusH
Tourmaline | Level 20

First, it seems that your database is not "properly" modelled. You should have a separate account table, and a separate card table.

Then you are talking about the primary card, but there isn't such attribute in your example.

To be able to join those tables in a proper way, you need a business rule, which should come from you (or your business stake holders).

My suggestion with the information at hand would be to ignore card no, since it seems you can't connect any transactions to an individual card. This leads to do a select distinct account no (proc sort nodupkey if you have other attributes not shared with us), prior to the join with transaction table.

Presuming that there are other attributes on the account/card table, otherwise a join would seem meaningless.

Data never sleeps
Allaluiah
Quartz | Level 8

@LinusH@Ksharp@Tom@Reeza First off, I apologise for the mistake of not getting the question right. Actually all of you were right and Linus hit the nail on the head. The approach or the design of the business stakeholders wasn't right and my correspondence with them also has been at fault, which led to poor understanding. 

 

I do understand even if the data is as is, the acct-no key will not get the correct card_no unless I have another flag to verify if it's primary or secondary to get the right match.

 

Anyways, leaving aside the business. In general, whenever i tried datastep merge with very large datasets that has duplicates in "both tables", it seems to run for ages with the sort requirement as pre requisite makes it even worse. So, is there a better alternative or just proc sql? I would like to close this thread with something learnt than nothing. Sorry for wasting your time earlier, my bad!

Reeza
Super User
If you have multiples in both files and the files are large it's going to be a big table.
Say table 2 has 8 records for one ID and Table 1 has 4, then the output will have 8*4=32 just for that one ID. So if you have a lot of duplicates your table size will be large and it will take a long time. Check your analysis process and see if you can strip the duplicates before merging somehow. This may mean transforming the data somehow.
LinusH
Tourmaline | Level 20

When working with normalized key structures, I'll favour SQL, since it's easier to predict what's gonna happen. And it was adopted to query these kind of structures.

But, it seems that you don' have a truly normalized structure at this point, which leads to the question - where do you get your data from? Hopefully, there is a DW at you site that stores data in a more structured way.

 

It's also good to know it this is an ad-hoc query, or something that will done regularly? If thet latter, consider building a robust table structure (a.k.a. data model) that can handle future requests without reinventing the wheel. In such permanent structure (dimensional model perhaps?) you index your join keys (acc_no in this case) to speed up join performance. You can also speed up thins by storing data in a proper engine. If you use SAS, SPDE is faster than the Base engine in both update (index maintenance) and query.

Data never sleeps
Tom
Super User Tom
Super User

Can you explain more how you want to use the data?

If there is a variable that indicates if the record is primary or secondary then perhaps you can model the merge as 1-to-1-to-many.

data want ;
   merge table2(where=(type='PRIMARY') keep=acct_no card_no type rename=(card_no=primary_card))
         table2(where=(type='SECONDARY') keep=acct_no card_no type rename=(card_no=secondary_card))
        table1 
  ;
  by acct_no;
run;

Or do you need all of the "transactions" or just the end result of applying them?  if so then perhaps you can apply them and reduce table1 to one record per account.

Ksharp
Super User

The point is what kind of output table you would like to see ? You didn't post the output yet .

I noticed in table1 there are  two acct_no =a1 , if there are also  two acct_no =a1 in table2 ?

 

My reckoning is to use Hash Table, since you only want card_no from table2 .

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2350 views
  • 0 likes
  • 5 in conversation