DATA Step, Macro, Functions and more

How to create SAS Scalable Performance Data Engine tables on the Hadoop Distributed File System

Contributor
Posts: 46

How do I join two very large datasets when the id key has duplicates in both datasets?

 

For example, I have the following two datasets:

 

table1:

acct_no     a     b     c     d
a1          4     7     2     8
a3          5    32     8    12
a5         42    12    54    65
a1          5     2    17    23

 

acct_no is the key, and it has duplicates because this is a transaction dataset.

 

table2:

acct_no    card_no
a4              54
a3              31
a4              43
a1              24
a8              12
a7              23
a8              45

 

acct_no in table2 also has duplicates, because a single customer can hold more than one card: a primary card plus one or more secondary cards.
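For reference, the two sample tables above can be recreated with simple DATALINES steps (a minimal sketch; I am assuming acct_no is character and all the other columns are numeric, as the examples suggest):

data table1;
    input acct_no $ a b c d;
    datalines;
a1 4 7 2 8
a3 5 32 8 12
a5 42 12 54 65
a1 5 2 17 23
;
run;

data table2;
    input acct_no $ card_no;
    datalines;
a4 54
a3 31
a4 43
a1 24
a8 12
a7 23
a8 45
;
run;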

 

Can anybody help me with the logic for the join?

 

The datasets contain 400 million records and are both wide and long. However, I only need to fetch card_no from table2, so table1 would be my left table. Sorting is not an option either.
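Since sorting is ruled out, one approach worth sketching is a DATA step hash lookup: table1 is read sequentially and card_no is fetched from an in-memory hash of table2. This is only a sketch under the assumption that table2's acct_no/card_no pairs fit in memory; the output name want is just a placeholder, and multidata:'yes' is what handles the duplicate acct_no values (multiple cards) on the table2 side:

data want;
    if _n_ = 1 then do;
        /* Declare card_no in the PDV with its attributes from table2 */
        if 0 then set table2(keep=acct_no card_no);
        /* Load table2 into memory; multidata:'yes' keeps every card per acct_no */
        declare hash h(dataset:'table2(keep=acct_no card_no)', multidata:'yes');
        h.defineKey('acct_no');
        h.defineData('card_no');
        h.defineDone();
    end;
    set table1;
    call missing(card_no);
    if h.find() = 0 then do;
        output;                          /* first card for this account */
        do while (h.find_next() = 0);
            output;                      /* one extra row per additional card */
        end;
    end;
    else output;                         /* left join: keep rows with no card */
run;

Note that because acct_no is duplicated on both sides, this is a many-to-many join: an account with m transaction rows and n cards produces m x n output rows, exactly as a PROC SQL left join on acct_no would. If table2 turns out to be too large to hold in memory, the usual fallback is PROC SQL (select t1.*, t2.card_no from table1 t1 left join table2 t2 on t1.acct_no = t2.acct_no), which also requires no pre-sorting on your part.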

 

Super User
Posts: 5,435

Re: How to create SAS Scalable Performance Data Engine tables on the Hadoop Distributed File System

Posted in reply to Allaluiah

The contents of this post look like a duplicate.

But why is the title totally different from the question?

Data never sleeps