How to solve this lookup problem with a very large dataset?

Frequent Contributor
Posts: 137

How to solve this lookup problem with a very large dataset?

I have two datasets, dataset1 and dataset2.

dataset1 has

client_id  var1-var25   /* 26 variables in total; client_id is the key and is unique, but not ordered */

client_id:
1
4
6
9
2
3

dataset2 has duplicate client_ids /* naturally, because the dataset holds daily balances of bank customers for the last several months */

client_id        date            daily_balances
1             14/10/2014    5000     /* dates and daily balances continue until end of file */
2
1
1
3
4
3
2

Q1. I need to do a left join to bring the daily balances into dataset1, because I want the daily balances of only those client_ids that appear in dataset1. The problem is that dataset2 has over 250 million records and dataset1 has a million records. Any efficient solution with SAS 9.2, please?

Q2. I also want the last balance of each client_id. Is it as simple as sorting and using last.client_id after the join?

 

I have no clue how to manage the lookup from dataset1 (1 million records) into dataset2 (250 million records). I'd appreciate any help, please.
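For concreteness, this is the plain PROC SQL version of the join I have in mind (a sketch using the dataset and variable names above); it gives the right answer on small data, but I assume it is far too slow at this volume:

```sas
/* Straightforward left join -- correct, but presumably too
   slow/IO-heavy against a 250-million-record table */
proc sql;
   create table want as
   select a.*, b.date, b.daily_balances
   from dataset1 as a
   left join dataset2 as b
   on a.client_id = b.client_id;
quit;
```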

 

Trusted Advisor
Posts: 1,118

Re: How to solve this lookup problem with a very large dataset?

Posted in reply to CharlotteCain

I'm not sure if I can help you, but I think it would be useful if you could provide the community with more information about your problem:

 

  1. Which platform are you working on? Windows PC, Workstation, Unix server, Mainframe, ...?
  2. Which SAS products are installed at your site? Only Base SAS 9.2?
  3. How much RAM and disk space are available?
  4. What is the size of the two datasets in GB?
  5. Is either of them compressed?
  6. Can you give an estimate as to what percentage of the 250 million records are from CLIENT_IDs contained in dataset1?
  7. Is dataset1 a kind of master dataset (e.g. with names, addresses etc. of the clients) and, if so, why isn't it sorted by CLIENT_ID?
  8. Is this a one-time task or will it have to be done regularly?
  9. Do you have empirical values (run times, resource consumption) from similar tasks (but with smaller datasets) in the past?
  10. Have you considered the use of indexes?
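Regarding point 10, an index on the join key could be created like this (just a sketch, assuming both datasets sit in the WORK library):

```sas
/* Create a simple index on CLIENT_ID so that subsetting WHERE
   clauses and keyed lookups against dataset2 can use it */
proc datasets library=work nolist;
   modify dataset2;
   index create client_id;
quit;
```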
Respected Advisor
Posts: 4,173

Re: How to solve this lookup problem with a very large dataset?

Posted in reply to CharlotteCain

Options would include creating indexes and possibly also storing the data using the SPDE engine.
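An SPDE library is just an alternative LIBNAME engine that partitions a table across several paths, which can speed up I/O on very large tables. A minimal sketch (the paths and the PARTSIZE value are made up and would need tuning for your environment):

```sas
/* Hypothetical SPDE library spread over two data paths;
   PARTSIZE= is the partition size in megabytes */
libname speedy spde '/sasdata/meta'
        datapath=('/disk1/sasdata' '/disk2/sasdata')
        partsize=512;

/* Copy the big table into the SPDE library */
data speedy.dataset2;
   set work.dataset2;
run;
```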

 

But in the end I believe that taking a "traditional" approach and sorting/merging the data will be fastest in your case and will give you the most flexibility to answer multiple questions downstream.
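A sketch of that traditional approach, using the names from your post: the IN= flag gives you the left join for Q1, and because dataset2 is sorted by date within client_id, LAST.client_id picks the latest balance for Q2.

```sas
proc sort data=dataset1; by client_id;      run;
proc sort data=dataset2; by client_id date; run;

data want;
   merge dataset1 (in=a)
         dataset2 (keep=client_id date daily_balances);
   by client_id;
   if a;                /* Q1: left join, keep all dataset1 clients */
run;

data last_balance;
   set want;
   by client_id;
   if last.client_id;   /* Q2: latest balance per client (date order) */
run;
```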

 

In case you only need the latest record per customer from dataset2, what could work is to first write the ID and latest date to a hash table (one pass through the data) and then, in a second pass, select only the records matching the hash. Write these records to a new table and then sort this new table with the reduced volume.
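A sketch of that two-pass hash idea (untested; it assumes DATE is a numeric SAS date and that a client with two records on its latest date would acceptably produce two output rows):

```sas
data last_balance (drop=maxdate rc);
   length client_id maxdate 8;
   declare hash h ();
   h.defineKey('client_id');
   h.defineData('maxdate');
   h.defineDone();

   /* Load the 1 million wanted client_ids from dataset1 */
   do until (eof1);
      set dataset1 (keep=client_id) end=eof1;
      maxdate = .;
      rc = h.add();
   end;

   /* Pass 1 through dataset2: remember the latest date per wanted id */
   do until (eof2);
      set dataset2 (keep=client_id date) end=eof2;
      if h.find() = 0 and date > maxdate then do;
         maxdate = date;
         rc = h.replace();
      end;
   end;

   /* Pass 2 through dataset2: output only the matching records */
   do until (eof3);
      set dataset2 end=eof3;
      if h.find() = 0 and date = maxdate then output;
   end;
   stop;
run;
```

The hash holds only 1 million key/date pairs, so it should fit comfortably in memory, and dataset2 is read sequentially twice rather than sorted.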
