11-24-2015 03:02 PM
How to solve a lookup problem with very, very large datasets?
I have two datasets, dataset1 and dataset2.
dataset1 has client_id and var1-var25 (so 26 variables in total) /* client_id is the key and is unique but not ordered */
dataset2 has duplicate client_ids /* obviously, because the dataset holds the daily balances of bank customers for the last several months */
client_id date daily_balances
1 14/10/2014 5000 /* dates and daily balances of clients until end of file */
Q1. I simply need to do a left join to get the daily balances into dataset1, because I want the daily balances of only those client_ids that are in dataset1. The problem is that dataset2 has over 250 million records while dataset1 has a million records. Any efficient solution with SAS 9.2, please?
Q2. I want the last balance of each client_id. Is it as simple as sorting and using last.client_id after the join?
I have no clue how to manage the lookup from dataset1 (1 million records) against dataset2 (250 million records). I'd appreciate any help.
11-24-2015 04:58 PM
I'm not sure if I can help you, but I think it would be useful if you could provide the community with more information about your problem.
11-24-2015 05:12 PM
There would be options such as creating indexes, and possibly also storing the data using the SPDE engine.
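A minimal sketch of that option, assuming placeholder paths and that dataset2 currently lives in WORK; SPDE spreads I/O over several data paths and builds the index while loading:

```
/* Placeholder paths - adjust to your environment */
libname spdelib spde '/spde/meta'
    datapath=('/spde/data1' '/spde/data2')
    indexpath=('/spde/idx');

/* Copy dataset2 into the SPDE library with an index on the key */
data spdelib.dataset2 (index=(client_id));
    set work.dataset2;
run;
```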
But in the end I believe that taking a "traditional" approach and sorting/merging the data will be fastest in your case and give you the most flexibility to answer multiple questions downstream.
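The sort/merge approach could look like the sketch below (dataset and variable names taken from the post; it assumes date is a numeric SAS date so that sorting by it gives chronological order):

```
proc sort data=dataset1;
    by client_id;
run;

proc sort data=dataset2;
    by client_id date;          /* date order is needed for last. processing */
run;

/* Q1: left join - keep every record whose client_id appears in dataset1 */
data joined;
    merge dataset1 (in=in1)
          dataset2 (in=in2);
    by client_id;
    if in1;
run;

/* Q2: within each client_id the records are in date order,
   so the last record of each BY group carries the latest balance */
data last_balance;
    set joined;
    by client_id;
    if last.client_id;
run;
```

Sorting 250 million records is the expensive step here, but once sorted, the same data answer both questions and any further BY-group queries downstream.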
In case you only need the latest record per customer from dataset2, what could work is to first write the client_id and latest date to a hash table (one pass through the data), then in a second pass select only the records matching the hash. Write these records to a new table, and then sort this table with the much-reduced volume.
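The two-pass hash approach described above could be sketched as follows (names are illustrative; it assumes date is a numeric SAS date, and note that a hash of ~1 million clients fits comfortably in memory):

```
/* Pass 1: one pass through dataset2 to capture the latest date per client */
data _null_;
    length max_date 8;
    if _n_ = 1 then do;
        if 0 then set dataset2;        /* put client_id/date into the PDV */
        declare hash h();
        h.defineKey('client_id');
        h.defineData('client_id', 'max_date');
        h.defineDone();
    end;
    set dataset2 end=eof;
    rc = h.find();                     /* fills max_date if the key exists */
    if rc ne 0 or date > max_date then do;
        max_date = date;
        rc = h.replace();              /* replace() also inserts new keys */
    end;
    if eof then rc = h.output(dataset:'latest_dates');
run;

/* Pass 2: keep only the record carrying each client's latest date */
data last_balance (drop=max_date);
    length max_date 8;
    if _n_ = 1 then do;
        declare hash h(dataset:'latest_dates');
        h.defineKey('client_id');
        h.defineData('max_date');
        h.defineDone();
    end;
    set dataset2;
    if h.find() = 0 and date = max_date;
run;
```

If a client can have several records on the same date, pass 2 keeps all of them; the much smaller last_balance table can then be deduplicated and joined to dataset1.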