BookmarkSubscribeRSS Feed
CharlotteCain
Quartz | Level 8

How to the solve the look solution of this problem with very very large dataset?

I have a two datasets, dataset1 and dataset2.

dataset1 has 

client_id var1_var25 (so in total 26 variables)/8client_id is the key and is unique but not ordered */

1

4

6

9

2

3

 

dataset2 has  duplicate client_ids/* obviously coz the dataset has daily balances of bank customers for the last so many months*/

client_id        date            daily_balances

1             14/10/2014    5000     /*some date and daily balances of clients until end of file*/

2

1

1

3

4

3

2

 

Q1. I need to simply to do a left join to get the daily balances in dataset1 coz i want the daily balances of only those client_id's of dataset1? The problem is that that dataset2 has over 250 million records and dataset1 has a million records. Any efficient solution with SAS 9.2 please?

Q2. I want the last balance of each client_id, just as simple as to sort and last.id after the join?

 

I have no clue how to manage the look up dataset1(with 1 million records) to  dataset2(with 250 million records). I'd appreciate any help please.

 

2 REPLIES 2
FreelanceReinh
Jade | Level 19

I'm not sure if I can help you, but I think it would be useful if you could provide the community more information about your problem:

 

  1. Which platform are you working on? Windows PC, Workstation, Unix server, Mainframe, ...?
  2. Which SAS products are installed at your site? Only Base SAS 9.2?
  3. How much RAM and disk space are available?
  4. What is the size of the two datasets in GB?
  5. Is either of them compressed?
  6. Can you give an estimate as to what percentage of the 250 million records are from CLIENT_IDs contained in dataset1?
  7. Is dataset1 a kind of master dataset (e.g. with names, addresses etc. of the clients) and, if so, why isn't it sorted by CLIENT_ID?
  8. Is this a one-time task or will it have to be done regularly?
  9. Do you have empirical values (run times, resource consumption) from similar tasks (but with smaller datasets) in the past?
  10. Have you considered the use of indexes?
Patrick
Opal | Level 21

There would be options creating indexes and eventually also storing the data using the SPDE engine.

 

But in the end I believe that taking a "traditional" approach and sorting/merging  the data will be fastest in your case and give you the most flexibility to answer multiple questions downstream.

 

In case you only need the latest record per customer from dataset2 then what could work is to first write ID and latest date to a hash table (1 pass through the data) and then in a second pass only select the records matching with the hash. Write these records to a new table and then sort this new table with the reduced volumes.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 402 views
  • 0 likes
  • 3 in conversation