Joining two datasets with millions of observations

Regular Contributor
Posts: 161

I'm running the following code on two datasets that are very long but not wide; both contain over one million observations. The SAS code takes forever, yet it reports no errors. What went wrong with the following code? The series variables are numeric values approaching 1 billion, and they are used as the key for merging the datasets. Changing the merge method to an inner join does not help either. How should I deal with this problem?

proc sql;
  create table WANT as
  select *
  from HAVE1 as a left join HAVE2 as b
    on b.series_beg <= a.series <= b.series_end;
quit;


All Replies
Solution
‎06-30-2013 01:09 AM
Super User
Posts: 10,020

Re: Joining two datasets with millions of observations

Posted in reply to caveman529

You are using a Cartesian product, which does not scale to large tables.

A hash table is a good choice.
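
For illustration, here is a minimal sketch of the hash idea in a DATA step, assuming HAVE2 fits in memory and that series, series_beg, and series_end are the variables involved in the match (any other HAVE2 variables would also need to be cleared for non-matching rows). It still probes the ranges for each HAVE1 row, but the scan runs entirely in memory and stops at the first match, so the full Cartesian product is never written out:

data WANT;
  if 0 then set HAVE2;                      /* define the HAVE2 variables in the PDV */
  if _n_ = 1 then do;
    declare hash h2(dataset: "HAVE2", multidata: "yes");  /* load HAVE2 into memory */
    h2.defineKey("series_beg");
    h2.defineData(all: "yes");
    h2.defineDone();
    declare hiter it("h2");
  end;
  set HAVE1;
  found = 0;
  rc = it.first();                          /* walk the in-memory copy of HAVE2 */
  do while (rc = 0 and not found);
    if series_beg <= series <= series_end then found = 1;
    else rc = it.next();
  end;
  if not found then call missing(series_beg, series_end);  /* keep left-join behaviour */
  drop rc found;
run;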

Contributor
Posts: 41

Re: Joining two datasets with millions of observations

Posted in reply to caveman529

1. I would run a small number of observations using the inobs= or outobs= option to check that your logic is correct (see the sketch after this list). Do you really need all the variables? Try limiting the number of variables you select.

2. Check with proc setinit which SAS engines you have licensed; using SAS/ACCESS engines will improve performance (in-database processing) if you are working against a database.

3. Try the bufsize and bufno options, as these can help.

4. Hash tables are good as long as you have enough resources (i.e. memory).
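
As a quick illustration of the first tip, here is a minimal sketch that caps how many rows PROC SQL reads from each source table so the join logic can be checked before running it in full (WANT_TEST and the 1000-row limit are just example choices):

proc sql inobs=1000;           /* read at most 1000 rows from each source table */
  create table WANT_TEST as
  select *
  from HAVE1 as a left join HAVE2 as b
    on b.series_beg <= a.series <= b.series_end;
quit;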

Super User
Posts: 5,424

Re: Joining two datasets with millions of observations

Posted in reply to gra_in_aus

In addition to these tips, you might want to consider these options:

  • Have the tables sorted prior to the join (have1.series and have2.series_beg). Make sure that sortedby= is set for the sorted tables.
  • Use all available RAM (MEMSIZE and SORTSIZE global options).
  • As an alternative to recoding the join to use DATA step hashing, you could load have2 into memory (SASFILE statement); see the sketch below. I think it would be interesting to see how that affects the processing.
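
Here is a minimal sketch of the SASFILE idea, assuming HAVE2 fits in the available memory; the join itself is unchanged:

sasfile HAVE2 load;       /* pin HAVE2 in memory for the duration of the join */

proc sql;
  create table WANT as
  select *
  from HAVE1 as a left join HAVE2 as b
    on b.series_beg <= a.series <= b.series_end;
quit;

sasfile HAVE2 close;      /* release the memory */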
Data never sleeps
Regular Contributor
Posts: 161

Re: Joining two datasets with millions of observations

Thank you, guys.  I think it is because the product is just too big.  I subset my data to 1,000 observations and it completed within 5 minutes.  I'm using my school's server.  I don't think I can set system options such as bufsize and bufno, can I?

The RAM is pretty big.  How can I make use of that on a Linux server?  Again, it seems that memsize and sortsize are both system options that I cannot change as a user.  But is there a way to include them in my code?


Thank you!

Super User
Posts: 5,424

Re: Joining two datasets with millions of observations

Posted in reply to caveman529

Yes, you can change global system options in your program; there should be no restrictions.
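
For example, a minimal sketch of setting session options in the program itself (the values are hypothetical and depend on what the server allows; note that MEMSIZE usually has to be set when SAS is invoked, whereas the options below can be changed mid-session):

options sortsize=2G bufsize=128k bufno=20;            /* example values only */

proc options option=(sortsize bufsize bufno) value;   /* confirm the settings in the log */
run;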

Data never sleeps
Super User
Posts: 5,497

Re: Joining two datasets with millions of observations

Posted in reply to caveman529

Is there a sorted order to either data set?
