help with inner join to very large table

brulard · Posted 05-05-2017 10:11 AM

hello,

i'm trying to do an inner join but when attempting to join to a particular table, i find myself aborting the query as it is takes too long, or I end up receiving an error message (will post when it appears again).

If anyone has coding suggestion (maybe there is a more efficient way of coding my join to a large table), please let me know.

Here is the code I used:

Proc sql; create table want as
Select a.*
From have as a, database_historical_table b where
a.ID_2=b.ID_2 and a.event_date=b.date and b.ID_status='';quit;

My table "have", has only 30 records. The table "database_historical_table", refreshes daily for all of our card customers, so has billion of records.

ID2 is 16 text char.

The date range is within april '17.

The ID_status is 1 char.

Thank you,

RW9 · Posted 05-05-2017 10:43 AM

I can't really tell from what you post. Some test data in the form of a datastep and what the output should look like would go some ways to clarifying. I would suspect having a smaller intermediary table first off would minimise the rows needed:

proc sql; 
  create table WANT as
  select  A.*
  from    HAVE A
  left join (select * from database_historical_table where ID_STATUS="") B 
  on      a.ID_2=b.ID_2 and a.event_date=b.date;
quit;

That will shrink it, however if you have that many rows it will still be long. SQL probably isnt the tool to do large data. Show what you have and what you want and will be able to say more.

kiranv_ · Posted 05-05-2017 10:59 AM

Instead of join. please use hash technique or proc format.

paper for hash join

http://analytics.ncsu.edu/sesug/2011/BB08.Lafler.pdf

paper for join with proc format

http://www.lexjansen.com/phuse/2007/cc/CC02.pdf

below is the paper which discusses various lookups something what you are doing.

http://lexjansen.com/nesug/nesug10/cc/cc37.pdf

Kurt_Bremser · Posted 05-05-2017 11:02 AM

First, create a sorted sub-dataset:

proc sort
  data=database_historical_table (
    keep=ID_2 date ID_status
    rename=(date=event_date)
    where=(ID_status = '')
  )
  out=big_table (
    drop=ID_status
  )
;
by ID_2 event_date;
run;

Then, do a data step merge on the tables:

data want;
merge
  have (in=a)
  big_table (in=b)
;
by ID_2 event_date;
if a and b;
run;

have should also be sorted first, of course.

The crucial thing is not to have proc sql build a monster utility file and do a gazillion of read/writes to/from that

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 05-05-2017 12:48 PM

And an index to the large table. Proc SQL should be able to use the index if it exists.

help with inner join to very large table

Re: help with inner join to very large table

Re: help with inner join to very large table

Re: help with inner join to very large table

Re: help with inner join to very large table

help with inner join to very large table

Re: help with inner join to very large table

Re: help with inner join to very large table

Re: help with inner join to very large table

Re: help with inner join to very large table

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away