DATA Step, Macro, Functions and more

SAS SQL Code Efficiency - Left Join

Reply
Occasional Contributor
Posts: 7

SAS SQL Code Efficiency - Left Join

Hi All,

I am using following code for the requirment given below, Present it is taking 5 hours to complete the following query because dataset size are huge.


Can somebody help me with efficient SAS code which can reduce processing time.

     
Proc sql;

    create table INVENTORY as select distinct
       a.*
    ,min(case when (b.PROCESS_DATE GT a.CREATION_DATE and c.MAKING_DATE GE b.PROCESS_DATE and c.MAKING_DATE LE b.PROCESS_DATE_5DAYS and c.AMOUNT LE 0)
              then b.PROCESS_DATE else . end) as CLEARENCE_DATE format=date9.
    ,min(case when (b.PROCESS_DATE GT a.CREATION_DATE and c.MAKING_DATE GE b.PROCESS_DATE and c.MAKING_DATE LE b.PROCESS_DATE_5DAYS and c.AMOUNT LE 0)
              then c.MAKING_DATE  else . end) as INV_DATE format=date9.
    from MAIN_DATA as a
    left join PROCESSING_DATA as b    <------(Has Record Count of "682152529")
    on a.ID = b.ID
    left join MAKING_DATA as c    <------(Has Record Count of "374098947")
    on a.ID = c.ID  
    group by a.ID, a.PART_ID
    order by a.ID, a.PART_ID;

Quit;

I have one main base dataset "MAIN_DATA" and I am joining my "MAIN_DATA" with two huge datasets called "PROCESSING_DATA" and "MAKING_DATA"
1. MAIN_DATA (This dataset has "CREATION_DATE")
2. PROCESSING_DATA (This dataset has "PROCESS_DATE" and "PROCESS_DATE_5Days" )
3. MAKING_DATA (This dataset has "MAKING_DATE" and "AMOUNT")

1. Requirement for Identifying "Clearence Date":
I want to know earliest "PROCESS_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

2. Requirement for Identifying "Inventory Date":
I want to know earliest "MAKING_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

 

Thanks,

Ravi

Super User
Posts: 7,809

Re: SAS SQL Code Efficiency - Left Join

Posted in reply to Ravikumar_RT

Using select distinct on a long list of variables is usually a VERY BAD IDEA, as it forces SQL to sort by ALL variables to remove doubles.

Your first task shall therefore be to determine if that distinct is necessary, and how to work around it if it is (eg if only a subgroup of variables contains the necessary values to determine the distinct, you can do a sort nodupkey using that).

Next you need to inspect your relationships. If you have a 1:n relationship, you can replace that join with a sort and data step merge, which usually outperforms SQL up to orders of magnitude.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
PROC Star
Posts: 102

Re: SAS SQL Code Efficiency - Left Join

Posted in reply to Ravikumar_RT

How large is your MAIN_DATA table? If it has significantly fewer rows than the other two tables, you may get better performance by putting indexes on the large tables:

proc sql;
  create index ID on PROCESSING_DATA(ID);
  create index ID on MAKING_DATA(ID);
quit;

Of course, the feasibility and efficiency of this also depends on how often (and how) you update the two large tables. But often the creation of an index takes less time than what you can save in a single query.

Occasional Contributor
Posts: 7

Re: SAS SQL Code Efficiency - Left Join

Hi,

Thanks for your suggestion, I did tried processing by creating index, but still it took more than 5 hours to complete.

MAIN_DATA - "1249382" (Total Record count)
PROCESSING_DATA - "682152529" (Total Record count)
MAKING_DATA - "374098947" (Total Record count)

Please could you help me with any alternative SAS code which helps in reducing processing time.

My requirement is same:

I have one main base dataset "MAIN_DATA" and I am joining my "MAIN_DATA" with two huge datasets called "PROCESSING_DATA" and "MAKING_DATA"
1. MAIN_DATA (This dataset has "CREATION_DATE")
2. PROCESSING_DATA (This dataset has "PROCESS_DATE" and "PROCESS_DATE_5Days" )
3. MAKING_DATA (This dataset has "MAKING_DATE" and "AMOUNT")

1. Requirement for Identifying "Clearence Date":
I want to know earliest "PROCESS_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

2. Requirement for Identifying "Inventory Date":
I want to know earliest "MAKING_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

Once again thanks alot for your help.

Regards,
Ravi
Ask a Question
Discussion stats
  • 3 replies
  • 142 views
  • 2 likes
  • 3 in conversation