SAS SQL Code Efficiency - Left Join

Ravikumar_RT · Posted 09-28-2017 09:07 AM

Hi All,

I am using following code for the requirment given below, Present it is taking 5 hours to complete the following query because dataset size are huge.

Can somebody help me with efficient SAS code which can reduce processing time.


Proc sql;

   create table INVENTORY as select distinct
       a.*
   ,min(case when (b.PROCESS_DATE GT a.CREATION_DATE and c.MAKING_DATE GE b.PROCESS_DATE and c.MAKING_DATE LE b.PROCESS_DATE_5DAYS and c.AMOUNT LE 0)
              then b.PROCESS_DATE else . end) as CLEARENCE_DATE format=date9.
   ,min(case when (b.PROCESS_DATE GT a.CREATION_DATE and c.MAKING_DATE GE b.PROCESS_DATE and c.MAKING_DATE LE b.PROCESS_DATE_5DAYS and c.AMOUNT LE 0)
              then c.MAKING_DATE else . end) as INV_DATE format=date9.
   from MAIN_DATA as a
   left join PROCESSING_DATA as b   <------(Has Record Count of "682152529")
   on a.ID = b.ID
   left join MAKING_DATA as c   <------(Has Record Count of "374098947")
   on a.ID = c.ID
   group by a.ID, a.PART_ID
   order by a.ID, a.PART_ID;

Quit;

I have one main base dataset "MAIN_DATA" and I am joining my "MAIN_DATA" with two huge datasets called "PROCESSING_DATA" and "MAKING_DATA"
1. MAIN_DATA (This dataset has "CREATION_DATE")
2. PROCESSING_DATA (This dataset has "PROCESS_DATE" and "PROCESS_DATE_5Days" )
3. MAKING_DATA (This dataset has "MAKING_DATE" and "AMOUNT")

1. Requirement for Identifying "Clearence Date":
I want to know earliest "PROCESS_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

2. Requirement for Identifying "Inventory Date":
I want to know earliest "MAKING_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

Thanks,

Ravi

Kurt_Bremser · Posted 09-28-2017 09:45 AM

Using select distinct on a long list of variables is usually a VERY BAD IDEA, as it forces SQL to sort by ALL variables to remove doubles.

Your first task shall therefore be to determine if that distinct is necessary, and how to work around it if it is (eg if only a subgroup of variables contains the necessary values to determine the distinct, you can do a sort nodupkey using that).

Next you need to inspect your relationships. If you have a 1:n relationship, you can replace that join with a sort and data step merge, which usually outperforms SQL up to orders of magnitude.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

s_lassen · Posted 09-29-2017 03:40 AM

How large is your MAIN_DATA table? If it has significantly fewer rows than the other two tables, you may get better performance by putting indexes on the large tables:

proc sql;
  create index ID on PROCESSING_DATA(ID);
  create index ID on MAKING_DATA(ID);
quit;

Of course, the feasibility and efficiency of this also depends on how often (and how) you update the two large tables. But often the creation of an index takes less time than what you can save in a single query.

Ravikumar_RT · Posted 09-29-2017 11:29 AM

Hi,

Thanks for your suggestion, I did tried processing by creating index, but still it took more than 5 hours to complete.

MAIN_DATA - "1249382" (Total Record count)
PROCESSING_DATA - "682152529" (Total Record count)
MAKING_DATA - "374098947" (Total Record count)

Please could you help me with any alternative SAS code which helps in reducing processing time.

My requirement is same:

I have one main base dataset "MAIN_DATA" and I am joining my "MAIN_DATA" with two huge datasets called "PROCESSING_DATA" and "MAKING_DATA"
1. MAIN_DATA (This dataset has "CREATION_DATE")
2. PROCESSING_DATA (This dataset has "PROCESS_DATE" and "PROCESS_DATE_5Days" )
3. MAKING_DATA (This dataset has "MAKING_DATE" and "AMOUNT")

1. Requirement for Identifying "Clearence Date":
I want to know earliest "PROCESS_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

2. Requirement for Identifying "Inventory Date":
I want to know earliest "MAKING_DATE" date for the following:
"PROCESS_DATE" should be After "CREATION_DATE" and "MAKING_DATE" should be between "PROCESS_DATE" and "PROCESS_DATE_5Days" and "AMOUNT" should be ZERO(0) or less than ZERO(0)

Once again thanks alot for your help.

Regards,
Ravi

SAS SQL Code Efficiency - Left Join

Re: SAS SQL Code Efficiency - Left Join

Re: SAS SQL Code Efficiency - Left Join

Re: SAS SQL Code Efficiency - Left Join

SAS SQL Code Efficiency - Left Join

Re: SAS SQL Code Efficiency - Left Join

Re: SAS SQL Code Efficiency - Left Join

Re: SAS SQL Code Efficiency - Left Join

Click image to register for webinar

Classroom Training Available!