topic Re: Split multiple join to optimize time in SAS Programming

Split multiple join to optimize time

nirupama1 — Sat, 02 May 2020 21:07:41 GMT

proc sql;

create table call as

select

input(cdf.case_id,best12.) as mbr_pgm_id ,

count(distinct(af.actv_id)) as AttemptedCallsN ,

cdf.TOUCH_DT_DIMID

from hmaodm.cse_dtl_fact cdf

left join hmaodm.actv_fact af

on cdf.case_id = af.case_id

and af.cse_src_sys_cd in ('CPM')

left join HMAODM.ref_dim RDA

on af.actv_rsltn_otcome_dimid = rda.ref_dimid

and rda.ref_nm = 'activityResolutionOutcomeType'

left join HMAODM.CUST_SEG_DIM CSD

on CDF.CUST_SEG_DIMID = CSD.CUST_SEG_DIMID

where cdf.qual_dt_dimid > 0

and cdf.touch_dt_dimid = .

and rda.ref_cd in (&actv_typ_id.)

and RDA.ref_cd <> '44' AND CSD.CUST_SEG_NBR IN (&user_defined_policy.)

AND (CDF.CSE_OPEN_DT_DIMID <= &end_yyyymmdd.

AND (CDF.CSE_CLOS_DT_DIMID = 99991231 or CDF.CSE_CLOS_DT_DIMID >= &beg_yyyymmdd.))

AND CDF.MEDCR_IND_DIMID <>1

group by cdf.case_id, cdf.TOUCH_DT_DIMID;

quit;

This multiple join is taking 50+ hours to run. Please help me to optimize it, so that it takes less execution time.

I tried to split the above query as below but the output does not match. Please help! It is urgent .

proc sql noprint;

create table call1 as

select

cdf.case_id ,

af.actv_rsltn_otcome_dimid,

count(distinct(actv_id)) as AttemptedCallsN

from CRP_HMAODM.cse_dtl_fact cdf

left join CRP_HMAODM.actv_fact af

on cdf.case_id = af.case_id

and af.cse_src_sys_cd in ('CPM')

where cdf.qual_dt_dimid > 0

and cdf.touch_dt_dimid IS NULL

AND (CDF.CSE_OPEN_DT_DIMID <= &end_yyyymmdd.

AND (CDF.CSE_CLOS_DT_DIMID = 99991231 or CDF.CSE_CLOS_DT_DIMID >= &beg_yyyymmdd.))

AND CDF.MEDCR_IND_DIMID <>1

group by cdf.case_id,af.actv_rsltn_otcome_dimid

order by CDF.case_id

;

quit;

proc sql ;

create table call2 as

select distinct CALL1.*

from call1

left join HMAODM.ref_dim RDA

on call1.actv_rsltn_otcome_dimid = rda.ref_dimid

and rda.ref_nm = 'activityResolutionOutcomeType'

where rda.ref_cd in (&actv_typ_id.)

AND RDA.ref_cd <> '44'

order by call1.case_id

;

quit;

proc sql noprint;

create table call3 as

select distinct

CDF.case_id

from CRP_HMAODM.cse_dtl_fact cdf

left join CRP_HMAODM.CUST_SEG_DIM CSD

on CDF.CUST_SEG_DIMID = CSD.CUST_SEG_DIMID

where

CSD.CUST_SEG_NBR IN (&user_defined_policy.)

order by CDF.case_id

;

quit;

proc sort data=call2;by case_id;RUN;

proc sort data=call3;by case_id;RUN;

data call23;

merge call2 (IN=A)

call3 (IN=B)

;

By case_id;

If A THEN OUTPUT;

RUN;

Re: Split multiple join to optimize time

ChrisNZ — Sat, 02 May 2020 23:57:45 GMT

1. It looks like the table rda is only used for filtering. Why not have an inner join?

2. It looks like table csd is not used at all.

3. What # of records do you anticipate from each table? Are they indexed?

4. Why not format your code so it can be read without getting a headache? Something like

create table CALL as
select 
    input(cdf.CASE_ID,best12.)  as mbr_pgm_id 
  , count(distinct(af.ACTV_ID)) as AttemptedCallsN 
  , cdf.TOUCH_DT_DIMID
from HMAODM.CSE_DTL_FACT                  cdf
       left join  
     HMAODM.ACTV_FACT                     af
       on  cdf.CASE_ID = af.CASE_ID
       and af.CSE_SRC_SYS_CD in ('CPM')
       left join 
    HMAODM.REF_DIM                        rda
      on  af.ACTV_RSLTN_OTCOME_DIMID = rda.REF_DIMID
      and rda.REF_NM                 = 'activityResolutionOutcomeType'
      left join 
    HMAODM.CUST_SEG_DIM                   csd
      on cdf.CUST_SEG_DIMID =  csd.CUST_SEG_DIMID 
where cdf.QUAL_DT_DIMID     >  0
  and cdf.TOUCH_DT_DIMID    =  .
  and rda.REF_CD            in (&actv_typ_id.)                                                                                                                                        
  and rda.REF_CD            <> '44'                                                                                                                                                                 AND CSD.CUST_SEG_NBR IN (&user_defined_policy.)           
  and cdf.CSE_OPEN_DT_DIMID <= &end_yyyymmdd.            
  and cdf.CSE_CLOS_DT_DIMID >= &beg_yyyymmdd.
  and cdf.MEDCR_IND_DIMID   <> 1
group by cdf.CASE_ID, cdf.TOUCH_DT_DIMID;
quit;

Re: Split multiple join to optimize time

ChrisNZ — Sun, 03 May 2020 00:03:35 GMT

Oh my goodness, table csd is used. Except the AND clause is in column 200!

Do yourself a favour and format your code properly!

So it looks like table csd should be an inner join too?

So maybe try

cdf

inner join

(most discriminating match, rda or csd?)

inner join

(rda or csd)

left join

Re: Split multiple join to optimize time

Kurt_Bremser — Sun, 03 May 2020 07:04:26 GMT

Please supply some additional information.

Are these libraries local SAS libraries, or do you connect to a remote database?

How many observations do the datasets have, with regard to the subsets you do?

What are the attributes of the variables you pull (types, lengths)?

You also have stupid storage of dates. Fix that before you even continue.

This

input(cdf.case_id,best12.) as mbr_pgm_id

is also bad, keep id's in character.

Re: Split multiple join to optimize time

nirupama1 — Sun, 03 May 2020 11:11:19 GMT

These are local sas libraries. I know it is not appropriately written. The conditions are fine, it generates the desired output.
The actv_fac table is very huge (records in millions).
There is only need to minimize time.
I think the issue is with joins. I tried multiple ways but the result is not the same as the original query.

Re: Split multiple join to optimize time

nirupama1 — Sun, 03 May 2020 11:17:04 GMT

Cse_dtl_fact and actv_fac both are very huge tables

Re: Split multiple join to optimize time

ChrisNZ — Sun, 03 May 2020 11:29:14 GMT

Have you tried inner joins?

What # of records do you anticipate from each table before and after the joins?

Are the tables sorted? indexed?

Re: Split multiple join to optimize time

Kurt_Bremser — Sun, 03 May 2020 11:39:34 GMT

@nirupama1 wrote:
Cse_dtl_fact and actv_fac both are very huge tables

What you call huge might be a little flyspeck for someone else. Please give us the numbers. Not "millions", but 5 millions, 10 millions, and so on. And the sizes of the required variables.

e. g. how many observations of ACTV_FACT satisfy the condition

CSE_SRC_SYS_CD in ('CPM')

and what are the attributes of CASE_ID and ACTV_ID?

This will determine if and where we can use in-memory techniques.

Re: Split multiple join to optimize time

nirupama1 — Sun, 03 May 2020 12:11:15 GMT

6 million observations fall under CSE_SRC_SYS_CD in ('CPM') along with where condition.
If I run the original query by simply removing join with the ref_dim table and where condition of ref_dim table, the output is 55k records.

Re: Split multiple join to optimize time

nirupama1 — Sun, 03 May 2020 12:14:30 GMT

And this run within seconds but as soon as I join with ref_dim table, it executes in about 2days or more and output results in between13k-18k.

Re: Split multiple join to optimize time

Kurt_Bremser — Sun, 03 May 2020 12:16:49 GMT

And what are the lengths of case_id, actv_id and actv_rsltn_otcome_dimid?

Re: Split multiple join to optimize time

nirupama1 — Sun, 03 May 2020 12:37:03 GMT

I am not sure what exactly u mean by length....
These ids are of 9-10 characters length.
case_id=9charactets
actv_id =10 characters
actv_rsltn_otcome_dimid=maybe negative number or 5-6 digits

Re: Split multiple join to optimize time

Kurt_Bremser — Sun, 03 May 2020 12:40:36 GMT

Maxim 3: Know Your Data.

Run PROC CONTENTS on your tables, so you get to know the types and other attributes of your columns.

Re: Split multiple join to optimize time

nirupama1 — Sun, 03 May 2020 12:51:44 GMT

Length
case_id=30
actv_id =30
actv_rsltn_otcome_dimid=8

Re: Split multiple join to optimize time

Kurt_Bremser — Sun, 03 May 2020 13:47:19 GMT

So this table would need roughly 6 million * 70 bytes in memory, less than 500 MB.

Let's start with the first join:

data call;
set hmaodm.cse_dtl_fact;
if _n_ = 1
then do;
  length
    actv_id $30
    actv_rsltn_otcome_dimid 8
  ;
  declare hash af (dataset:"hmaodm.actv_fact (where=(cse_src_sys_cd in ('CPM')))");
  af.definekey("case_id");
  af.definedata("actv_id","actv_rsltn_otcome_dimid");
  af.definedone();
  call missing(actv_id,actv_rsltn_otcome_dimid);
end;
rc = af.find();
run;

Re: Split multiple join to optimize time

ballardw — Mon, 04 May 2020 08:36:32 GMT

I would be strongly tempted to replace code like

       left join  hmaodm.actv_fact af

       on cdf.case_id = af.case_id

       and  af.cse_src_sys_cd in ('CPM')

with something like

       left join  (select * from hmaodm.actv_fact
                     where cse_src_sys_cd in ('CPM'))  af

       on cdf.case_id = af.case_id

to reduce the number of records brought into the join. You have opportunities for this at many of your joins

Re: Split multiple join to optimize time

ChrisNZ — Tue, 05 May 2020 02:36:29 GMT

@ballardw

Doesn't the SQL parser subset the right table when it sees

and  af.cse_src_sys_cd in ('CPM')

Re: Split multiple join to optimize time

ChrisNZ — Tue, 05 May 2020 03:04:33 GMT

@ballardw

Both SQL steps take 4.8s on my machine.

data A(sortedby=I) B(sortedby=I);
  do I=1 to 1e7;
    output;
  end;
run;   
proc sql _method;
  create table T as
  select A.I, b.I as J
  from A left join B on a.I=b.I and b.I=1e7;
quit;
proc sql _method;
  create table T as
  select A.I, b.I as J
  from A left join B(where=(I=1e7)) on a.I=b.I ;
quit;

Re: Split multiple join to optimize time

ballardw — Tue, 05 May 2020 14:59:05 GMT

@ChrisNZ wrote:

@ballardw

Both SQL steps take 4.8s on my machine.

data A(sortedby=I) B(sortedby=I);
  do I=1 to 1e7;
    output;
  end;
run;   
proc sql _method;
  create table T as
  select A.I, b.I as J
  from A left join B on a.I=b.I and b.I=1e7;
quit;
proc sql _method;
  create table T as
  select A.I, b.I as J
  from A left join B(where=(I=1e7)) on a.I=b.I ;
quit;

Cached data sets?

I haven't tested the suggestion in a while but when I had some data across network drives sub-setting the data had some positive impact in my environment.

Re: Split multiple join to optimize time

ChrisNZ — Tue, 05 May 2020 23:21:57 GMT

Interesting. Something to keep one's eyes on then.

It'd be disappointing if the SQL optimiser did not subset the table. That's such an obvious step.