Re: Split multiple join to optimize time

nirupama1 · Posted 05-02-2020 05:07 PM

proc sql;

create table call as

select

input(cdf.case_id,best12.) as mbr_pgm_id ,

count(distinct(af.actv_id)) as AttemptedCallsN ,

cdf.TOUCH_DT_DIMID

from hmaodm.cse_dtl_fact cdf

left join hmaodm.actv_fact af

on cdf.case_id = af.case_id

and af.cse_src_sys_cd in ('CPM')

left join HMAODM.ref_dim RDA

on af.actv_rsltn_otcome_dimid = rda.ref_dimid

and rda.ref_nm = 'activityResolutionOutcomeType'

left join HMAODM.CUST_SEG_DIM CSD

on CDF.CUST_SEG_DIMID = CSD.CUST_SEG_DIMID

where cdf.qual_dt_dimid > 0

and cdf.touch_dt_dimid = .

and rda.ref_cd in (&actv_typ_id.)

and RDA.ref_cd <> '44' AND CSD.CUST_SEG_NBR IN (&user_defined_policy.)

AND (CDF.CSE_OPEN_DT_DIMID <= &end_yyyymmdd.

AND (CDF.CSE_CLOS_DT_DIMID = 99991231 or CDF.CSE_CLOS_DT_DIMID >= &beg_yyyymmdd.))

AND CDF.MEDCR_IND_DIMID <>1

group by cdf.case_id, cdf.TOUCH_DT_DIMID;

quit;

This multiple join is taking 50+ hours to run. Please help me to optimize it, so that it takes less execution time.

I tried to split the above query as below but the output does not match. Please help! It is urgent .

proc sql noprint;

create table call1 as

select

cdf.case_id ,

af.actv_rsltn_otcome_dimid,

count(distinct(actv_id)) as AttemptedCallsN

from CRP_HMAODM.cse_dtl_fact cdf

left join CRP_HMAODM.actv_fact af

on cdf.case_id = af.case_id

and af.cse_src_sys_cd in ('CPM')

where cdf.qual_dt_dimid > 0

and cdf.touch_dt_dimid IS NULL

AND (CDF.CSE_OPEN_DT_DIMID <= &end_yyyymmdd.

AND (CDF.CSE_CLOS_DT_DIMID = 99991231 or CDF.CSE_CLOS_DT_DIMID >= &beg_yyyymmdd.))

AND CDF.MEDCR_IND_DIMID <>1

group by cdf.case_id,af.actv_rsltn_otcome_dimid

order by CDF.case_id

;

quit;

proc sql ;

create table call2 as

select distinct CALL1.*

from call1

left join HMAODM.ref_dim RDA

on call1.actv_rsltn_otcome_dimid = rda.ref_dimid

and rda.ref_nm = 'activityResolutionOutcomeType'

where rda.ref_cd in (&actv_typ_id.)

AND RDA.ref_cd <> '44'

order by call1.case_id

;

quit;

proc sql noprint;

create table call3 as

select distinct

CDF.case_id

from CRP_HMAODM.cse_dtl_fact cdf

left join CRP_HMAODM.CUST_SEG_DIM CSD

on CDF.CUST_SEG_DIMID = CSD.CUST_SEG_DIMID

where

CSD.CUST_SEG_NBR IN (&user_defined_policy.)

order by CDF.case_id

;

quit;

proc sort data=call2;by case_id;RUN;

proc sort data=call3;by case_id;RUN;

data call23;

merge call2 (IN=A)

call3 (IN=B)

;

By case_id;

If A THEN OUTPUT;

RUN;

ChrisNZ · Posted 05-02-2020 07:57 PM

1. It looks like the table rda is only used for filtering. Why not have an inner join?

2. It looks like table csd is not used at all.

3. What # of records do you anticipate from each table? Are they indexed?

4. Why not format your code so it can be read without getting a headache? Something like

create table CALL as
select 
    input(cdf.CASE_ID,best12.)  as mbr_pgm_id 
  , count(distinct(af.ACTV_ID)) as AttemptedCallsN 
  , cdf.TOUCH_DT_DIMID
from HMAODM.CSE_DTL_FACT                  cdf
       left join  
     HMAODM.ACTV_FACT                     af
       on  cdf.CASE_ID = af.CASE_ID
       and af.CSE_SRC_SYS_CD in ('CPM')
       left join 
    HMAODM.REF_DIM                        rda
      on  af.ACTV_RSLTN_OTCOME_DIMID = rda.REF_DIMID
      and rda.REF_NM                 = 'activityResolutionOutcomeType'
      left join 
    HMAODM.CUST_SEG_DIM                   csd
      on cdf.CUST_SEG_DIMID =  csd.CUST_SEG_DIMID 
where cdf.QUAL_DT_DIMID     >  0
  and cdf.TOUCH_DT_DIMID    =  .
  and rda.REF_CD            in (&actv_typ_id.)                                                                                                                                        
  and rda.REF_CD            <> '44'                                                                                                                                                                 AND CSD.CUST_SEG_NBR IN (&user_defined_policy.)           
  and cdf.CSE_OPEN_DT_DIMID <= &end_yyyymmdd.            
  and cdf.CSE_CLOS_DT_DIMID >= &beg_yyyymmdd.
  and cdf.MEDCR_IND_DIMID   <> 1
group by cdf.CASE_ID, cdf.TOUCH_DT_DIMID;
quit;

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 05-02-2020 08:03 PM

Oh my goodness, table csd is used. Except the AND clause is in column 200!

Do yourself a favour and format your code properly!

So it looks like table csd should be an inner join too?

So maybe try

cdf

inner join

(most discriminating match, rda or csd?)

inner join

(rda or csd)

left join

af

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 05-03-2020 03:04 AM

Please supply some additional information.

Are these libraries local SAS libraries, or do you connect to a remote database?

How many observations do the datasets have, with regard to the subsets you do?

What are the attributes of the variables you pull (types, lengths)?

You also have stupid storage of dates. Fix that before you even continue.

This

input(cdf.case_id,best12.) as mbr_pgm_id

is also bad, keep id's in character.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

nirupama1 · Posted 05-03-2020 07:11 AM

These are local sas libraries. I know it is not appropriately written. The conditions are fine, it generates the desired output.
The actv_fac table is very huge (records in millions).
There is only need to minimize time.
I think the issue is with joins. I tried multiple ways but the result is not the same as the original query.

nirupama1 · Posted 05-03-2020 07:17 AM

Cse_dtl_fact and actv_fac both are very huge tables

ChrisNZ · Posted 05-03-2020 07:28 AM

Have you tried inner joins?

What # of records do you anticipate from each table before and after the joins?

Are the tables sorted? indexed?

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 05-03-2020 07:39 AM

@nirupama1 wrote:
Cse_dtl_fact and actv_fac both are very huge tables

What you call huge might be a little flyspeck for someone else. Please give us the numbers. Not "millions", but 5 millions, 10 millions, and so on. And the sizes of the required variables.

e. g. how many observations of ACTV_FACT satisfy the condition

CSE_SRC_SYS_CD in ('CPM')

and what are the attributes of CASE_ID and ACTV_ID?

This will determine if and where we can use in-memory techniques.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

nirupama1 · Posted 05-03-2020 08:11 AM

6 million observations fall under CSE_SRC_SYS_CD in ('CPM') along with where condition.
If I run the original query by simply removing join with the ref_dim table and where condition of ref_dim table, the output is 55k records.

nirupama1 · Posted 05-03-2020 08:14 AM

And this run within seconds but as soon as I join with ref_dim table, it executes in about 2days or more and output results in between13k-18k.

Kurt_Bremser · Posted 05-03-2020 08:16 AM

And what are the lengths of case_id, actv_id and actv_rsltn_otcome_dimid?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

nirupama1 · Posted 05-03-2020 08:37 AM

I am not sure what exactly u mean by length....
These ids are of 9-10 characters length.
case_id=9charactets
actv_id =10 characters
actv_rsltn_otcome_dimid=maybe negative number or 5-6 digits

Kurt_Bremser · Posted 05-03-2020 08:40 AM

Maxim 3: Know Your Data.

Run PROC CONTENTS on your tables, so you get to know the types and other attributes of your columns.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

nirupama1 · Posted 05-03-2020 08:51 AM

Length
case_id=30
actv_id =30
actv_rsltn_otcome_dimid=8

Kurt_Bremser · Posted 05-03-2020 09:44 AM

So this table would need roughly 6 million * 70 bytes in memory, less than 500 MB.

Let's start with the first join:

data call;
set hmaodm.cse_dtl_fact;
if _n_ = 1
then do;
  length
    actv_id $30
    actv_rsltn_otcome_dimid 8
  ;
  declare hash af (dataset:"hmaodm.actv_fact (where=(cse_src_sys_cd in ('CPM')))");
  af.definekey("case_id");
  af.definedata("actv_id","actv_rsltn_otcome_dimid");
  af.definedone();
  call missing(actv_id,actv_rsltn_otcome_dimid);
end;
rc = af.find();
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away