Efficient coding

Reply
Contributor
Posts: 26

Efficient coding

Hi

i have a piece of code that takes on average 2hrs two run due to the size of the data.

below is the code sample.

is there more better way i can code this.

e.g.

data Data_set_name;

set Data_set_name;

where channel = 'PHONE'

or channel = 'BRN'

or channel = 'BRANCH'

or channel = 'FD'

or channel = 'OKS'

or channel = 'FFI'

or channel = 'FIRSTDIRECT'

or channel = 'OCC'

or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS'

and contacted = "Y"

and Direct_Marketing = "Y";

keep id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing;

run;

Regular Contributor
Posts: 216

Re: Efficient coding

Posted in reply to Allistair

Allistair,

Try this modified version of your data step. It only reads the columns you require into the PDV, and uses shorter where clause condition.

DATA Data_set_name;

    SET Data_set_name(KEEP=id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing

                    WHERE=( channel IN ('PHONE','BRN','BRANCH','FD','OKS','FFI','FIRSTDIRECT','OCC','OTHERCALLCENTRE','FPAS')

                            AND contacted = "Y"

                            AND Direct_Marketing = "Y"));

RUN;

Please keep in mind, performance enhancements do not rely on code modifications only!! Sometimes additional changes will be required in order to enhance response times, such as

1. Adding an index for the channel column. (If contacted and Direct_Marketing only contain Y/N values, then no index required)

2. The Physical location of the data set.

     2.1 Network Access --> Network bandwidth limitation

     2.2 Database Access --> Changes into the Read Buffer size

     2.3 Disk --> Change stripe widths of 64 or 128K

Check the following blog, it includes links for additional insight into resolving performance issues

Improving performance: Determine the cause

Good luck,

Ahmed

Trusted Advisor
Posts: 1,137

Re: Efficient coding

Posted in reply to Allistair

Please try

data Data_set_name;

set Data_set_name(keep = id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing);

where prxmatch('/PHONE|BRN|BRANCH|FD|OKS|FFI|FIRSTDIREC|FIRSTDIRECT|OCC|OTHERCALLCENTRE|FPAS/',channel)>0

and contacted='Y' and Direct_Marketing = "Y";

run;

Thanks,

Jag

Thanks,
Jag
Super User
Posts: 5,497

Re: Efficient coding

Posted in reply to Allistair

Several recommendations that might allow the WHERE statement to evaluate faster.

Examine the percentage of observations that satisfy each condition in the WHERE statement.  Reorder the conditions so that the WHERE statement can make a decision as early as possible.  For example, if only 5% of the observations meet the condition CONTACTED="Y", place that first:

where contacted="Y" and

(channel='PHONE' or ... or channel='FPAS')

and Direct_Marketing="Y";

Next, add those parentheses around the CHANNEL values.  When you don't, it gets confusing to interpret and to be confident that you are asking for the right set of observations.

Finally, within the list of CHANNEL values, order them from MOST to LEAST frequent occurrence.  Again, that let's the WHERE statement find a match as early as possible so it doesn't have to check the remaining values.

In summary, AND conditions should be ordered from least likely to most likely, but OR conditions should be ordered from most likely to least likely.

Good luck.

PROC Star
Posts: 7,467

Re: Efficient coding

Posted in reply to Allistair

Allistair,

I agree with both of Astounding recommendations and the other that appeared in the other two recommendations I saw (i.e., only keeping the data you need).

That last one will be most effective if you aren't keeping a lot of variables.  I just ran a test case where I was dropping a 1,000 variables from the run and it, alone, caused a 75% reduction in running time.

However, you will also need to include contacted in your keep statement, but then drop it in your data statement.  i.e.:

data Data_set_name4 (drop=contacted);

  set Data_set_name (keep=id_no cust_no cust_cde dataset_name

      date_end channel description purpose prod_hse credit_campaign

      Direct_Marketing contacted);

Super Contributor
Posts: 387

Re: Efficient coding

Posted in reply to Allistair

I agree with the above, and add:

1) +1 for the suggestion to use indexes.

2) We've seen good performance gains by using a "SPDE Work" library.  There are some caveats you need to be aware of (see the doc).  But, it's really easy to implement:

libname spdework spde "%sysfunc(pathname(work))" temp=yes;

data spdework.Data_set_name4 (drop=contacted);

  set spdework.Data_set_name (keep=id_no cust_no cust_cde dataset_name

      date_end channel description purpose prod_hse credit_campaign

      Direct_Marketing contacted);

I'm assuming your source data is a work dataset since it has a one-level name.

You can simplify this further by using the options user=spdework; statement.  This redirects one level names to the specified library.

Respected Advisor
Posts: 4,173

Re: Efficient coding

Posted in reply to Allistair

On top of all the good advice already given:

- Use option "fullstimer" so you can see in the log where you're actually spending the time SAS(R) 9.4 Language Reference: Concepts, Second Edition

- Have permanent datasets compressed to decrease disk I/O.

- If the source data set is stored in a data base: Code your where clause in a way that it gets fully pushed to the data base for execution. Use "OPTIONS SASTRACE=',,,d' SASTRACELOC=SASLOG NOSTSUFFIX;" to see in the log what part of processing actually gets sent to the DB.

If analysing the log using "fullstimer" tells you it's about I/O then the following will be of interest to you:

SAS(R) 9.4 Language Reference: Concepts, Second Edition

Frequent Contributor
Posts: 118

Re: Efficient coding

Posted in reply to Allistair

Hi,

Please check this one. this could also help.

https://communities.sas.com/message/172882#172882.

Regards

Uma Shanker Saini

Super User
Posts: 5,424

Re: Efficient coding

Posted in reply to Allistair

I never noticed that the order in the where cleasu is affecting performance. What is, is the use of index or not. Whether is optimal to create index or not, it is its own subject area which is discussed in each each project I would say... If you decide to have indexes, and your data is in SAS, I would consider SPDE since it handles indexes much more efficient than the Base SAS engine (especially creating and updating).

Then I wonder if your where cleuse is doing what you want, AND ties "harder" than OR, so your logic contacted = "Y" and Direct_Marketing = "Y" is only evaluated for channel FPAS, is that what you want?

Otherwise, replace the whole OR block with an IN(), easier to code, and perhaps avoids simple coding errors...

Data never sleeps
Super User
Posts: 5,497

Re: Efficient coding

LinusH,

I ran a few tests to explore results of changing order in the WHERE statement.  I was able to confirm that the order of AND conditions makes a difference, but not the order of OR conditions.  Here's one program I used:


data test;
   y=5;
   do x=1 to 10000000;
      output;
   end;
   retain var3 - var10 0;
run;

data test2;
   set test;
   where y < 0 and x > 0;
run;

data test3;
   set test;
   where x > 0 and y < 0;
run;

In real life, I ran both DATA steps several times to confirm that the CPU time is different.  In the case of OR conditions, I suspect that SAS changes them internally, converting them to an IN operator.  But I don't think I found the whole story on that.

Of course, the differences will be small because the WHERE statement accounts for just a small percentage of the work in a DATA step.

Super User
Posts: 5,424

Re: Efficient coding

Posted in reply to Astounding

Yeah, you are correct. And the difference on CPU time is about 20%, so it does have some impact.

I guess that since these variables are not indexed, and therefore have no stats of the contents in the variable, SAS can't guess in what order it should evaluate the where clause. So this is a clear differentiation to RDBMS where you usually have the possibility to calculate statistics for a table, indexed or not.

Data never sleeps
Occasional Contributor
Posts: 19

Re: Efficient coding

Posted in reply to Allistair

Hi! I agree with Linus!    The WHERE-statements below are exactly the same:

where channel = 'PHONE'    or channel = 'BRN'     or channel = 'BRANCH'     or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'     or channel = 'FIRSTDIRECT'     or channel = 'OCC'     or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS'    and contacted = "Y"    and Direct_Marketing = "Y";

and

where channel = 'PHONE'    or channel = 'BRN'    or channel = 'BRANCH'    or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'    or channel = 'FIRSTDIRECT'    or channel = 'OCC'    or channel = 'OTHERCALLCENTRE'

or (channel = 'FPAS'    and contacted = "Y"    and Direct_Marketing = "Y");

Then of course it all depends on "todays data" if it shows up "today", tomorrow, or some other day.To verify that the statements have EXACTLY the same meaning by performing tests, requires a test data set which "spans all the different values". NOT AT ALL easy to build!

Anders Sköllermo
Ph.D., Swedish Actuary

"Both Age and IQ are 69+"
"Retired, But Not Tired"

Sandgränd 13, S-178 40 Ekerö
email: anders.skollermo@one.se
tele: 00468968419 mobil: 0046735077373
PROC Star
Posts: 7,467

Re: Efficient coding

Yes they are the same, but I really have to wonder if the OP realized that?  I.e., I think that the OP has to clarify whether he/she was really trying to accomplish:

where channel = 'PHONE'    or channel = 'BRN'     or channel = 'BRANCH'     or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'     or channel = 'FIRSTDIRECT'     or channel = 'OCC'     or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS'    and contacted = "Y"    and Direct_Marketing = "Y";

or

where (channel = 'PHONE'    or channel = 'BRN'     or channel = 'BRANCH'     or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'     or channel = 'FIRSTDIRECT'     or channel = 'OCC'     or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS')    and contacted = "Y"    and Direct_Marketing = "Y";

Occasional Contributor
Posts: 19

Re: Efficient coding

Hi! I agree completely!

Anders Sköllermo
Ph.D., Swedish Actuary

"Both Age and IQ are 69+"
"Retired, But Not Tired"

Sandgränd 13, S-178 40 Ekerö
email: anders.skollermo@one.se
tele: 00468968419 mobil: 0046735077373
Super User
Super User
Posts: 7,039

Re: Efficient coding

Posted in reply to Allistair

Watch out because AND is evaluated before OR in the order of operations precedence for SAS.   SAS Operators : Order of Evaluation of SAS Operators

Note that SAS will show you how it has interpreted your WHERE condition in the notes.

52   data one ;

53    set test;

54    where c=1 or c=2 or c=3 and a and b;

55   run;

NOTE: There were 9 observations read from the data set WORK.TEST.

      WHERE c in (1, 2) or ((c=3) and a and b);

56   data two;

57    set test;

58    where a and b or c=1 or c=2 or c=3 ;

59   run;

NOTE: There were 14 observations read from the data set WORK.TEST.

      WHERE (a and b) or c in (1, 2, 3);

Ask a Question
Discussion stats
  • 19 replies
  • 694 views
  • 6 likes
  • 12 in conversation