BookmarkSubscribeRSS Feed
EMC9
Obsidian | Level 7

Hi

i have a piece of code that takes on average 2hrs two run due to the size of the data.

below is the code sample.

is there more better way i can code this.

e.g.

data Data_set_name;

set Data_set_name;

where channel = 'PHONE'

or channel = 'BRN'

or channel = 'BRANCH'

or channel = 'FD'

or channel = 'OKS'

or channel = 'FFI'

or channel = 'FIRSTDIRECT'

or channel = 'OCC'

or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS'

and contacted = "Y"

and Direct_Marketing = "Y";

keep id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing;

run;

19 REPLIES 19
AhmedAl_Attar
Rhodochrosite | Level 12

Allistair,

Try this modified version of your data step. It only reads the columns you require into the PDV, and uses shorter where clause condition.

DATA Data_set_name;

    SET Data_set_name(KEEP=id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing

                    WHERE=( channel IN ('PHONE','BRN','BRANCH','FD','OKS','FFI','FIRSTDIRECT','OCC','OTHERCALLCENTRE','FPAS')

                            AND contacted = "Y"

                            AND Direct_Marketing = "Y"));

RUN;

Please keep in mind, performance enhancements do not rely on code modifications only!! Sometimes additional changes will be required in order to enhance response times, such as

1. Adding an index for the channel column. (If contacted and Direct_Marketing only contain Y/N values, then no index required)

2. The Physical location of the data set.

     2.1 Network Access --> Network bandwidth limitation

     2.2 Database Access --> Changes into the Read Buffer size

     2.3 Disk --> Change stripe widths of 64 or 128K

Check the following blog, it includes links for additional insight into resolving performance issues

Improving performance: Determine the cause

Good luck,

Ahmed

Jagadishkatam
Amethyst | Level 16

Please try

data Data_set_name;

set Data_set_name(keep = id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing);

where prxmatch('/PHONE|BRN|BRANCH|FD|OKS|FFI|FIRSTDIREC|FIRSTDIRECT|OCC|OTHERCALLCENTRE|FPAS/',channel)>0

and contacted='Y' and Direct_Marketing = "Y";

run;

Thanks,

Jag

Thanks,
Jag
Astounding
PROC Star

Several recommendations that might allow the WHERE statement to evaluate faster.

Examine the percentage of observations that satisfy each condition in the WHERE statement.  Reorder the conditions so that the WHERE statement can make a decision as early as possible.  For example, if only 5% of the observations meet the condition CONTACTED="Y", place that first:

where contacted="Y" and

(channel='PHONE' or ... or channel='FPAS')

and Direct_Marketing="Y";

Next, add those parentheses around the CHANNEL values.  When you don't, it gets confusing to interpret and to be confident that you are asking for the right set of observations.

Finally, within the list of CHANNEL values, order them from MOST to LEAST frequent occurrence.  Again, that let's the WHERE statement find a match as early as possible so it doesn't have to check the remaining values.

In summary, AND conditions should be ordered from least likely to most likely, but OR conditions should be ordered from most likely to least likely.

Good luck.

art297
Opal | Level 21

Allistair,

I agree with both of Astounding recommendations and the other that appeared in the other two recommendations I saw (i.e., only keeping the data you need).

That last one will be most effective if you aren't keeping a lot of variables.  I just ran a test case where I was dropping a 1,000 variables from the run and it, alone, caused a 75% reduction in running time.

However, you will also need to include contacted in your keep statement, but then drop it in your data statement.  i.e.:

data Data_set_name4 (drop=contacted);

  set Data_set_name (keep=id_no cust_no cust_cde dataset_name

      date_end channel description purpose prod_hse credit_campaign

      Direct_Marketing contacted);

ScottBass
Rhodochrosite | Level 12

I agree with the above, and add:

1) +1 for the suggestion to use indexes.

2) We've seen good performance gains by using a "SPDE Work" library.  There are some caveats you need to be aware of (see the doc).  But, it's really easy to implement:

libname spdework spde "%sysfunc(pathname(work))" temp=yes;

data spdework.Data_set_name4 (drop=contacted);

  set spdework.Data_set_name (keep=id_no cust_no cust_cde dataset_name

      date_end channel description purpose prod_hse credit_campaign

      Direct_Marketing contacted);

I'm assuming your source data is a work dataset since it has a one-level name.

You can simplify this further by using the options user=spdework; statement.  This redirects one level names to the specified library.


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Patrick
Opal | Level 21

On top of all the good advice already given:

- Use option "fullstimer" so you can see in the log where you're actually spending the time SAS(R) 9.4 Language Reference: Concepts, Second Edition

- Have permanent datasets compressed to decrease disk I/O.

- If the source data set is stored in a data base: Code your where clause in a way that it gets fully pushed to the data base for execution. Use "OPTIONS SASTRACE=',,,d' SASTRACELOC=SASLOG NOSTSUFFIX;" to see in the log what part of processing actually gets sent to the DB.

If analysing the log using "fullstimer" tells you it's about I/O then the following will be of interest to you:

SAS(R) 9.4 Language Reference: Concepts, Second Edition

umashankersaini
Quartz | Level 8

Hi,

Please check this one. this could also help.

https://communities.sas.com/message/172882#172882.

Regards

Uma Shanker Saini

LinusH
Tourmaline | Level 20

I never noticed that the order in the where cleasu is affecting performance. What is, is the use of index or not. Whether is optimal to create index or not, it is its own subject area which is discussed in each each project I would say... If you decide to have indexes, and your data is in SAS, I would consider SPDE since it handles indexes much more efficient than the Base SAS engine (especially creating and updating).

Then I wonder if your where cleuse is doing what you want, AND ties "harder" than OR, so your logic contacted = "Y" and Direct_Marketing = "Y" is only evaluated for channel FPAS, is that what you want?

Otherwise, replace the whole OR block with an IN(), easier to code, and perhaps avoids simple coding errors...

Data never sleeps
Astounding
PROC Star

LinusH,

I ran a few tests to explore results of changing order in the WHERE statement.  I was able to confirm that the order of AND conditions makes a difference, but not the order of OR conditions.  Here's one program I used:


data test;
   y=5;
   do x=1 to 10000000;
      output;
   end;
   retain var3 - var10 0;
run;

data test2;
   set test;
   where y < 0 and x > 0;
run;

data test3;
   set test;
   where x > 0 and y < 0;
run;

In real life, I ran both DATA steps several times to confirm that the CPU time is different.  In the case of OR conditions, I suspect that SAS changes them internally, converting them to an IN operator.  But I don't think I found the whole story on that.

Of course, the differences will be small because the WHERE statement accounts for just a small percentage of the work in a DATA step.

LinusH
Tourmaline | Level 20

Yeah, you are correct. And the difference on CPU time is about 20%, so it does have some impact.

I guess that since these variables are not indexed, and therefore have no stats of the contents in the variable, SAS can't guess in what order it should evaluate the where clause. So this is a clear differentiation to RDBMS where you usually have the possibility to calculate statistics for a table, indexed or not.

Data never sleeps
AndersS
Pyrite | Level 9

Hi! I agree with Linus!    The WHERE-statements below are exactly the same:

where channel = 'PHONE'    or channel = 'BRN'     or channel = 'BRANCH'     or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'     or channel = 'FIRSTDIRECT'     or channel = 'OCC'     or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS'    and contacted = "Y"    and Direct_Marketing = "Y";

and

where channel = 'PHONE'    or channel = 'BRN'    or channel = 'BRANCH'    or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'    or channel = 'FIRSTDIRECT'    or channel = 'OCC'    or channel = 'OTHERCALLCENTRE'

or (channel = 'FPAS'    and contacted = "Y"    and Direct_Marketing = "Y");

Then of course it all depends on "todays data" if it shows up "today", tomorrow, or some other day.To verify that the statements have EXACTLY the same meaning by performing tests, requires a test data set which "spans all the different values". NOT AT ALL easy to build!

Anders Sköllermo (Skollermo in English)
art297
Opal | Level 21

Yes they are the same, but I really have to wonder if the OP realized that?  I.e., I think that the OP has to clarify whether he/she was really trying to accomplish:

where channel = 'PHONE'    or channel = 'BRN'     or channel = 'BRANCH'     or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'     or channel = 'FIRSTDIRECT'     or channel = 'OCC'     or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS'    and contacted = "Y"    and Direct_Marketing = "Y";

or

where (channel = 'PHONE'    or channel = 'BRN'     or channel = 'BRANCH'     or channel = 'FD'    or channel = 'OKS'

or channel = 'FFI'     or channel = 'FIRSTDIRECT'     or channel = 'OCC'     or channel = 'OTHERCALLCENTRE'

or channel = 'FPAS')    and contacted = "Y"    and Direct_Marketing = "Y";

AndersS
Pyrite | Level 9

Hi! I agree completely!

Anders Sköllermo (Skollermo in English)
Tom
Super User Tom
Super User

Watch out because AND is evaluated before OR in the order of operations precedence for SAS.   SAS Operators : Order of Evaluation of SAS Operators

Note that SAS will show you how it has interpreted your WHERE condition in the notes.

52   data one ;

53    set test;

54    where c=1 or c=2 or c=3 and a and b;

55   run;

NOTE: There were 9 observations read from the data set WORK.TEST.

      WHERE c in (1, 2) or ((c=3) and a and b);

56   data two;

57    set test;

58    where a and b or c=1 or c=2 or c=3 ;

59   run;

NOTE: There were 14 observations read from the data set WORK.TEST.

      WHERE (a and b) or c in (1, 2, 3);

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 1755 views
  • 6 likes
  • 12 in conversation