DATA Step, Macro, Functions and more

DATA step: Using IF-THEN versus AND

Reply
Contributor
Posts: 36

DATA step: Using IF-THEN versus AND

I ran across a paper on SAS programming efficiencies from 1999 (http://www.ssc.wisc.edu/sscc/pubs/4-3.pdf). One of the tips (#10) is to use a series of IF-THEN clauses instead of compund expressions with AND. The reason being that the Data step will stop evaluating the If-Then clauses as soon as one is false, and that the Data step will evaluate all of the AND conditions even if one is false. A colleague says the DATA step will stop evaluating as soon as a condition is false in both instances. Can anyone verify which is correct (the paper or my colleague)? It probably doesn't make a difference for small datasets, but I run programs that perform multiple searches like this on large datasets every week, and would like to minimize the time it takes for the programs to run. Thanks!

data needle;

set haystack;

if test1=0 and test2=1 and test3='straw' then output;

versus

data needle;

set haystack;

if test1=0

     then if test2=1

          then if test3='straw'

then output;

Super User
Posts: 5,099

DATA step: Using IF-THEN versus AND

Michelle,

These sorts of things occasionally change with new releases of the software (and to a lesser extent with the operating system).  The best thing to do would be to test it yourself.  It's really not that difficult.  For example, run this step 3 times and note the total run time:

data _null_;

test1=9;

test2=9;

test3='sheep';

do i=1 to 10000000;

    if test1=0 and test2=1 and test3='straw' then x=5;

end;

run;

You can vary the truth or falsity of the conditions, and the methods you use to test them.  Also, with a series of AND conditions, don't assume that the software checks them from left to right.  If you check the first condition being the only false one, also check what happens when the last condition is the only false one.

Let us know what you find.

PROC Star
Posts: 7,366

DATA step: Using IF-THEN versus AND

Michelle,

I would have thought that the paper was correct but, based on the three tests I just ran, I have to agree with your colleague.  I think that the three tests approximate the question you are asking and, interestingly, test1 perfromed the fastest in all of my tests (although, admittedly, I only ran two tests Smiley Happy):

data somedata;

  set sashelp.class;

  do i=1 to 1000000;

    output;

  end;

run;

data test1;

  set somedata;

  if 1 ne 1 and age gt 0 and age gt 0 and age gt 0 and

   age gt 0 and age gt 0 and age gt 0 and age gt 0 and

   age gt 0 and age gt 0 and age gt 0 and age gt 0 and

   age gt 0 and age gt 0 and age gt 0 and age gt 0 and

   age gt 0 and age gt 0 and age gt 0 and age gt 0 then

    check=0;

  else check=1;

run;

data test2;

  set somedata;

  if 1 ne 1 then if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then

    check=0;

  else check=1;

run;

data test3;

  set somedata;

  if 1 ne 1 then do;

   if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

   age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then

    check=0;

  end;

  else check=1;

run;

Respected Advisor
Posts: 3,777

Re: DATA step: Using IF-THEN versus AND

I think your test is "too noisy" with all the IO.  Still takes 1e10 iterations to show any difference and it is very small. "myth busted".

300 options fullstimer=1;

301 %let reps=1e10;

302 data _null_;

303 set sashelp.class;

304 do _n_ = 1 to &reps;

305 if 1 ne 1 and age gt 0 and age gt 0 and age gt 0 and

306 age gt 0 and age gt 0 and age gt 0 and age gt 0 and

307 age gt 0 and age gt 0 and age gt 0 and age gt 0 and

308 age gt 0 and age gt 0 and age gt 0 and age gt 0 and

309 age gt 0 and age gt 0 and age gt 0 and age gt 0 then

310 check=0;

311 else check=1;

312 end;

313 stop;

314 run;

NOTE: There were 1 observations read from the data set SASHELP.CLASS.

NOTE: DATA statement used (Total process time):

real time 48.78 seconds

user cpu time 48.36 seconds

system cpu time 0.01 seconds

Memory 211k

OS Memory 7024k

Timestamp 3/26/2012 1:07:26 PM

315

316

317 data _null_;

318 set sashelp.class;

319 do _n_ = 1 to &reps;

320 if 1 ne 1 then if age gt 0 then if age gt 0 then if age gt 0 then if

321 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

322 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

323 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

324 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then

325 check=0;

326 else check=1;

327 end;

328 stop;

329 run;

NOTE: There were 1 observations read from the data set SASHELP.CLASS.

NOTE: DATA statement used (Total process time):

real time 44.42 seconds

user cpu time 43.97 seconds

system cpu time 0.00 seconds

Memory 209k

OS Memory 7024k

Timestamp 3/26/2012 1:08:10 PM

330

331

332

333 data test3;

334 set sashelp.class;

335 do _n_ = 1 to &reps;

336 if 1 ne 1 then do;

337 if age gt 0 then if age gt 0 then if age gt 0 then if

338 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

339 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

340 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then if

341 age gt 0 then if age gt 0 then if age gt 0 then if age gt 0 then

342 check=0;

343 end;

344 else check=1;

345 end;

346 stop;

347 run;

NOTE: There were 1 observations read from the data set SASHELP.CLASS.

NOTE: The data set WORK.TEST3 has 0 observations and 6 variables.

NOTE: DATA statement used (Total process time):

real time 47.36 seconds

user cpu time 47.15 seconds

system cpu time 0.03 seconds

Memory 215k

OS Memory 7024k

Timestamp 3/26/2012 1:08:57 PM

Super User
Super User
Posts: 6,502

DATA step: Using IF-THEN versus AND

You also need to consider whether some of the conditions have side effects when they are evaluated.

For example since it returns a code you could include a FILENAME() function call in an IF statement.

So depending on how you structure your code the FILENAME statement may or may not execute.

Compare this:

x=0;

if x and 0=filename(fileref,'xxxx') then put 'YES'; else put 'NO';

put x= fileref=;

to this:

x=0;

if x then do;

  if 0=filename(fileref,'xxxx') then put 'YES'; else put 'NO';

end;

else put 'NO';

put x= fileref=;

Super User
Posts: 5,099

DATA step: Using IF-THEN versus AND

Michelle,

One more thing to consider.  IF/THEN statements are relatively quick.  As some of the test results have shown, your savings will be minimal even if you find the quickest method.  If you would be willing to share more of your code, chances are the posters here could recommend ways to save dramatically more by bringing up additional techniques to apply.

Good luck.

Contributor
Posts: 36

Re: DATA step: Using IF-THEN versus AND

I ran the tests suggested by Astounding, and for the most part,  'AND' and 'THEN IF' had the same times, regardless of which conditions were true/false.

I also ran the tests Art297 suggested, but subsitituted one of my own datasets. I tried varying which version went first, and don't see any obvious winner. The times were much more variable across the board. It seemed that usually whichever was the first version to run was the slowest (but not always).

Most of the code I use is some variation of searching for observations in a database of injuries related to consumer products, looking for errors in the coding of different variables. For example, I might look for observations mentioning push toys in a text field (cmt)  but aren't coded as involving push toys (prod=1326):

data errors;

   set all;

   if prod^=1326  

     and age<=12   

     and ((index(cmt,'PUSH') or index(narrative,'POPPER')) and index(cmt,'TOY'))

     and ^index(narrative,'TOYOT')

     then output;

The index search of the comment text field can get pretty funky with all the and's, or's & not's, but most of it is not very complicated. It just gets time consuming running everything together. So, if there are any best practices to structure the query and minimize time, I would love to know! I am not a programmer and relatively new to SAS. Thanks.

Super User
Posts: 5,099

Re: DATA step: Using IF-THEN versus AND

Michelle,

Thanks for posting the results.  These suggestions could improve performance, but by how much?  It depends on attributes of your data.

data errors;

set all;

where prod ^= 1326 and age <= 12;

* Order the IF conditions by frequency of occurrence, highest frequency first.  For example:  ;

if index(narrative, 'TOYOT') then delete;

if index(cmt, 'TOY');

if index(cmt, 'PUSH') or index(narrative, 'POPPER');

run;

Breaking this up into multiple statements is not a big help ... I'm only doing it because it helps you place the most frequent condition first in the list.  Same for switching to subsetting IF instead of OUTPUT.  That's a matter of style in this case, but not an efficiency consideration.

It's a good observation that the first test seems to run the slowest.  SAS evidently performs some set-up work behind the scenes ... things we can't really observe directly.

Contributor
Posts: 36

Re: DATA step: Using IF-THEN versus AND

Whoa, I am astounded... at my own ignorance... :smileyblush:  It never occurred to me that WHERE and IF could both be used in one data step (I inherited most of these programs and don't recall any sections that use both in one step)... Looking back at my SAS Programming I book, I do find two slides that show this. Thanks, I will try this out. I suspect you are right (it won't make a difference in most of the code), but there are a few particularly ugly and sluggish sections of code I will try this on.

PROC Star
Posts: 7,366

Re: DATA step: Using IF-THEN versus AND

Michelle, DataNull, as always, was correct in that my proposed test had too much noise.

For one, it used tests of the same variable multiple times.  At least with the where statement, SAS evaluates that and reduces the statement to its simplest form.  I presume that the if statements were doing the same thing.

As such, I changed the test and included various combinations of where and if.

The results:  Nested do statements, whether in the form of if thendo if then do, or if then do;end;etc.etc. performed the WORST.

Still not the best of tests, but I think it definitely supports throwing away the myth.  Try the following:

data somedata;

  set sashelp.class (rename=(age=age_in));

  array age(19);

  do i=1 to 19;

    age(i)=age_in;

  end;

  do i=1 to 20000;

    output;

  end;

run;

 

data test1;

  set somedata;

  if name eq 'Joyce' and age1 lt 0 and age2 gt 0 and age3 gt 0 and

   age4 gt 0 and age5 gt 0 and age6 gt 0 and age7 gt 0 and

   age8 gt 0 and age9 gt 0 and age10 gt 0 and age11 gt 0 and

   age12 gt 0 and age13 gt 0 and age14 gt 0 and age15 gt 0 and

   age16 gt 0 and age17 gt 0 and age18 gt 0 and age19 gt 0;

run;

data test2;

  set somedata (where=(name eq 'Joyce'));

  if age1 lt 0 and age2 gt 0 and age3 gt 0 and

   age4 gt 0 and age5 gt 0 and age6 gt 0 and age7 gt 0 and

   age8 gt 0 and age9 gt 0 and age10 gt 0 and age11 gt 0 and

   age12 gt 0 and age13 gt 0 and age14 gt 0 and age15 gt 0 and

   age16 gt 0 and age17 gt 0 and age18 gt 0 and age19 gt 0;

run;

data test3;

  set somedata;

  where name eq 'Joyce';

  if age1 lt 0 and age2 gt 0 and age3 gt 0 and

   age4 gt 0 and age5 gt 0 and age6 gt 0 and age7 gt 0 and

   age8 gt 0 and age9 gt 0 and age10 gt 0 and age11 gt 0 and

   age12 gt 0 and age13 gt 0 and age14 gt 0 and age15 gt 0 and

   age16 gt 0 and age17 gt 0 and age18 gt 0 and age19 gt 0;

run;

data test4;

  set somedata;

  where name eq 'Joyce'

   and age1 lt 0 and age2 gt 0 and age3 gt 0 and

   age4 gt 0 and age5 gt 0 and age6 gt 0 and age7 gt 0 and

   age8 gt 0 and age9 gt 0 and age10 gt 0 and age11 gt 0 and

   age12 gt 0 and age13 gt 0 and age14 gt 0 and age15 gt 0 and

   age16 gt 0 and age17 gt 0 and age18 gt 0 and age19 gt 0;

run;

data test5;

  set somedata;

  if name eq 'Joyce'

   then if age1 lt 0

   then if age2 gt 0

   then if  age3 gt 0

   then if age4 gt 0

   then if age5 gt 0

   then if age6 gt 0

   then if age7 gt 0

   then if age8 gt 0

   then if age9 gt 0

   then if age10 gt 0

   then if age11 gt 0

   then if age12 gt 0

   then if age13 gt 0

   then if age14 gt 0

   then if age15 gt 0

   then if age16 gt 0

   then if age17 gt 0

   then if age18 gt 0

   then if age19 gt 0;

run;

data test6;

  set somedata;

  if name eq 'Joyce' then do;

   if age1 lt 0 then do;

    if age2 gt 0 then do;

     if  age3 gt 0 then do;

      if age4 gt 0 then do;

       if age5 gt 0 then do;

        if age6 gt 0 then do;

         if age7 gt 0 then do;

          if age8 gt 0 then do;

           if age9 gt 0 then do;

            if age10 gt 0 then do;

             if age11 gt 0 then do;

              if age12 gt 0 then do;

               if age13 gt 0 then do;

                if age14 gt 0 then do;

                 if age15 gt 0 then do;

                  if age16 gt 0 then do;

                   if age17 gt 0 then do;

                    if age18 gt 0 then do;

                     if age19 gt 0 then do;

                     end;

                    end;

                   end;

                  end;

                 end;

                end;

               end;

              end;

             end;

            end;

           end;

          end;

         end;

        end;

       end;

      end;

     end;

    end;

   end;

  end;

run;

Super User
Posts: 5,099

DATA step: Using IF-THEN versus AND

Michelle,

I think I posted a little hastily.  The entire IF statement could become a WHERE statement (if you remove the "THEN OUTPUT" part).  WHERE knows how to apply the INDEX function.

One last item to consider is that it may be possible to eliminate some of the INDEX functions.  It depends on what the searched fields actually contain.  Compare these two:

index(cmt, 'TOY')

cmt =: 'TOY'

INDEX searches through all of CMT, looking for TOY.  But =: checks just the first three characters to see if they are TOY.  That will be much faster, but is it sufficient?  You may have cases where the string you are looking for would always appear at the beginning of a variable (if it appears at all). 

Good luck.

Ask a Question
Discussion stats
  • 10 replies
  • 983 views
  • 1 like
  • 5 in conversation