DATA Step, Macro, Functions and more

diffrence if and where in data step

Reply
Regular Contributor
Posts: 241

diffrence if and where in data step

Hello,

Would you please tell me what's the difference of a1 and a2?

Thanks

 

data a1;

set sashelp.class;

where sex ne 'M';

run;

data a2;

set sashelp.class;

if sex eq 'F';

run;

PROC Star
Posts: 325

Re: diffrence if and where in data step

[ Edited ]
Posted in reply to GeorgeSAS

There will be no difference in output, because sex has only 2 values. But if you variable which has more than 2 values, results will vary.

 For other rules, where both can be used or not, please look into below paper below

 

http://www2.sas.com/proceedings/sugi31/238-31.pdf

Trusted Advisor
Posts: 1,909

Re: diffrence if and where in data step

Posted in reply to GeorgeSAS

One of the differences is that different operators work in the WHERE statement, than in the IF statement. The one that comes to mind is the ? (or CONTAINS) operator, which work in the WHERE statement but not in the IF statement. There are probably other operators that are different, but I cannot think of any right now.

Super User
Posts: 5,424

Re: diffrence if and where in data step

Posted in reply to GeorgeSAS
Basically IF can deal with logic and work with data within the PDV.
WHERE does filtering only, and work with data prior it is loaded into the PDV. Therefore it is generally slighy faster than a corresponding subsetting IF, and it can utilize indexes.
Data never sleeps
Super User
Posts: 11,343

Re: diffrence if and where in data step

Posted in reply to GeorgeSAS

Where also only works on variables coming from a dataset referenced in the SET statement. For example:

data junk;
   set sashelp.class;
   junk= age*height;
   where junk > 5;
run;

will generate the following error:

 

97   data junk;
98      set sashelp.class;
99      junk= age*height;
100     where junk > 5;
ERROR: Variable junk is not on file SASHELP.CLASS.
101  run;

Where also evaluates the value of the variable on "read" of the Set dataset not any modified value.

 

data a1;
set sashelp.class;
if name= 'Alice' then call missing(sex);
where sex eq 'F';

run;

Will have Alice in the output because the value for Sex in Sashelp.class is 'F' even though it was modified before writing.

 

WHERE is a "declarative" statement type and position in the code doesn't matter similar to FORMAT, Label and several other statement, and IF is "executable" and the value of variables is considered at the the order of operations in the data step.

PROC Star
Posts: 7,467

Re: diffrence if and where in data step

[ Edited ]
Posted in reply to GeorgeSAS

@GeorgeSAS: Hopefully this won't confuse you, but a couple of the comments that were made inspired me to write and run some tests.

 

First, aside from the processing differences between using if and where, yes, they could end up producing different results.

Specifically, if the dataset contained any records where sex was missing, they would be selected by using ne 'M', but wouldn't be selected if using eq 'F'

 

Someone mentioned that the where statement should be faster, especially with indexed files.

 

Later in this post I'll show the code I ran but, basically, I compared using if (with eq 'F' and ne 'M'), the where statement using both conditions, and where as an option using both conditions. In all cases I used data _null_ statements so that the results wouldn't be clouded with output processing and competition with other users. Normally, I'd be principally concerned with CPU times but, in this case, the real time differences were conflicting with the CPU time results.

 

I compared all six conditions, 100 times (in random order), on a test file having 19 million records. I ran the test twice, once for an indexed file, and then for a non-indexed file.

 

Which had the fastest CPU times? Surprisingly, the two if statements, regardless of whether the test file was indexed or not.

Which had the fastest real times? The where option and where statement when set to eq 'F' .. both for indexed and non-indexed files.

 

Here is the code I ran:

 

%let printto_loc=/folders/myfolders/printo_loc;
data test;
  set sashelp.class;
  do _n_=1 to 1000000;
    output;
  end;
run;

data test (index= (sex));
  set sashelp.class;
  do _n_=1 to 1000000;
    output;
  end;
run;

data solutions;
  length solution $500.;
  input;
  solution=_infile_;
  cards4;
title 'where statement ne m';  data _null_;set test;where sex ne 'M';run;
title 'where statement eq f';  data _null_;set test;where sex eq 'F';run;
title 'if statement eq f';  data _null_;set test;if sex eq 'F';run;
title 'if statement ne m';  data _null_;set test;if sex ne 'M';run;
title 'where option ne m';  data _null_;set test (where =(sex ne 'M'));run;
title 'where option eq f';  data _null_;set test (where=(sex eq 'F'));run;
;;;;

%macro test;
  proc printto log="&printto_loc." new;
  run;
  %do i=1 %to 100;
    data solutions;
      set solutions;
      rand=uniform(0);
    run;

    proc sort data=solutions;
      by rand;
    run;

    data _null_;
      set solutions;
      call execute(solution);
    run;
  %end;
  
  proc printto;
  run;

%mend test;


%test

data compare;
  infile "&printto_loc.";
  informat method $23.;
  input @'title' method &
        @'real time' real
        @'cpu time' cpu;
/*   suggestion=compress(suggestion,'"'); */
run;

title 'Comparing CPU time on Non-Indexed File';
proc anova data=compare;
  class method;
  model cpu=method;
  means method / tukey;
run;

title 'Comparing REAL time on NonIndexed File';
proc anova data=compare;
  class method;
  model real=method;
  means method / tukey;
run;
 

I used the following code to create the indexed file:

data test (index= (sex));
  set sashelp.class;
  do _n_=1 to 1000000;
    output;
  end;
run;

 

Of course, others might have some critical concerns regarding my tests .. which is why I included the code.

 

Art, CEO, AnalystFinder.com

PROC Star
Posts: 1,322

Re: diffrence if and where in data step

Thanks @art297 .  That is interesting. 

 

In playing a bit, looks like if you select a small proportion of records, the WHERE starts to win.

 

So what's the conclusion from this?

 

We know that the IF statement requires reading data into the PDV, which the WHERE statement avoids.  But clearly the look ahead (if that is what it is) used by the WHERE statement to filter data before it is read into the PDV is also expensive.

 

Perhaps the cost-benefit of WHERE vs IF varies with I/O, proportion selected, proportion of variables that are NOT in the subsetting criteria, etc.

 

So the conclusion perhaps is when efficiency matters, test both?

PROC Star
Posts: 1,759

Re: diffrence if and where in data step

[ Edited ]
Posted in reply to GeorgeSAS

To add a few elements:

 

1- IF has access to some features (such as the _N_ or first. variables, or the vvalue() function) that WHERE cannot use.
   Likewise WHERE has access to some operators (such as contains or like) that IF cannot use.


2- The performance can be very different.
See the example below taken from https://communities.sas.com/t5/Base-SAS-Programming/Using-MERGE-with-LIKE-command/td-p/301405

%* CPU times and real times are equal;

options nofullstimer;

%* Create sample and load it in memory;

data A(compress=no); A='ssssssssssssssssss'; drop I; do I=1 to 1e8; output; end; run;

sasfile A load;

%* Select none;

data _null_; set A ( where=( A like '%X%' )) ; run; %*  5.8 seconds;

data _null_; set A ( where=( index(A,'X') )) ; run; %*  7.9 seconds;

data _null_; set A ( where=( find(A,'X') ))  ; run; %*  8.9 seconds;

data _null_; set A ; if find(A,'X')          ; run; %*  4.4 seconds;

%* Select all;

data _null_; set A ( where=( A like '%s%' )) ; run; %* 10.6 seconds;

data _null_; set A ( where=( index(A,'s') )) ; run; %* 13.8 seconds;

data _null_; set A ( where=( find(A,'s') ))  ; run; %* 14.0 seconds;

data _null_; set A ; if find(A,'s')          ; run; %*  3.4 seconds;

sasfile A close;

Also note how the IF statement becomes faster for the "select all" test since the string match happens right away, whereas the WHERE clauses become slower even though the string match also happens sooner. Not good!

It looks like SAS wastes time slow-reading, then assessing, then slow-loading the data.

Super User
Posts: 10,018

Re: diffrence if and where in data step

How ironical .

PROC Star
Posts: 1,322

Re: diffrence if and where in data step

That's interestijng @ChrisNZ.

 

I'm surprised at how much faster the IF statement is in that example.  Importantly, since the dataset only has one variable, I don't think there is much opportunity for the WHERE to be faster, but I wouldn't expect it to be that much slower.

 

I think it makes sense that the WHERE would be slower when it selects all, because in that case the step needs to do the WHERE filtering and then read every record into the PDV.  When it selects 0 record nothing is read into the PDV.  I wouldn't think the FIND() actually works differently on a WHERE clause vs IF statement.

 

Even if the variable A is just $1, you can still see the difference:

46   data A(compress=no); A='s'; drop I; do I=1 to 1e8; output; end; run;

NOTE: The data set WORK.A has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           1.94 seconds
      cpu time            1.96 seconds


47   sasfile A load;
NOTE: The file WORK.A.DATA has been loaded into memory by the SASFILE statement.
48
49   %* Select none;
50   data _null_; set A ( where=( find(A,'X') ))  ; run;

NOTE: There were 0 observations read from the data set WORK.A.
      WHERE FIND(A, 'X');
NOTE: DATA statement used (Total process time):
      real time           7.14 seconds
      cpu time            7.14 seconds


51   data _null_; set A ; if find(A,'X')          ; run;

NOTE: There were 100000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
      real time           2.73 seconds
      cpu time            2.73 seconds


52
53   %* Select all;
54   data _null_; set A ( where=( find(A,'s') ))  ; run;

NOTE: There were 100000000 observations read from the data set WORK.A.
      WHERE FIND(A, 's');
NOTE: DATA statement used (Total process time):
      real time           12.02 seconds
      cpu time            12.02 seconds


55   data _null_; set A ; if find(A,'s')          ; run;

NOTE: There were 100000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
      real time           2.76 seconds
      cpu time            2.76 seconds


56
57   sasfile A close;
NOTE: The file WORK.A.DATA has been closed by the SASFILE statement.

 

Then again, maybe find() is funky.  WIth the $1 version, swictched to just testing equality.  WHERE and IF basically tie except when the WHERE approach selects all records:

 

77   data _null_; set A ( where= (A='X') )  ; run;

NOTE: There were 0 observations read from the data set WORK.A.
      WHERE A='X';
NOTE: DATA statement used (Total process time):
      real time           2.04 seconds
      cpu time            2.04 seconds


78   data _null_; set A ; if (A='X')          ; run;

NOTE: There were 100000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
      real time           2.26 seconds
      cpu time            2.26 seconds


79
80   %* Select all;
81   data _null_; set A ( where=( A='s') )  ; run;

NOTE: There were 100000000 observations read from the data set WORK.A.
      WHERE A='s';
NOTE: DATA statement used (Total process time):
      real time           6.58 seconds
      cpu time            6.63 seconds


82   data _null_; set A ; if (A='s')          ; run;

NOTE: There were 100000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
      real time           2.18 seconds
      cpu time            2.18 seconds
PROC Star
Posts: 1,759

Re: diffrence if and where in data step

[ Edited ]

@Quentin Nothing special about FIND(), it's all functions I reckon:

options nofullstimer;
%* Create sample and load it in memory;
data A(compress=no); A='s'; drop I; do I=1 to 1e8; output; end; run;
sasfile A load;
%* Select none;
data _null_; set A ( where=( put(A,$1.)='X' )) ; run; %*  8.5 seconds;
data _null_; set A ; if      put(A,$1.)='X'    ; run; %*  3.1 seconds;
%* Select all;
data _null_; set A ( where=( put(A,$1.)='s' )) ; run; %* 20.0 seconds;
data _null_; set A ; if      put(A,$1.)='s'    ; run; %*  3.0 seconds;
sasfile A close;

WHERE is indeed faster if no function is used *and* most records are not loaded in the PDV:

%* Select none;
data _null_; set A ; where A>'s' ; run; %*  1.5 seconds;
data _null_; set A ; if    A>'s' ; run; %*  2.7 seconds;
%* Select all;
data _null_; set A ; where A>'r' ; run; %*  7.5 seconds;
data _null_; set A ; if    A>'r' ; run; %*  2.7 seconds;

The break-even point here seems to be where about 20% of records get loaded.

data A(compress=no); A='s'; do I=1 to 1e8;if I>2e7 then A='a'; output; end; run;
sasfile A load;
data _null_; set A ; where A>'r' ; run; %*  2.7 seconds;
data _null_; set A ; if    A>'r' ; run; %*  2.7 seconds;

It is very disappointing that functions in a WHERE clause should be slower. There seems to be ample room for optimisation of the software here.

Ask a Question
Discussion stats
  • 10 replies
  • 202 views
  • 4 likes
  • 9 in conversation