BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mirisage
Obsidian | Level 7

Hi SAS Community,

I have applied “subsetting if” and “where” clauses like below and they produced the same number of records in both table “want_1” and “want_2”.

1). Approach I - “subsetting if”

data want_1;

SET have;

if bank_number = 40;

run;

2). Approach II - “where”

data want_2;

SET have;

Where bank_number = 40;

run;

Question

However, SAS sometimes produce correct results for wrong reasons if we do not use it with full knowledge what is happening inside. This happened to me many times.

So, could any expert confirm me if the Approach I and II above produce same results always?

Thank you

Mirisage

1 ACCEPTED SOLUTION

Accepted Solutions
Linlin
Lapis Lazuli | Level 10

if variable bank_number is already in dataset have then if and where would produce the same result.

if variable bank_number is created in the same data step then only if works.

example:

data class;
  set sashelp.class;
  if age <14 then group=1;
    else group=2;
  if group=1;/* you can't use "where" */
  proc print;run;
 
                   Obs    Name       Sex    Age    Height    Weight    group

                      1    Alice       F      13     56.5       84.0      1
                      2    Barbara     F      13     65.3       98.0      1
                      3    James       M      12     57.3       83.0      1
                      4    Jane        F      12     59.8       84.5      1
                      5    Jeffrey     M      13     62.5       84.0      1
                      6    John        M      12     59.0       99.5      1
                      7    Joyce       F      11     51.3       50.5      1
                      8    Louise      F      12     56.3       77.0      1
                      9    Robert      M      12     64.8      128.0      1
                     10    Thomas      M      11     57.5       85.0      1

View solution in original post

12 REPLIES 12
Linlin
Lapis Lazuli | Level 10

if variable bank_number is already in dataset have then if and where would produce the same result.

if variable bank_number is created in the same data step then only if works.

example:

data class;
  set sashelp.class;
  if age <14 then group=1;
    else group=2;
  if group=1;/* you can't use "where" */
  proc print;run;
 
                   Obs    Name       Sex    Age    Height    Weight    group

                      1    Alice       F      13     56.5       84.0      1
                      2    Barbara     F      13     65.3       98.0      1
                      3    James       M      12     57.3       83.0      1
                      4    Jane        F      12     59.8       84.5      1
                      5    Jeffrey     M      13     62.5       84.0      1
                      6    John        M      12     59.0       99.5      1
                      7    Joyce       F      11     51.3       50.5      1
                      8    Louise      F      12     56.3       77.0      1
                      9    Robert      M      12     64.8      128.0      1
                     10    Thomas      M      11     57.5       85.0      1

art297
Opal | Level 21

Someone will have to correct me if I'm wrong but, in the example you provided, I think that the two methods WILL always produce the same result.

And, the paper that Fareeza referenced and example that Linlin gave, the paper and example didn't take into account the where data step option.  e.g.,

data class (where=(group eq 1));

  set sashelp.class;

  if age <14 then group=1;

    else group=2;

run;

Reeza
Super User

And also a where in a data step that has a merge doesn't always do what I think it should sometimes.

The overall answer to your question is 'it depends on the situation'.

Astounding
PROC Star

In your examples, you get the same result as long as BANK_NUMBER already exists in your incoming data set, and as long as it is defined as being numeric.

The problem is really that you may add features to your DATA step without really understanding how the results could suddenly become different.  For example, here are two common situations.  If you add END= on the SET statement, or if you add a BY statement, the results could very easily be different.  While you are relatively safe for this particular example that you posted, those simple changes will suddenly produce different results.  Moreover, you may not even notice that the results would be different because they might depend on the data.  For some versions of HAVE, adding END= or BY still produces identical results.  For other versions of HAVE, the results could be different.  There is no substitute for understanding how IF and WHERE work.

Good luck.

mkeintz
PROC Star

The "where" clause is applied in the data engine.  Think of it as being applied BEFORE the data step sees the data - so cases that don't qualify are never in the PDV, which is why this won't work:

data want;
   set sashelp.class;

   hw=height*weight;

   where hw>0;
run;

But an "if hw>0" would work.  It doesn't get invoked until data is in the PDV.  This understanding also help to explain the impact on options like FIRSTOBS and OBS in the presence of WHERE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User

One more thing, WHERE is always faster than IF .not only it pick up obs before obs enter PDV (not like IF,it is after obs enter PDV) ,but also WHERE can use INDEX for you .

Ksharp

Haikuo
Onyx | Level 15

Quote from Keshan: 'WHERE is always faster than IF'.

NOT always, especially when dealing huge data set. In my line of work (Part D claim data), multimillion level of records is the routine. Depending on the data retrieval rate, I found 'if' is faster than 'where' in many cases, sometimes a lot faster if the rate is a bit extreme. 'where' bears an edge when the retrieval rate is less than 2-3%. (sounds familiar? reminding me of  index)

Here is a example (95% data retrieval), please note, when data structure is more complicated (more variables, more variable types and length), the difference tends to be larger:

40   DATA test;

41   do i=1 to 100000000;

42   output;

43   end;

44   run;

NOTE: The data set WORK.TEST has 100000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

      real time           7.95 seconds

      cpu time            5.29 seconds

45

46   data h1;

47   set test;

48   where i>5000000;

49   run;

NOTE: There were 95000000 observations read from the data set WORK.TEST.

      WHERE i>5000000;

NOTE: The data set WORK.H1 has 95000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

      real time           20.37 seconds

      cpu time            16.98 seconds

50

51   data h2;

52   set test;

53   if i>5000000;

54   run;

NOTE: There were 100000000 observations read from the data set WORK.TEST.

NOTE: The data set WORK.H2 has 95000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

      real time           11.82 seconds

      cpu time            10.20 seconds

Haikuo

art297
Opal | Level 21

haikuo: I get similar results with just 1,000,000 records (i.e., if being faster than where) .. even with a one variable indexed file.

Haikuo
Onyx | Level 15

Yes, Art, I agree. Difference gets obvious when large data set involved. In general, it is a matter of personal preference, while I tremendously prefer 'if'. My major reason is that 'if' makes my life a lot easier, for 'if' can be used with '_n_', last-first.variable and 'by' statement, while 'where' can't. To me, the only case I use 'where' (besides procs of course) is that I have to use 'like' operator, which is really rare.

Haikuo

GeoffreyBrent
Calcite | Level 5

The two code examples you provide should give identical results, though WHERE may be faster to run.

However there are many cases where WHERE and subsetting IF do not give identical output. Compare:

data test;

length x $5;

input x;

cards;

One

Two

Three

Four

Five

;

run;

data subset_if;

set test;

sequence_number=_n_;

previous_x=lag(x);

if not(x="Three");

run;

data subset_where;

set test;

sequence_number=_n_;

previous_x=lag(x);

where not(x="Three");

run;

In creating SUBSET_IF, we read in and process all lines from TEST before dropping the third observation. Even though the third observation isn't output directly, it still affects what is output: sequence_number goes 1,2,4,5 and previous_x goes " ", "One", "Three", "Four".

In creating SUBSET_WHERE, the observation with x="Three" is deleted at an earlier stage, and there's no sign in the output that it ever existed.

Mirisage
Obsidian | Level 7

Hi Everyone,

Thank everyone of you for your time and expertise.

This generated a tremondous knowledge base on the subject.

Best regards

Mirisage

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 2184 views
  • 8 likes
  • 9 in conversation