Solved: Re: Do “subsetting if” and “where” clauses do the same thing?

Mirisage · Posted 08-13-2012 03:24 PM

Hi SAS Community,

I have applied “subsetting if” and “where” clauses like below and they produced the same number of records in both table “want_1” and “want_2”.

1). Approach I - “subsetting if”

data want_1;

SET have;

if bank_number = 40;

run;

2). Approach II - “where”

data want_2;

SET have;

Where bank_number = 40;

run;

Question

However, SAS sometimes produce correct results for wrong reasons if we do not use it with full knowledge what is happening inside. This happened to me many times.

So, could any expert confirm me if the Approach I and II above produce same results always?

Thank you

Mirisage

Linlin · Posted 08-13-2012 03:36 PM

if variable bank_number is already in dataset have then if and where would produce the same result.

if variable bank_number is created in the same data step then only if works.

example:

data class;
set sashelp.class;
if age <14 then group=1;
else group=2;
if group=1;/* you can't use "where" */
proc print;run;

Obs Name Sex Age Height Weight group

                      1    Alice       F      13     56.5       84.0      1
                      2    Barbara     F      13     65.3       98.0      1
                      3    James       M      12     57.3       83.0      1
                      4    Jane        F      12     59.8       84.5      1
                      5    Jeffrey     M      13     62.5       84.0      1
                      6    John        M      12     59.0       99.5      1
                      7    Joyce       F      11     51.3       50.5      1
                      8    Louise      F      12     56.3       77.0      1
                      9    Robert      M      12     64.8      128.0      1
                     10    Thomas      M      11     57.5       85.0      1

View solution in original post

Reeza · Posted 08-13-2012 03:35 PM

No they don't.

See the following paper for some examples/details:

http://www.lexjansen.com/wuss/2005/sas_solutions/sol_where_vs_if_statements.pdf

Linlin · Posted 08-13-2012 03:36 PM

if variable bank_number is already in dataset have then if and where would produce the same result.

if variable bank_number is created in the same data step then only if works.

example:

data class;
set sashelp.class;
if age <14 then group=1;
else group=2;
if group=1;/* you can't use "where" */
proc print;run;

Obs Name Sex Age Height Weight group

                      1    Alice       F      13     56.5       84.0      1
                      2    Barbara     F      13     65.3       98.0      1
                      3    James       M      12     57.3       83.0      1
                      4    Jane        F      12     59.8       84.5      1
                      5    Jeffrey     M      13     62.5       84.0      1
                      6    John        M      12     59.0       99.5      1
                      7    Joyce       F      11     51.3       50.5      1
                      8    Louise      F      12     56.3       77.0      1
                      9    Robert      M      12     64.8      128.0      1
                     10    Thomas      M      11     57.5       85.0      1

art297 · Posted 08-13-2012 03:51 PM

Someone will have to correct me if I'm wrong but, in the example you provided, I think that the two methods WILL always produce the same result.

And, the paper that Fareeza referenced and example that Linlin gave, the paper and example didn't take into account the where data step option. e.g.,

data class (where=(group eq 1));

set sashelp.class;

if age <14 then group=1;

else group=2;

run;

Reeza · Posted 08-13-2012 04:08 PM

And also a where in a data step that has a merge doesn't always do what I think it should sometimes.

The overall answer to your question is 'it depends on the situation'.

Astounding · Posted 08-13-2012 05:34 PM

In your examples, you get the same result as long as BANK_NUMBER already exists in your incoming data set, and as long as it is defined as being numeric.

The problem is really that you may add features to your DATA step without really understanding how the results could suddenly become different. For example, here are two common situations. If you add END= on the SET statement, or if you add a BY statement, the results could very easily be different. While you are relatively safe for this particular example that you posted, those simple changes will suddenly produce different results. Moreover, you may not even notice that the results would be different because they might depend on the data. For some versions of HAVE, adding END= or BY still produces identical results. For other versions of HAVE, the results could be different. There is no substitute for understanding how IF and WHERE work.

Good luck.

mkeintz · Posted 08-13-2012 06:25 PM

The "where" clause is applied in the data engine. Think of it as being applied BEFORE the data step sees the data - so cases that don't qualify are never in the PDV, which is why this won't work:

data want;
set sashelp.class;

hw=height*weight;

where hw>0;
run;

But an "if hw>0" would work. It doesn't get invoked until data is in the PDV. This understanding also help to explain the impact on options like FIRSTOBS and OBS in the presence of WHERE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 08-14-2012 12:24 AM

One more thing, WHERE is always faster than IF .not only it pick up obs before obs enter PDV (not like IF,it is after obs enter PDV) ,but also WHERE can use INDEX for you .

Ksharp

Haikuo · Posted 08-14-2012 04:10 PM

Quote from Keshan: 'WHERE is always faster than IF'.

NOT always, especially when dealing huge data set. In my line of work (Part D claim data), multimillion level of records is the routine. Depending on the data retrieval rate, I found 'if' is faster than 'where' in many cases, sometimes a lot faster if the rate is a bit extreme. 'where' bears an edge when the retrieval rate is less than 2-3%. (sounds familiar? reminding me of index)

Here is a example (95% data retrieval), please note, when data structure is more complicated (more variables, more variable types and length), the difference tends to be larger:

40 DATA test;

41 do i=1 to 100000000;

42 output;

43 end;

44 run;

NOTE: The data set WORK.TEST has 100000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 7.95 seconds

cpu time 5.29 seconds

45

46 data h1;

47 set test;

48 where i>5000000;

49 run;

NOTE: There were 95000000 observations read from the data set WORK.TEST.

WHERE i>5000000;

NOTE: The data set WORK.H1 has 95000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 20.37 seconds

cpu time 16.98 seconds

50

51 data h2;

52 set test;

53 if i>5000000;

54 run;

NOTE: There were 100000000 observations read from the data set WORK.TEST.

NOTE: The data set WORK.H2 has 95000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 11.82 seconds

cpu time 10.20 seconds

Haikuo

art297 · Posted 08-14-2012 04:41 PM

haikuo: I get similar results with just 1,000,000 records (i.e., if being faster than where) .. even with a one variable indexed file.

Haikuo · Posted 08-14-2012 05:17 PM

Yes, Art, I agree. Difference gets obvious when large data set involved. In general, it is a matter of personal preference, while I tremendously prefer 'if'. My major reason is that 'if' makes my life a lot easier, for 'if' can be used with '_n_', last-first.variable and 'by' statement, while 'where' can't. To me, the only case I use 'where' (besides procs of course) is that I have to use 'like' operator, which is really rare.

Haikuo

GeoffreyBrent · Posted 08-14-2012 09:04 PM

The two code examples you provide should give identical results, though WHERE may be faster to run.

However there are many cases where WHERE and subsetting IF do not give identical output. Compare:

data test;

length x $5;

input x;

cards;

One

Two

Three

Four

Five

;

run;

data subset_if;

set test;

sequence_number=_n_;

previous_x=lag(x);

if not(x="Three");

run;

data subset_where;

set test;

sequence_number=_n_;

previous_x=lag(x);

where not(x="Three");

run;

In creating SUBSET_IF, we read in and process all lines from TEST before dropping the third observation. Even though the third observation isn't output directly, it still affects what is output: sequence_number goes 1,2,4,5 and previous_x goes " ", "One", "Three", "Four".

In creating SUBSET_WHERE, the observation with x="Three" is deleted at an earlier stage, and there's no sign in the output that it ever existed.

Mirisage · Posted 08-15-2012 02:55 PM

Hi Everyone,

Thank everyone of you for your time and expertise.

This generated a tremondous knowledge base on the subject.

Best regards

Mirisage

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away