I understand that WHERE and IF IN can produce different results depending on where each statement is used. However, one of the questions on the (official) SAS 9.4 Certification Practice Exam has me puzzled. The question read:
This project will use data set cert.input36. At any time, you may save your program as program36 in cert\programs.
Write a SAS program that will clean the data in cert.input36 as follows:
- Step 1:
- create a temporary data set, cleandata36.
- In this data set, convert all group values to upper case.
- Then keep only observations with group equal to 'A' or 'B'.
- Step 2:
- Determine the MEDIAN value for the Kilograms variable for each group (A,B) in the cleandata36 data set. Round MEDIAN to the nearest whole number.
- Step 3:
- create results.output36 from cleandata36
- Ensure that all values for variable Kilograms are between 40 and 200, inclusively.
- If the value is missing or out of range, replace the value with the MEDIAN Kilograms value for the respective group (A,B) calculated in step 2.
Run the program and use the results to answer the next 3 questions.
Their suggested code:
data work.cleandata36;
set cert.input36;
group=upcase(group);
if group in ('A','B');
run;
proc means data=work.cleandata36 median;
class group;
var kilograms;
run;
data results.output36;
set cleandata36;
if Kilograms < 40 or Kilograms > 200 then do;
if group='A' then kilograms=79;
else kilograms=89;
end;
run;
proc contents data=results.output36;
run;
My code:
data work.cleandata36;
set cert.input36;
group = upcase(group);
where group = 'A' or group = 'B';
run;
proc means data=work.cleandata36 mean min max median maxdec=1;
class group;
run;
data results.output36;
set cleandata36;
if kilograms < 40 or kilograms > 200 then do;
if group = 'A' then kilograms = 79;
else kilograms = 89;
end;
if kilograms =. then do;
if group = 'A' then kilograms = 79;
else kilograms = 89;
end;
run;
proc print data = results.output36;
run;
proc means data=results.output36 mean min max maxdec=1;
class group;
run;
I've attached input36.sas7bdate in case anyone wanted to test.
When asked how many observations are in results.output36, my code (with the WHERE statement) resulted in 4897 observations. This was wrong, as their code (using IF IN) resulted in 4992. I tested my code using IF IN instead of WHERE and got the same result they did, 4992. But I'm not sure why. It would seem that both statements would subset to the same number of observations.
Also, their code is missing the portion that accounts for the instruction to replace missing values with the median -- I'm surprised they wouldn't include it in their suggested answer.
ah! It took me few times rereading the documentation, but I think light finally dawned:
WHERE subsetted the data before UPCASE happened ("The WHERE statement selects observations before they are brought into the program data vector. "), and so even though I changed all values of "group" to upper case, it happened after WHERE selected its observations. Thus, my WHERE subset is smaller because it didn't have access to all the instances of "group" where 'a' and 'b' had been converted to upper case.
IF and WHERE in a data step work differently. This is explained in the documentation: https://documentation.sas.com/doc/en/pgmmvacdc/9.4/lestmtsref/p1cxl8ifdt8u0gn12wqbji8o5fq1.htm#p08fp...
@TheresaM Here is a bit more of a hint: Notice this bullet from step 1:
Then learn the difference between how IF and WHERE process data using the resource from @PaigeMiller and I think you will see why your code produces a different result.
Thank you, @PaigeMiller and @Mark2010 . I'll take some more time to understand the documentation!
One way to explore this is to make a small test dataset, like:
data play ;
input group $1. ;
cards ;
A
B
C
D
.
a
b
c
d
;
Then you could use that as an input to your WHERE data step and your subsetting IF data step and see how the results differ.
ah! It took me few times rereading the documentation, but I think light finally dawned:
WHERE subsetted the data before UPCASE happened ("The WHERE statement selects observations before they are brought into the program data vector. "), and so even though I changed all values of "group" to upper case, it happened after WHERE selected its observations. Thus, my WHERE subset is smaller because it didn't have access to all the instances of "group" where 'a' and 'b' had been converted to upper case.
Bingo. Congrats on working through it.
You got it. So you could use WHERE, like:
where upcase(group) = 'A' or upcase(group) = 'B';
or:
where upcase(group) IN ('A' , 'B');
And it would select the same records as your IF. You would still need to use UPCASE inside the data step to upcase the values.
@TheresaM I suggest you select your own answer as the solution. @Mark2010 gave you a good hint, but you came up with the solution and wrote it as an answer. You also did a good job of phrasing the question clearly, and showing sample code. So if anyone finds it in the future, it would be good for them to see your answer as the solution. You should be able to click on Mark's answer and see an option to "unmark as correct" or something like that.
Hi @Quentin
Thank you for this. This actually made me really happy to read! Everything I thought I understood became much more of a struggle as I actually try to put it in practice (or answer practice exam questions). So, hearing that I actually "came up with a solution" gives me hope. 😊🎉
Hi Paige,
I thought I understood all those differences as outlined in the page you reference. However, the way I read all that -- it would seem that my use of WHERE would produce the larger subset (since it selects observations before they're brought into PDV), yet somehow IF IN subsetted a larger number of observations.
I know this is a (very) lagged response. But ...
I find it most useful to regard the WHERE statement (or where option as a dataset name parameter) as a process outsourced to the data engine.
That's why WHERE can be used in PROC's as well as DATA steps - unlike IF.
Hi @mkeintz ,
Thanks for this too! Where vs IF, when to use the = sign vs when not to, DO while vs DO until... so much to remember!! I know it will become easier to remember the more programs that I write, but until then, every tip helps. So thanks for responding to my post 🙂 🙂
This is a knowledge-sharing community for SAS Certified Professionals and anyone who wants to learn more about becoming SAS Certified. Ask questions and get answers fast. Share with others who are interested in certification and who are studying for certifications.To get the most from your community experience, use these getting-started resources:
Community Do's and Don'ts
How to add SAS syntax to your post
How to get fast, helpful answers
Ready to level-up your skills? Choose your own adventure.