I am trying to create a subset w/ outliers removed for multiple variables (Outliers are defined as > 1.5 x Q3 and < Q1 / 1.5). My approach is to use multiple WHERE statements, but I am not getting the desired result. I am open to other approaches, but I'm also curious why this syntax is not working.
DATA want;
SET have;
WHERE score1 BETWEEN (1.5*Q3_score1) AND (Q1_score1/1.5);
WHERE SAME AND score2 BETWEEN (1.5*Q3_score2) AND (Q1_score2/1.5);
WHERE SAME AND score3 BETWEEN (1.5*Q3_score3) AND (Q1_score3/1.5);
RUN;
When I run this code w/ only the first where statement, the max value for score1 is 290. When I run it w/ the first and second where statements, the max value for score1 changes to 300.
Aren't these statements independent? Why would one affect the other?
Thanks for your help.
Accepted Solutions
In a SAS data set, I don't think that's one of your possible choices. When you output an observation, all the variables are output. You can't change that from one observation to the next.
There are other things you can do. You can set out of range values to missing before you output. Or you can totally re-shape the data set along these lines:
ID Score_variable score_value
ABC score1 25
ABC score2 30
DEF score2 40
But there is no way to change the variables that get output from one observation to the next.
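Both suggestions can be sketched roughly as follows. This is only an illustration, assuming the quartile variables from the question (Q1_score1-Q1_score3, Q3_score1-Q3_score3) are already present and nonmissing on every row of have, and that an ID variable exists as in the layout above. The output names want and want_long are made up for the example.

```sas
/* Option 1: keep every observation, but blank out the out-of-range values */
data want;
   set have;
   array score{3} score1-score3;
   array q1{3} Q1_score1-Q1_score3;   /* assumed to exist on each row */
   array q3{3} Q3_score1-Q3_score3;   /* assumed to exist on each row */
   do i = 1 to 3;
      /* outlier rule from the question: > 1.5 x Q3 or < Q1 / 1.5 */
      if score{i} > 1.5*q3{i} or score{i} < q1{i}/1.5 then score{i} = .;
   end;
   drop i;
run;

/* Option 2: reshape to one row per ID and score variable, */
/* writing out only the values that are within range       */
data want_long;
   set have;
   array score{3} score1-score3;
   array q1{3} Q1_score1-Q1_score3;
   array q3{3} Q3_score1-Q3_score3;
   length score_variable $32;
   do i = 1 to 3;
      if q1{i}/1.5 <= score{i} <= 1.5*q3{i} then do;
         score_variable = vname(score{i});   /* e.g. "score1" */
         score_value = score{i};
         output;
      end;
   end;
   keep ID score_variable score_value;
run;
```

With option 1 each score is judged on its own, so an outlier in score2 no longer removes a perfectly good score1 value from the same observation.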
For those ones that get excluded - between 290 and 300, do they meet the other criteria for score2/score3?
You've used AND so the statements are not independent, all 3 conditions must be met.
If you want any of the 3, use OR.
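A minimal sketch of the OR version, assuming the variable names from the original post. The output name want_any is hypothetical, and the ranges are written as comparisons with the lower bound first rather than with BETWEEN.

```sas
data want_any;
   set have;
   /* keep the observation if at least one score is within its range */
   where (Q1_score1/1.5 <= score1 <= 1.5*Q3_score1)
      or (Q1_score2/1.5 <= score2 <= 1.5*Q3_score2)
      or (Q1_score3/1.5 <= score3 <= 1.5*Q3_score3);
run;
```

Note that this still keeps or drops whole observations. A row kept because score1 is in range will carry its score2 and score3 values along, whether or not they are outliers.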
@Reeza I want the "AND" in the BETWEEN...AND convention...What would be the correct syntax for making the individual WHERE statements independent from one another?
If it's AND you want, then say AND:
WHERE score1 BETWEEN (1.5*Q3_score1) AND (Q1_score1/1.5) AND
score2 BETWEEN (1.5*Q3_score2) AND (Q1_score2/1.5) AND
score3 BETWEEN (1.5*Q3_score3) AND (Q1_score3/1.5);
Forget the Where and use explicit IF. You'll save yourself a headache and future you will thank you.
You won't remember the details of this the next time you encounter it, and otherwise you will have to recheck everything. Or at least that's what I do when I see things like that in prod code. First check that it's doing 1) what you think it's doing, and 2) what the original programmer thought they were doing, which may or may not have been you.
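One way to read that advice is a single subsetting IF with every condition spelled out. This is a sketch only, assuming the variable names from the original post and that an observation should be kept only when all three scores are in range.

```sas
data want;
   set have;
   /* subsetting IF: the observation is kept only when the whole expression is true */
   if (Q1_score1/1.5 <= score1 <= 1.5*Q3_score1)
      and (Q1_score2/1.5 <= score2 <= 1.5*Q3_score2)
      and (Q1_score3/1.5 <= score3 <= 1.5*Q3_score3);
run;
```

Everything sits in one statement, so there is no question of one condition replacing or augmenting another.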
Didn't you get the following Note in the Log?
NOTE: WHERE clause has been replaced.
indicating that only the last WHERE clause matters?
@PGStats <NOTE: WHERE clause has been replaced.>
Yes. I saw this message, but didn't understand that it meant that "only the last WHERE clause matters".
Just to confirm, you're saying that all where statements prior to the last one are disregarded?
Do a little testing as I did and you will see that this is the case. I couldn't find it confirmed in the SAS documentation though.
You will need to inspect the exact wording in the note. When you use SAME AND in your WHERE clause, I would expect the note to say that the WHERE clause was AUGMENTED rather than REPLACED.
Run this:
data test;
set sashelp.class;
where sex="M";
where sex="F";
run;
proc print; run;
@PGStats Thanks. The point is made clearly w/ that code. Only the last WHERE statement is applied. In that example both WHERE statements refer to the same variable, whereas in mine the variables are different. Regardless, the outcome is the same when I change the code to:
data test;
set sashelp.class;
where sex="M";
where age lt 15;
run;
proc print; run;
Adding the WHERE-SAME-AND statement eliminates the problem, but this is not what I'm after.
data test;
set sashelp.class;
where sex="M";
where same and age lt 15;
run;
Is there a different way to use multiple WHERE statements when subsetting? Should I choose an entirely different approach?
@Astounding Yes, "augmented".
<NOTE: WHERE clause has been augmented.>
Can you translate this note for me?
Augmented: The conditions from the first WHERE statement are still in effect, and the conditions from the second WHERE statement are being added as an additional set of conditions.
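To illustrate with sashelp.class, the two steps below should select the same observations. The data set names are made up for the example, and the log comment reflects the note quoted earlier in this thread.

```sas
/* Two WHERE statements, the second one augmenting the first */
data boys_under15;
   set sashelp.class;
   where sex = "M";
   where same and age lt 15;   /* log: NOTE: WHERE clause has been augmented. */
run;

/* The same conditions written as a single WHERE statement */
data boys_under15_b;
   set sashelp.class;
   where sex = "M" and age lt 15;
run;
```

Without SAME AND, the second WHERE replaces the first, and only age lt 15 would be applied.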
Are you sure you didn't get the results mixed up? It would make all the sense in the world to get a maximum of 300 with just one WHERE statement, but a maximum of 290 when you add a second WHERE statement. The second WHERE statement would remove a few more observations, which could include the one that has the value of 300.
@Astounding Yes, I did get them mixed up. Sorry about that.
I guess I'm confused about how to subset w/o narrowing the dataset to meet the conditions in all the prior WHERE statements.
I want a dataset that contains the values for each variable that fall between the bounds given in the BETWEEN...AND condition. And I want the WHERE statements to be independent of each other.
In other words, I want all the values of score1 included if they are between 1.5*Q3 and Q1/1.5. And then separately, I want all the values of score2 included if they are between the same bounds for score2, etc.
Any suggestions?
Thanks for your help.