I am trying to create a subset w/ outliers removed for multiple variables (outliers are defined as values > 1.5 x Q3 or < Q1 / 1.5). My approach is to use multiple WHERE statements, but I am not getting the desired result. I am open to other approaches, but I'm also curious why this syntax is not working.
DATA want;
SET have;
WHERE score1 BETWEEN (1.5*Q3_score1) AND (Q1_score1/1.5);
WHERE SAME AND score2 BETWEEN (1.5*Q3_score2) AND (Q1_score2/1.5);
WHERE SAME AND score3 BETWEEN (1.5*Q3_score3) AND (Q1_score3/1.5);
RUN;
When I run this code w/ only the first where statement, the max value for score1 is 290. When I run it w/ the first and second where statements, the max value for score1 changes to 300.
Aren't these statements independent? Why would one affect the other?
Thanks for your help.
In a SAS data set, I don't think that's one of your possible choices. When you output an observation, all the variables are output. You can't change that from one observation to the next.
There are other things you can do. You can set out of range values to missing before you output. Or you can totally re-shape the data set along these lines:
ID Score_variable score_value
ABC score1 25
ABC score2 30
DEF score2 40
But there is no way to change the variables that get output from one observation to the next.
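The "set out of range values to missing" option could look something like the sketch below. It assumes, as in the original post, that the quartile variables (Q1_score1, Q3_score1, and so on) are already present on every observation of have, and that the intended range runs from Q1/1.5 up to 1.5*Q3. The array names sc, q1v, and q3v are made up for illustration.

```sas
data want;
   set have;
   /* Hypothetical array names; the variables are the ones from the post */
   array sc{3}  score1-score3;
   array q1v{3} Q1_score1-Q1_score3;
   array q3v{3} Q3_score1-Q3_score3;
   do i = 1 to dim(sc);
      /* Blank out a score that falls outside its own bounds,
         leaving the other scores on the same observation untouched */
      if not (q1v{i}/1.5 <= sc{i} <= 1.5*q3v{i}) then call missing(sc{i});
   end;
   drop i;
run;
```

Every observation is kept, so each score is judged only against its own limits, and procedures that skip missing values will simply ignore the blanked-out ones.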
For those ones that get excluded - between 290 and 300, do they meet the other criteria for score2/score3?
You've used AND so the statements are not independent, all 3 conditions must be met.
If you want any of the 3, use OR.
@Reeza I want the "AND" in the BETWEEN...AND convention...What would be the correct syntax for making the individual WHERE statements independent from one another?
If it's AND you want, then say AND:
WHERE score1 BETWEEN (1.5*Q3_score1) AND (Q1_score1/1.5) AND
score2 BETWEEN (1.5*Q3_score2) AND (Q1_score2/1.5) AND
score3 BETWEEN (1.5*Q3_score3) AND (Q1_score3/1.5);
Forget the Where and use explicit IF. You'll save yourself a headache and future you will thank you.
You won't remember the details of this the next time you encounter it and will have to recheck everything otherwise. Or at least that's what I do when I see things like that in prod code. First check to see it's doing 1) what you think it's doing, 2) what the original programmer thought they were doing - which may or may not have been you.
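For reference, an explicit subsetting IF version might look like the following. This is only a sketch: it assumes the quartile variables are on every observation of have and that the intended range is Q1/1.5 through 1.5*Q3. Like the WHERE SAME AND version, it keeps an observation only when all three scores are in range.

```sas
data want;
   set have;
   /* Subsetting IF: the observation is output only if all three conditions hold */
   if (Q1_score1/1.5 <= score1 <= 1.5*Q3_score1)
      and (Q1_score2/1.5 <= score2 <= 1.5*Q3_score2)
      and (Q1_score3/1.5 <= score3 <= 1.5*Q3_score3);
run;
```

A single IF states the whole rule in one place, so there is no question about whether an earlier condition was replaced or augmented.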
Didn't you get the following Note in the Log?
NOTE: WHERE clause has been replaced.
indicating that only the last WHERE clause matters?
@PGStats <NOTE: WHERE clause has been replaced.>
Yes. I saw this message, but didn't understand that it meant that "only the last WHERE clause matters".
Just to confirm, you're saying that all where statements prior to the last one are disregarded?
Do a little testing as I did and you will see that this is the case. I couldn't find it confirmed in the SAS documentation though.
You will need to inspect the exact wording in the note. When you use SAME AND in your WHERE clause, I would expect the note to say that the WHERE clause was AUGMENTED rather than REPLACED.
Run this:
data test;
set sashelp.class;
where sex="M";
where sex="F";
run;
proc print; run;
@PGStats Thanks. That code makes the point clearly. Only the last WHERE statement is applied. In that example both WHERE statements refer to the same variable, whereas in mine the variables are different. Regardless, the outcome is the same when I change the code to:
data test; set sashelp.class; where sex="M"; where age lt 15; run; proc print; run;
Adding the WHERE-SAME-AND statement eliminates the problem, but this is not what I'm after.
data test; set sashelp.class; where sex="M"; where same and age lt 15; run;
Is there a different way to use multiple WHERE statements when subsetting? Should I choose an entirely different approach?
@Astounding Yes, "augmented".
<NOTE: WHERE clause has been augmented.>
Can you translate this note for me?
Augmented: The conditions from the first WHERE statement are still in effect, and the conditions from the second WHERE statement are being added as an additional set of conditions.
Are you sure you didn't get the results mixed up? It would make all the sense in the world to get a maximum of 300 with just one WHERE statement, but a maximum of 290 when you add a second WHERE statement. The second WHERE statement would remove a few more observations, which could include the one that has the value of 300.
@Astounding Yes, I did get them mixed up. Sorry about that.
I guess I'm confused about how to subset w/o narrowing the dataset to meet the conditions in all the prior WHERE statements.
I want a dataset that contains the values for each variable between the parameters outlined in the BETWEEN...AND statement. And I want the WHERE statements to be independent of each other.
In other words, I want all the values of score1 included if they are between 1.5*Q3 and Q1/1.5. And then separately, I want all the values of score2 included if they are between the same parameters for score2, etc.
Any suggestions?
Thanks for your help.
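One way to get that kind of independent filtering is the long layout suggested earlier in the thread (ID, Score_variable, score_value), where each score gets its own row and is kept or dropped on its own. The sketch below is untested against the real data and rests on a few assumptions: have contains an identifier variable named ID, the quartile variables are on every observation, and the intended range is Q1/1.5 through 1.5*Q3. The array names and the data set name want_long are made up for illustration.

```sas
data want_long;
   set have;
   /* Hypothetical array names; the variables are the ones from the post */
   array sc{3}  score1-score3;
   array q1v{3} Q1_score1-Q1_score3;
   array q3v{3} Q3_score1-Q3_score3;
   length Score_variable $32;
   do i = 1 to dim(sc);
      /* Write one row per in-range score; out-of-range scores are skipped
         without affecting the other scores for the same ID */
      if q1v{i}/1.5 <= sc{i} <= 1.5*q3v{i} then do;
         Score_variable = vname(sc{i});
         score_value = sc{i};
         output;
      end;
   end;
   keep ID Score_variable score_value;
run;
```

Because each row holds a single score, dropping an outlier for score1 no longer removes that observation's score2 and score3 values.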