If I have a dataset with variables a and b and I sort by a then b how do I filter for the last entry of b for each a. My assumption is that I would have some sort statement like
where ab is some sort of cartesian product of a and b. But I don't know how to do this. Any thoughts?
Reading the original post and "...filter for the last entry of b for each a", I would say that the BY statement needs to reflect BY A B; and thus the IF test would be IF LAST.B to generate the desired filter logic.
Now note that since b is changing, last.b is always true. In fact last. is only meaningfull if there is a sort order below it. So since each group of A is sorted by B, the way to find the highest value of B within A is to take last.A.
As is the case with your data example, I made no assumption that there is only one unique B value observation within a give A "set", even after sorting. He wanted the last occurrence of B for each unique A value. So, there could be more than one observation with a given B value within A -- a condition that would need to be addressed with the IF LAST.B test.
Also, there may be usefulness with the additional NOSORTED parameter while using BY and FIRST. and/or LAST..
For some circumstances, I also find consecutive PROC SORT invocations, with the last using EQUALS NODUPKEY -- may or may not apply - just mentioning it for consideration (now or later).
Note now Obs 3 and 4 have the same value of b. Again lats.a gives the correct result. Last.b gives the highest value of c within a given b value.
Last.x is the observation of group of x sorted on whatever is next in the by value.
Yes using NODUPKEY solves this, but it is still without question last.A which is correct.
Let's take the DATA step below (using your "modified" WORK.ONE file, having multiple-value B variable instances) -- and given Flip's premise that only filtering on the condition is sufficient:
which will generate two observations (each unique A value) as compared to the OP request for each unique B within each A condition.
Observe the SAS log below with the resulting SAS variables while executing the DATA step with the SET and BY A B; -- the LAST.B condition is clearly "TRUE" more frequently than the LAST.A condition is "TRUE".
So, I submit that if the objective is to yield each unique occurrence of both A and B, the code below will generate the combination values in WORK.TWO:
BY A B;
From this example data representation, the accurate observation count in WORK.TWO should be 4, whereas having the IF statement:
will only generate two (2) observations in SAS file WORK.TWO.
11 data two;
12 set one;
13 by a b;
14 putlog _all_;
15 if last.b;
a=1 b=1 c=1 FIRST.a=1 LAST.a=0 FIRST.b=1 LAST.b=0 _ERROR_=0 _N_=1
a=1 b=1 c=2 FIRST.a=0 LAST.a=0 FIRST.b=0 LAST.b=1 _ERROR_=0 _N_=2
a=1 b=2 c=1 FIRST.a=0 LAST.a=0 FIRST.b=1 LAST.b=0 _ERROR_=0 _N_=3
a=1 b=2 c=2 FIRST.a=0 LAST.a=1 FIRST.b=0 LAST.b=1 _ERROR_=0 _N_=4
a=2 b=1 c=1 FIRST.a=1 LAST.a=0 FIRST.b=1 LAST.b=0 _ERROR_=0 _N_=5
a=2 b=1 c=2 FIRST.a=0 LAST.a=0 FIRST.b=0 LAST.b=1 _ERROR_=0 _N_=6
a=2 b=2 c=1 FIRST.a=0 LAST.a=0 FIRST.b=1 LAST.b=0 _ERROR_=0 _N_=7
a=2 b=2 c=2 FIRST.a=0 LAST.a=1 FIRST.b=0 LAST.b=1 _ERROR_=0 _N_=8
NOTE: There were 8 observations read from the data set WORK.ONE.
NOTE: The data set WORK.TWO has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
I think that you and I are reading the OP differently. He states "how do I filter for the last entry of b for each a" where I am taking this to mean the highest value of B within A, you are reading it as The last value of a given B within A based on some ordering which is not mentioned.
The point I am trying to make is demonstrated below. I purposefully here do not sort on C. As you see the observation getting last.b is not related to any sort order, and thus can be the 'wrong' observation.
As you often state, clarity in a request is of great importance in determining te correct responce.
do a = 1 to 2;
do b=1 to 2;
do c=1 to 2;
proc sort data = one out = two;
by a b descending c;
set one two;
proc sort data = one out = two;
by a b ;
by a b;
la = last.a;
lb = last.b;
The SAS System 16:00 Sunday, August 30, 2009 1
Humbly, let's observe the initial thread question. Here's the first sentence from the OP:
"If I have a dataset with variables a and b and I sort by a then b how do I filter for the last entry of b for each a. "
Of course, I presume that he wants the "last entry" of each B variable value captured for output (which would be the default, since he didn't mention DESCENDING B as a sort criteria specification), with each A variable value (also a contributing "break" variable by definition). The SAS variable B is mentioned as a contributing data-variable to the sort order.
Again, with your second data-set as a decent sample to demonstrate the importance of testing IF LAST.B; , you will either get:
- four output observations: one obs for each unique A variable value and unique B variable value.
- two observations: one for each unique A variable value - oh, by the way, the value of B will be whatever occurred on the last/highest observation of A.
Clearly, data dependent, but the lack of testing: IF LAST.B; could yield incomplete output data.
If you believe that there should only be 2 observations - but the important point here is the need to have a "break" test (for output/filter) selection that includes variable B as well as variable A. That is where we have a difference of technical opinion, I believe.
Oh, just for the fun of discussion, a couple of points.
SBB: "- two observations: one for each unique A variable value - oh, by the way, the value of B will be whatever occurred on the last/highest observation of A."
Actually the value of B would be the maximum value of B within a given A. Last.A order is driven by the sort order of B within A. It will return any old value of C, which is my point.
OP: "If I have a dataset with variables a and b and I sort by a then b"
with only A and B in the data set. sorted by A and B with multiple equivalent values of B within a given A, the statements first.B, last.B, anyold.B would give identical records from A-B pairs.
The reason I stress this is that I have seen several programmers write code like this when trying to get the record with the maximum value of B within A and got the result you are talking about instead. And yes, I can see that the way you are reading the OP could be what he intended, I did not read it that way.