Solved: Re: Find the last row with some condition

JillChen0131 · Posted 03-04-2022 02:40 PM

I would like to find the last row where testdate is earlier than date within each group. Thanks

Data like below:

ID TESTDATE DATE

A Nov 20, 2019 Nov 25, 2019

A Nov 30, 2019 Nov 25, 2019

A Dec 1, 2019 Nov 25, 2019

B Nov 1, 2019 Nov 18, 2019

B Nov 13, 2019 Nov 18, 2019

B Nov 24, 2019 Nov 18, 2019

mkeintz · Posted 03-04-2022 11:47 PM

If I understand correctly, and - I believe you confirmed - (1) data are sorted by id/testdate, (2) each id has a constant DATE, then you merely need to look ahead one obs to see whether the current obs satisfies the testdate<date condition and the next obs violates the condition:

data have;
input ID $  (TESTDATE DATE) (:date9.);
format TESTDATE DATE yymmdd10.;
datalines;
A   20nov2019  25nov2019
A   30nov2019  25nov2019
A   01dec2019  25nov2019
B   01nov2019  18nov2019
B   13nov2019  18nov2019
B   24nov2019  18nov2019
;
data want (drop=nxt_:);
  set have (keep=id);
  by id;
  merge have have (firstobs=2 keep=testdate rename=(testdate=nxt_testdate));
  if testdate<date and (last.id=1 or nxt_testdate>=date) then flag=1;
  else flag=0;
run;

By looking ahead (using firstobs=2 in the second argument of the MERGE statement), you avoid the need to re-merge.

Now if you don't have a constant DATE value for a given ID, then a simple lookahead doesn't work. Instead you have to read each ID twice - the first time to scan all obs and identify the last (physically last) record satisfying the condition. The second time to output the results with flag=1 in the correct observation. In this case the data need to be grouped only by ID, no secondary sort key needed:

data want (drop=_:);
  set have (in=firstpass) have (in=secondpass);
  by id;
  _nf+firstpass;
  retain _last_qualifier;
  if firstpass and testdate<date then _last_qualifier=_nf ;
  if secondpass;
  _ns+1;
  flag=(_ns=_last_qualifier);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PaigeMiller · Posted 03-04-2022 03:00 PM

Is date always the same for each record in an ID? Is the testdate always sorted in chronological order? Is "last row" the same as highest value of testdate?

If so

data want;
    set have(where=(testdate<date));
    by id;
    if last.id;
run;

If those assumptions are not valid, then please explain the organization of the data in a lot more detail.

--
Paige Miller

JillChen0131 · Posted 03-04-2022 03:18 PM

answer are all "YES" for your questions. What if I want to flag that record in the original dataset? Should I just merge the new dataset and original? I came out this way when I was doing the programming. Is there any other way without creating a new dataset and merge back?

PaigeMiller · Posted 03-04-2022 03:32 PM

Original data set needs a flag. 🏁 Okay

Yes, you could merge the data set from my code back in with the original data set (and then optionally delete the data set from my code). Or this:

proc sql;
    create table want as select a.*,a.testdate=max(b.testdate) as flag
    from have as a left join have(where=(testdate<date)) as b on a.id=b.id and a.testdate=b.testdate
    group by b.id
    order by a.id,a.testdate;
quit;

--
Paige Miller

Patrick · Posted 03-04-2022 06:51 PM

Or via a Sort and data step - but I'd be using the SQL as proposed by @PGStats

data have;
input ID $  (TESTDATE DATE) (:date9.);
format TESTDATE DATE yymmdd10.;
datalines;
A   20nov2019  25nov2019
A   30nov2019  25nov2019
A   01dec2019  25nov2019
B   01nov2019  18nov2019
B   13nov2019  18nov2019
B   24nov2019  18nov2019
;

proc sort data=have(where=(testdate<date)) out=want;
 by id date testdate;
run;

data want;
  set want;
  by id;
  if last.date;
run;

PaigeMiller · Posted 03-05-2022 05:12 AM

@Patrick wrote:

Or via a Sort and data step - but I'd be using the SQL as proposed by @PGStats

data have;
input ID $  (TESTDATE DATE) (:date9.);
format TESTDATE DATE yymmdd10.;
datalines;
A   20nov2019  25nov2019
A   30nov2019  25nov2019
A   01dec2019  25nov2019
B   01nov2019  18nov2019
B   13nov2019  18nov2019
B   24nov2019  18nov2019
;

proc sort data=have(where=(testdate<date)) out=want;
 by id date testdate;
run;

data want;
  set want;
  by id;
  if last.date;
run;

Already proposed, and furthermore the OP states that she wants all the records in the original data set and a new variable flag to indicate which is 1 if this is the record of interest and 0 otherwise.

--
Paige Miller

PGStats · Posted 03-04-2022 04:07 PM

Use the auto-remerge feature of proc SQL to do this with a single query:

data have;
input ID $  (TESTDATE DATE) (:date9.);
format TESTDATE DATE yymmdd10.;
datalines;
A   20nov2019  25nov2019
A   30nov2019  25nov2019
A   01dec2019  25nov2019
B   01nov2019  18nov2019
B   13nov2019  18nov2019
B   24nov2019  18nov2019
;

proc sql;
create table want as
select * 
from have
where testdate lt date
group by id
having testdate = max(testdate);
select * from want;
quit;

PG

mkeintz · Posted 03-04-2022 11:47 PM

If I understand correctly, and - I believe you confirmed - (1) data are sorted by id/testdate, (2) each id has a constant DATE, then you merely need to look ahead one obs to see whether the current obs satisfies the testdate<date condition and the next obs violates the condition:

data have;
input ID $  (TESTDATE DATE) (:date9.);
format TESTDATE DATE yymmdd10.;
datalines;
A   20nov2019  25nov2019
A   30nov2019  25nov2019
A   01dec2019  25nov2019
B   01nov2019  18nov2019
B   13nov2019  18nov2019
B   24nov2019  18nov2019
;
data want (drop=nxt_:);
  set have (keep=id);
  by id;
  merge have have (firstobs=2 keep=testdate rename=(testdate=nxt_testdate));
  if testdate<date and (last.id=1 or nxt_testdate>=date) then flag=1;
  else flag=0;
run;

By looking ahead (using firstobs=2 in the second argument of the MERGE statement), you avoid the need to re-merge.

Now if you don't have a constant DATE value for a given ID, then a simple lookahead doesn't work. Instead you have to read each ID twice - the first time to scan all obs and identify the last (physically last) record satisfying the condition. The second time to output the results with flag=1 in the correct observation. In this case the data need to be grouped only by ID, no secondary sort key needed:

data want (drop=_:);
  set have (in=firstpass) have (in=secondpass);
  by id;
  _nf+firstpass;
  retain _last_qualifier;
  if firstpass and testdate<date then _last_qualifier=_nf ;
  if secondpass;
  _ns+1;
  flag=(_ns=_last_qualifier);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Training: Just a Click Away