Solved: How to find/flag the second-to-last observation by ID

N0o9r5a · Posted 06-20-2021 11:34 PM

Hello everyone, I have a question about how to find/flag the second-to-last observation by the same ID.

The dataset I have like this:

ID	Assessment date	Status	Status_new1	Status_new2
1	2012-06-19	No	Yes	Yes
1	2012-06-20	Yes	Yes	Yes
1	2012-06-21	No	Yes	Yes
2	2012-06-22	.	No	No
2	2012-06-23	No	No	No
2	2012-06-24	.	No	No
2	2012-06-25	No	No	No
3	2012-06-19	Unknown	Yes	Yes
3	2012-06-20	Yes	Yes	Yes
3	2012-06-21	No	Yes	Yes
3	2012-06-22	.	Yes	Yes
3	2012-06-23	No	Yes	Yes
4	2012-06-01	Unknown	No	Unknown
4	2012-06-03	.	No	Unknown
4	2012-06-05	No	No	Unknown
5	2012-06-08	No	No	Unknown

The observations that I would like to find/flag is where I used red color:

ID	Assessment date	Status	Status_new1	Status_new2	count
1	2012-06-19	No	Yes	Yes	1
1	2012-06-20	Yes	Yes	Yes	2
1	2012-06-21	No	Yes	Yes	3
2	2012-06-22	.	No	No	1
2	2012-06-23	No	No	No	2
2	2012-06-24	.	No	No	3
2	2012-06-25	No	No	No	4
3	2012-06-19	Unknown	Yes	Yes	1
3	2012-06-20	Yes	Yes	Yes	2
3	2012-06-21	No	Yes	Yes	3
3	2012-06-22	.	Yes	Yes	4
3	2012-06-23	No	Yes	Yes	5
4	2012-06-01	Unknown	No	Unknown	1
4	2012-06-03	.	No	Unknown	2
4	2012-06-05	No	No	Unknown	3
5	2012-06-08	No	No	Unknown	1

I am able to create a new variable, named "count" to get a serial number by ID. But I don't know how to find the second to last observation, because the obs are not fixed. Some IDs could only have 1 ob, some IDs could 3 to 10 obs. If there is a way that can create a new variable to flag every second-to-last observation by ID in the same dataset, that would be great!

Thank you for your help!

ChrisNZ · Posted 06-21-2021 04:17 AM

Another way to write @mkeintz 's clever logic.

data WANT (drop=_:);
  set HAVE (in=FIRSTPASS) 
      HAVE ;
  by ID;
  if FIRSTPASS then do;
    if first.ID then call missing(_N_FIRSTPASS,_N_SECONDPASS);
    _N_FIRSTPASS+1;
    delete;
  end;
  _N_SECONDPASS+1; 
  FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1);
run;

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 06-21-2021 12:17 AM

Like this? [Updated to cater for last group]

data HAVE; 
  input ID;
cards;
1 
1 
1 
1 
2 
2 
2 
2 
3
4
4
;
data WANT;
  set HAVE nobs=NOBS;
  by ID;
  if _N_<=NOBS-2 then set HAVE(keep=ID rename=(ID=ID2) firstobs=3);
  else ID2=.;
  if ID ne ID2 and ^last.ID then FLAG=1;
run;

ID	FLAG
1	.
1	.
1	1
1	.
2	.
2	.
2	1
2	.
3	.
4	1
4	.

High-Performance SAS Coding - Third Edition

N0o9r5a · Posted 06-21-2021 03:07 PM

Hello @ChrisNZ , I just tried your code, it works perfectly for subjects who had more than 1 obs, it doesn't work for subjects who only had 1 ob. It's good enough for me! Thank you!

mkeintz · Posted 06-21-2021 05:00 PM

This is why it's a good idea to clearly describe what you want, instead of just a general statement ("second to last"), with sample data whose treatment is not in the description. I now see for ID=5 that you want to flag singletons. So change @ChrisNZ 's code from

  FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1) ;

to

  FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1) or _N_FIRSTPASS=1;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

N0o9r5a · Posted 06-21-2021 07:45 PM

Hello @mkeintz , I just tried your code based on @ChrisNZ modified version. It worked perfectly! It also worked for subjects who only had one ob. Thank you for your help! I really appreciate it!

ChrisNZ · Posted 06-22-2021 02:35 AM

> it doesn't work for subjects who only had 1 ob.

It works exactly as you asked. As @mkeintz said, make sure your requirements are described properly when asking a question.

Glad you got what you need at the end.

High-Performance SAS Coding - Third Edition

mkeintz · Posted 06-21-2021 12:31 AM

Read all observations for each ID twice - set have (in=firstpass) have (in=secondpass); by id;. The first pass to get a total count of observations (_n_firstpass). The second time to generate the COUNT variable, set a flag when COUNT=_N_FIRSTPASS-1, and output.

This code is untested, in the absence of sample data presented in the form of a working data step:

data want (drop=_:);
  set have (in=firstpass)  have (in=secondpass);
  by id;
  if first.id then call missing(_n_firstpass,count);
  if firstpass then _n_firstpass+1;

  if secondpass;
  count+1;
  /* superseded by below: flag=ifn(count=_n_firstpass-1,1,0) ; */
  flag=ifn(count=_n_firstpass-1,1,0) or _n_firstpass=1;
run;

The single statement

  set have (in=firstpass)  have (in=secondpass);

would read all of HAVE, and then reread all of HAVE. But combining it with a BY statement, as in:

  set have (in=firstpass)  have (in=secondpass);
  by id;

would read one ID group from have, and then re-read the same group. Then it would proceed to the second group.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ChrisNZ · Posted 06-21-2021 04:17 AM

Another way to write @mkeintz 's clever logic.

data WANT (drop=_:);
  set HAVE (in=FIRSTPASS) 
      HAVE ;
  by ID;
  if FIRSTPASS then do;
    if first.ID then call missing(_N_FIRSTPASS,_N_SECONDPASS);
    _N_FIRSTPASS+1;
    delete;
  end;
  _N_SECONDPASS+1; 
  FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1);
run;

High-Performance SAS Coding - Third Edition