Hello everyone, I have a question about how to find/flag the second-to-last observation by the same ID.
The dataset I have like this:
ID | Assessment date | Status | Status_new1 | Status_new2 |
1 | 2012-06-19 | No | Yes | Yes |
1 | 2012-06-20 | Yes | Yes | Yes |
1 | 2012-06-21 | No | Yes | Yes |
2 | 2012-06-22 | . | No | No |
2 | 2012-06-23 | No | No | No |
2 | 2012-06-24 | . | No | No |
2 | 2012-06-25 | No | No | No |
3 | 2012-06-19 | Unknown | Yes | Yes |
3 | 2012-06-20 | Yes | Yes | Yes |
3 | 2012-06-21 | No | Yes | Yes |
3 | 2012-06-22 | . | Yes | Yes |
3 | 2012-06-23 | No | Yes | Yes |
4 | 2012-06-01 | Unknown | No | Unknown |
4 | 2012-06-03 | . | No | Unknown |
4 | 2012-06-05 | No | No | Unknown |
5 | 2012-06-08 | No | No | Unknown |
The observations that I would like to find/flag is where I used red color:
ID | Assessment date | Status | Status_new1 | Status_new2 | count |
1 | 2012-06-19 | No | Yes | Yes | 1 |
1 | 2012-06-20 | Yes | Yes | Yes | 2 |
1 | 2012-06-21 | No | Yes | Yes | 3 |
2 | 2012-06-22 | . | No | No | 1 |
2 | 2012-06-23 | No | No | No | 2 |
2 | 2012-06-24 | . | No | No | 3 |
2 | 2012-06-25 | No | No | No | 4 |
3 | 2012-06-19 | Unknown | Yes | Yes | 1 |
3 | 2012-06-20 | Yes | Yes | Yes | 2 |
3 | 2012-06-21 | No | Yes | Yes | 3 |
3 | 2012-06-22 | . | Yes | Yes | 4 |
3 | 2012-06-23 | No | Yes | Yes | 5 |
4 | 2012-06-01 | Unknown | No | Unknown | 1 |
4 | 2012-06-03 | . | No | Unknown | 2 |
4 | 2012-06-05 | No | No | Unknown | 3 |
5 | 2012-06-08 | No | No | Unknown | 1 |
I am able to create a new variable, named "count" to get a serial number by ID. But I don't know how to find the second to last observation, because the obs are not fixed. Some IDs could only have 1 ob, some IDs could 3 to 10 obs. If there is a way that can create a new variable to flag every second-to-last observation by ID in the same dataset, that would be great!
Thank you for your help!
Another way to write @mkeintz 's clever logic.
data WANT (drop=_:);
set HAVE (in=FIRSTPASS)
HAVE ;
by ID;
if FIRSTPASS then do;
if first.ID then call missing(_N_FIRSTPASS,_N_SECONDPASS);
_N_FIRSTPASS+1;
delete;
end;
_N_SECONDPASS+1;
FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1);
run;
Like this? [Updated to cater for last group]
data HAVE;
input ID;
cards;
1
1
1
1
2
2
2
2
3
4
4
;
data WANT;
set HAVE nobs=NOBS;
by ID;
if _N_<=NOBS-2 then set HAVE(keep=ID rename=(ID=ID2) firstobs=3);
else ID2=.;
if ID ne ID2 and ^last.ID then FLAG=1;
run;
ID | FLAG |
---|---|
1 | . |
1 | . |
1 | 1 |
1 | . |
2 | . |
2 | . |
2 | 1 |
2 | . |
3 | . |
4 | 1 |
4 | . |
Hello @ChrisNZ , I just tried your code, it works perfectly for subjects who had more than 1 obs, it doesn't work for subjects who only had 1 ob. It's good enough for me! Thank you!
This is why it's a good idea to clearly describe what you want, instead of just a general statement ("second to last"), with sample data whose treatment is not in the description. I now see for ID=5 that you want to flag singletons. So change @ChrisNZ 's code from
FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1) ;
to
FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1) or _N_FIRSTPASS=1;
> it doesn't work for subjects who only had 1 ob.
It works exactly as you asked. As @mkeintz said, make sure your requirements are described properly when asking a question.
Glad you got what you need at the end.
Read all observations for each ID twice - set have (in=firstpass) have (in=secondpass); by id;. The first pass to get a total count of observations (_n_firstpass). The second time to generate the COUNT variable, set a flag when COUNT=_N_FIRSTPASS-1, and output.
This code is untested, in the absence of sample data presented in the form of a working data step:
data want (drop=_:);
set have (in=firstpass) have (in=secondpass);
by id;
if first.id then call missing(_n_firstpass,count);
if firstpass then _n_firstpass+1;
if secondpass;
count+1;
/* superseded by below: flag=ifn(count=_n_firstpass-1,1,0) ; */
flag=ifn(count=_n_firstpass-1,1,0) or _n_firstpass=1;
run;
The single statement
set have (in=firstpass) have (in=secondpass);
would read all of HAVE, and then reread all of HAVE. But combining it with a BY statement, as in:
set have (in=firstpass) have (in=secondpass);
by id;
would read one ID group from have, and then re-read the same group. Then it would proceed to the second group.
Another way to write @mkeintz 's clever logic.
data WANT (drop=_:);
set HAVE (in=FIRSTPASS)
HAVE ;
by ID;
if FIRSTPASS then do;
if first.ID then call missing(_N_FIRSTPASS,_N_SECONDPASS);
_N_FIRSTPASS+1;
delete;
end;
_N_SECONDPASS+1;
FLAG=(_N_SECONDPASS=_N_FIRSTPASS-1);
run;
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.