Guys, is there a way to flag consecutive years while considering gaps?
Suppose to have:
Person1 2015
Person1 2018
Person1 2019
Person1 2020
Person2 2015
Person2 2016
Person2 2018
Person2 2019
Person3 2015
Person3 2016
Person3 2017
Person3 2020
To have
Person1 2015 0
Person1 2018 1
Person1 2019 1
Person1 2020 1
Person2 2015 1
Person2 2016 1
Person2 2018 2
Person2 2019 2
Person3 2015 1
Person3 2016 1
Person3 2017 1
Person3 2020 0
Thank you in advance.
Just edited to simplify. I used 2 to distinguish 2015-2016 from 2018-2019 for Person2 because there is a gap otherwise they will be considered consecutives while using 1 but any other flag is ok.
@NewUsrStat wrote:
Just need to flag out of consecutive
Assign each sequence its own number. Then you can use BY group processing to exclude the sequences that only have one observation.
data have;
input id $ year ;
cards;
Person1 2015
Person1 2018
Person1 2019
Person1 2020
Person2 2015
Person2 2016
Person2 2018
Person2 2019
Person3 2015
Person3 2016
Person3 2017
Person3 2020
;
data want;
set have;
by id ;
sequence + dif(year) > 1;
if first.id then sequence=1;
run;
proc print;
run;
data subset;
set want;
by id sequence;
if not (first.sequence and last.sequence);
run;
proc print;
run;
I assume by "jumping years" you mean out of sequence. With an unstated requirement of "within Person"
Why is the first record marked as a flag:
Person1 2015 flag Person1 2018 0
I would expect
Person1 2015 0 Person1 2018 flag
Assuming my interpretation above is complete one way:
data have; input person $ year; datalines; Person1 2015 Person1 2018 Person1 2019 Person1 2020 Person2 2015 Person2 2016 Person2 2018 Person2 2019 Person3 2015 Person3 2016 Person3 2017 Person3 2020 ; /* assumes sorted by Person and Year*/ data want; set have; by person; dyr = dif(year); if first.person then flag=0; else flag=dyr>1; drop dyr; run;
The data step is the way to provide example data. When we have to guess we make assumptions as to variable types that may not be correct.
The DIF function returns the numeric difference between the current value of a variable and that of the previous record with some limitations.
SAS when using BY group processing in a data step adds automatic variables First. and Last. for the variables on the By statement that indicate whether the current record is the first or last of the group. These are numeric 1 (true)/ 0 (false) values and can be used as logical values as shown.
If your data is not sorted by ID but is grouped by id with the years in sequence you would add the option NOTSORTED to the BY statement as the default is that the values are sorted and will cause an error if the data is not actually sorted.
A flag value of 1 indicates difference greater than one.
Note: if you are concerned about same values in sequence this will not flag repeated values.
It is the same: I need to flag "all in sequence" or "all out of sequence" because consecutive sequences will be processed (but not "outliers"). Thank you very much!
@NewUsrStat wrote:
Just need to flag out of consecutive
Assign each sequence its own number. Then you can use BY group processing to exclude the sequences that only have one observation.
data have;
input id $ year ;
cards;
Person1 2015
Person1 2018
Person1 2019
Person1 2020
Person2 2015
Person2 2016
Person2 2018
Person2 2019
Person3 2015
Person3 2016
Person3 2017
Person3 2020
;
data want;
set have;
by id ;
sequence + dif(year) > 1;
if first.id then sequence=1;
run;
proc print;
run;
data subset;
set want;
by id sequence;
if not (first.sequence and last.sequence);
run;
proc print;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.