Hello,
I have a property-year level dataset that has inconsistent reporting between 2000 and 2018. My goal is to look at each property (ID), and find the largest "chunk" of consecutive years, as well as mark the middle year of that chunk.
So far, I have created a variable that is 1 for consecutive years and 0 for non-consecutive years.
proc sort data=have;
by ID year;
run;
data have;
set have;
if lag_ID=ID then do;
if lag_year=year-1 then cons=1;
else cons=0; end;
else if lag_ID^=ID then do;
cons=1; end;
run;
My thought is to create another variable, named "chunk", which is 1 for all of the first chunk of an ID, then 2 for the next chunk, and so on. Once I have that, I can sum "cons" by ID and chunk to find the largest chunk for each ID. Creating the "chunk" variable is where I'm having trouble. I'm open to other ideas of how to find the largest chunk.
Here is a sample of the data:
data have;
informat ID 1. year 4.;
input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
In a single data step:
data want;
/*
ccs : current chunk start
cce : current chunk end
lcs : largest chunk start
lce : largest chunk end
*/
do until(last.id);
set have; by id;
if year-1 > cce then ccs = year;
cce = year;
if range(cce, lce) >= range (ccs, lcs) then do;
lce = cce;
lcs = ccs;
end;
end;
do until(last.id);
set have; by id;
inChunk = lcs <= year <= lce;
midChunk = year = round(mean(lcs, lce));
output;
end;
drop ccs cce lcs lce;
run;
HI @jss539 Assuming I understand your requirement, the following should give you the sequence to measure the chunk count. If this is correct, then next step is just to take the max by group.
data have;
informat ID 1. year 4.;
input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
data want;
set have;
by id;
k=dif(year);
if first.id then grp=0;
if first.id or k ne 1 then c=1;
else if k=1 then c+1;
if c=1 then grp+1;
drop k;
run;
Something like-
data have;
informat ID 1. year 4.;
input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
data want;
set have;
by id;
k=dif(year);
if first.id then grp=0;
if first.id or k ne 1 then c=1;
else if k=1 then c+1;
if c=1 then grp+1;
drop k c;
run;
proc sql;
create table final_want(drop=n) as
select *
from ( select *,count(grp) as n from want group by id, grp)
group by id
having max(n)=n;
quit;
ID | year |
---|---|
1 | 2004 |
1 | 2005 |
1 | 2006 |
1 | 2007 |
1 | 2008 |
2 | 2000 |
2 | 2001 |
2 | 2002 |
2 | 2003 |
2 | 2004 |
2 | 2005 |
2 | 2006 |
2 | 2007 |
2 | 2008 |
3 | 2001 |
3 | 2002 |
3 | 2003 |
3 | 2016 |
3 | 2017 |
3 | 2018 |
This will generate a record for each ID indicating the max streak, the year the streak ends and the size of the streak - so you can easily do the math to figure out the year. This uses basic SAS programming, nothing fancy here so it should be clear but if anything isn't feel free to ask.
proc sort data=have;
by ID year;
run;
data want;
set have;
by id;
retain max_streak max_year;
prev_year = lag(year);
if first.id then call missing(prev_year, max_streak, max_year, middle_year);
if year - 1 = lag(year) then streak+1;
else streak = 0;
if streak >= max_streak then do;
max_streak = streak;
max_year = year;
end;
if last.id then do;
middle_year = max_year - max_streak/2;
output;
end;
keep id max_streak max_year middle_year;
run;
In a single data step:
data want;
/*
ccs : current chunk start
cce : current chunk end
lcs : largest chunk start
lce : largest chunk end
*/
do until(last.id);
set have; by id;
if year-1 > cce then ccs = year;
cce = year;
if range(cce, lce) >= range (ccs, lcs) then do;
lce = cce;
lcs = ccs;
end;
end;
do until(last.id);
set have; by id;
inChunk = lcs <= year <= lce;
midChunk = year = round(mean(lcs, lce));
output;
end;
drop ccs cce lcs lce;
run;
Because you know the time domain of possible years (2000:2018) you can use an array indexed by year, with each element of the array being the number of consecutive years up to and including the element. After reading all the obs for an id, find the maximum size in the array, determine its element as the end year of the maximum size, and calculate the corresponding beginning year.
The reread all the years for the same id, keeping only those between maxsizbeg and maxsizend:
data have;
informat ID 1. year 4.;
input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
data want;
array siz {1999:2018} _temporary_;
set have;
by id;
if first.id then call missing(of siz{*});
siz{year} = sum(siz{year-1},1);
if last.id;
maxsiz=max(of siz{*});
maxsizbeg = lbound(siz) + whichn(maxsiz,of siz{*}) - maxsiz ;
maxsizend = maxsizbeg + maxsiz - 1;
do until (last.id); /* Reread and filter this id*/
set have;
by id;
if maxsizbeg<=year<=maxsizend then output;
end;
run;
The "trick" here is to create an array with a lower bound of one year prior to your earliest data (i.e. 1999) and an upper bound of the last year in your data (2018). (You could have an even smaller lower-bound and higher upper bound with no harm).
The statement :
siz{year} = sum(siz{year-1},1);
assigns a size value for the current year equal to one greater than the prior year's size value. But if the prior year is never encountered, then its size value is missing. Since the sum function of 1 plus missing is 1, it means the current size is 1 - i.e. the start of a new time span.
At the end of an id, get the maximum size, find out where it is in the array [whichn(maxsize,of siz{*})], then determine the corresponding maxsizbeg year and maxsizend year.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.