Solved: Finding the largest "chunk" of consecutive year for each ID

jss539 · Posted 01-21-2021 02:51 PM

Hello,

I have a property-year level dataset that has inconsistent reporting between 2000 and 2018. My goal is to look at each property (ID), and find the largest "chunk" of consecutive years, as well as mark the middle year of that chunk.

So far, I have created a variable that is 1 for consecutive years and 0 for non-consecutive years.

proc sort data=have;
	by ID year;
run;

data have;
	set have;
	if lag_ID=ID then do;
		if lag_year=year-1 then cons=1;
			else cons=0; end;
		else if lag_ID^=ID then do;
			cons=1; end;
run;

My thought is to create another variable, named "chunk", which is 1 for all of the first chunk of an ID, then 2 for the next chunk, and so on. Once I have that, I can sum "cons" by ID and chunk to find the largest chunk for each ID. Creating the "chunk" variable is where I'm having trouble. I'm open to other ideas of how to find the largest chunk.

Here is a sample of the data:

data have;
	informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;

PGStats · Posted 01-21-2021 05:29 PM

In a single data step:


data want;
/*
    ccs : current chunk start
    cce : current chunk end
    lcs : largest chunk start
    lce : largest chunk end
    */
do until(last.id);
    set have; by id;
    if year-1 > cce then ccs = year;        
    cce = year;
    if range(cce, lce) >= range (ccs, lcs) then do;
        lce = cce;
        lcs = ccs;
        end;
    end;
do until(last.id);
    set have; by id;
    inChunk = lcs <= year <= lce; 
    midChunk = year = round(mean(lcs, lce));
    output;
    end;
drop ccs cce lcs lce;    
run;

PG

View solution in original post

novinosrin · Posted 01-21-2021 03:43 PM

HI @jss539 Assuming I understand your requirement, the following should give you the sequence to measure the chunk count. If this is correct, then next step is just to take the max by group.


data have;
	informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
data want;
 set have;
 by id;
 k=dif(year);
 if first.id then grp=0;
 if first.id or k ne 1 then c=1;
 else if k=1 then c+1;
 if c=1 then grp+1;
 drop k;
run;

novinosrin · Posted 01-21-2021 03:49 PM

Something like-


data have;
	informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
data want;
 set have;
 by id;
 k=dif(year);
 if first.id then grp=0;
 if first.id or k ne 1 then c=1;
 else if k=1 then c+1;
 if c=1 then grp+1;
 drop k c;
run;

proc sql;
 create table final_want(drop=n) as
 select *
 from ( select *,count(grp) as n from want group by id, grp)
 group by id
 having max(n)=n;
quit;

ID	year
1	2004
1	2005
1	2006
1	2007
1	2008
2	2000
2	2001
2	2002
2	2003
2	2004
2	2005
2	2006
2	2007
2	2008
3	2001
3	2002
3	2003
3	2016
3	2017
3	2018

Reeza · Posted 01-21-2021 03:50 PM

This will generate a record for each ID indicating the max streak, the year the streak ends and the size of the streak - so you can easily do the math to figure out the year. This uses basic SAS programming, nothing fancy here so it should be clear but if anything isn't feel free to ask.

proc sort data=have;
	by ID year;
run;

data want;
	set have;
	by id;
	
	retain max_streak max_year;

    prev_year = lag(year);
    
    if first.id then call missing(prev_year, max_streak, max_year, middle_year);
    
    if year - 1 = lag(year) then streak+1;
    else streak = 0;
    
    if streak >= max_streak then do;
           max_streak = streak;
           max_year = year;
    end;
    
    

     if last.id then do;
         middle_year = max_year - max_streak/2;
         output;
     end;
     
     keep id max_streak max_year middle_year;
run;

PGStats · Posted 01-21-2021 05:29 PM

In a single data step:


data want;
/*
    ccs : current chunk start
    cce : current chunk end
    lcs : largest chunk start
    lce : largest chunk end
    */
do until(last.id);
    set have; by id;
    if year-1 > cce then ccs = year;        
    cce = year;
    if range(cce, lce) >= range (ccs, lcs) then do;
        lce = cce;
        lcs = ccs;
        end;
    end;
do until(last.id);
    set have; by id;
    inChunk = lcs <= year <= lce; 
    midChunk = year = round(mean(lcs, lce));
    output;
    end;
drop ccs cce lcs lce;    
run;

PG

mkeintz · Posted 01-21-2021 08:20 PM

Because you know the time domain of possible years (2000:2018) you can use an array indexed by year, with each element of the array being the number of consecutive years up to and including the element. After reading all the obs for an id, find the maximum size in the array, determine its element as the end year of the maximum size, and calculate the corresponding beginning year.

The reread all the years for the same id, keeping only those between maxsizbeg and maxsizend:

data have;
    informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;

  
data want;
  array siz {1999:2018} _temporary_; 

  set have;
  by id;
  if first.id then call missing(of siz{*});

  siz{year} = sum(siz{year-1},1);

  if last.id;
  maxsiz=max(of siz{*});
  maxsizbeg = lbound(siz) + whichn(maxsiz,of siz{*}) - maxsiz ;
  maxsizend = maxsizbeg + maxsiz - 1;

  do until (last.id); /* Reread and filter this id*/
    set have;
    by id;
    if maxsizbeg<=year<=maxsizend then output;
  end;
run;

The "trick" here is to create an array with a lower bound of one year prior to your earliest data (i.e. 1999) and an upper bound of the last year in your data (2018). (You could have an even smaller lower-bound and higher upper bound with no harm).

The statement :

siz{year} = sum(siz{year-1},1);

assigns a size value for the current year equal to one greater than the prior year's size value. But if the prior year is never encountered, then its size value is missing. Since the sum function of 1 plus missing is 1, it means the current size is 1 - i.e. the start of a new time span.

At the end of an id, get the maximum size, find out where it is in the array [whichn(maxsize,of siz{*})], then determine the corresponding maxsizbeg year and maxsizend year.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Re: Finding the largest "chunk" of consecutive year for each ID

Registration is open

SAS Training: Just a Click Away