BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jss539
Fluorite | Level 6

Hello,

 

I have a property-year level dataset that has inconsistent reporting between 2000 and 2018.  My goal is to look at each property (ID), and find the largest "chunk" of consecutive years, as well as mark the middle year of that chunk.

 

So far, I have created a variable that is 1 for consecutive years and 0 for non-consecutive years.  

 

proc sort data=have;
	by ID year;
run;

data have;
	set have;
	if lag_ID=ID then do;
		if lag_year=year-1 then cons=1;
			else cons=0; end;
		else if lag_ID^=ID then do;
			cons=1; end;
run;

My thought is to create another variable, named "chunk", which is 1 for all of the first chunk of an ID, then 2 for the next chunk, and so on.  Once I have that, I can sum "cons" by ID and chunk to find the largest chunk for each ID.  Creating the "chunk" variable is where I'm having trouble.  I'm open to other ideas of how to find the largest chunk.  

Here is a sample of the data:

data have;
	informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

In a single data step:

 


data want;
/*
    ccs : current chunk start
    cce : current chunk end
    lcs : largest chunk start
    lce : largest chunk end
    */
do until(last.id);
    set have; by id;
    if year-1 > cce then ccs = year;        
    cce = year;
    if range(cce, lce) >= range (ccs, lcs) then do;
        lce = cce;
        lcs = ccs;
        end;
    end;
do until(last.id);
    set have; by id;
    inChunk = lcs <= year <= lce; 
    midChunk = year = round(mean(lcs, lce));
    output;
    end;
drop ccs cce lcs lce;    
run;

PGStats_0-1611268143131.png

 

PG

View solution in original post

5 REPLIES 5
novinosrin
Tourmaline | Level 20

HI @jss539  Assuming I understand your requirement, the following should give you the sequence to measure the chunk count. If this is correct, then next step is just to take the max by group.


data have;
	informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
data want;
 set have;
 by id;
 k=dif(year);
 if first.id then grp=0;
 if first.id or k ne 1 then c=1;
 else if k=1 then c+1;
 if c=1 then grp+1;
 drop k;
run;
novinosrin
Tourmaline | Level 20

Something like-


data have;
	informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;
data want;
 set have;
 by id;
 k=dif(year);
 if first.id then grp=0;
 if first.id or k ne 1 then c=1;
 else if k=1 then c+1;
 if c=1 then grp+1;
 drop k c;
run;

proc sql;
 create table final_want(drop=n) as
 select *
 from ( select *,count(grp) as n from want group by id, grp)
 group by id
 having max(n)=n;
quit;
ID year
1 2004
1 2005
1 2006
1 2007
1 2008
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
Reeza
Super User

This will generate a record for each ID indicating the max streak, the year the streak ends and the size of the streak - so you can easily do the math to figure out the year. This uses basic SAS programming, nothing fancy here so it should be clear but if anything isn't feel free to ask. 

 

proc sort data=have;
	by ID year;
run;

data want;
	set have;
	by id;
	
	retain max_streak max_year;

    prev_year = lag(year);
    
    if first.id then call missing(prev_year, max_streak, max_year, middle_year);
    
    if year - 1 = lag(year) then streak+1;
    else streak = 0;
    
    if streak >= max_streak then do;
           max_streak = streak;
           max_year = year;
    end;
    
    

     if last.id then do;
         middle_year = max_year - max_streak/2;
         output;
     end;
     
     keep id max_streak max_year middle_year;
run;
PGStats
Opal | Level 21

In a single data step:

 


data want;
/*
    ccs : current chunk start
    cce : current chunk end
    lcs : largest chunk start
    lce : largest chunk end
    */
do until(last.id);
    set have; by id;
    if year-1 > cce then ccs = year;        
    cce = year;
    if range(cce, lce) >= range (ccs, lcs) then do;
        lce = cce;
        lcs = ccs;
        end;
    end;
do until(last.id);
    set have; by id;
    inChunk = lcs <= year <= lce; 
    midChunk = year = round(mean(lcs, lce));
    output;
    end;
drop ccs cce lcs lce;    
run;

PGStats_0-1611268143131.png

 

PG
mkeintz
PROC Star

Because you know the time domain of possible years (2000:2018) you can use an array indexed by year, with each element of the array being the number of consecutive years up to and including the element.  After reading all the obs for an id, find the maximum size in the array, determine its element as the end year of the maximum size, and calculate the corresponding beginning year.

 

The reread all the years for the same id, keeping only those between maxsizbeg and maxsizend:

 

data have;
    informat ID 1. year 4.;
	input ID year;
cards;
1 2000
1 2001
1 2002
1 2004
1 2005
1 2006
1 2007
1 2008
1 2010
1 2011
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007
2 2008
2 2010
2 2017
3 2001
3 2002
3 2003
3 2016
3 2017
3 2018
;

  
data want;
  array siz {1999:2018} _temporary_; 

  set have;
  by id;
  if first.id then call missing(of siz{*});

  siz{year} = sum(siz{year-1},1);

  if last.id;
  maxsiz=max(of siz{*});
  maxsizbeg = lbound(siz) + whichn(maxsiz,of siz{*}) - maxsiz ;
  maxsizend = maxsizbeg + maxsiz - 1;

  do until (last.id); /* Reread and filter this id*/
    set have;
    by id;
    if maxsizbeg<=year<=maxsizend then output;
  end;
run;

The "trick" here is to create an array with a lower bound of one year prior to your earliest data (i.e. 1999) and an upper bound of the last year in your data (2018).  (You could have an even smaller lower-bound and higher upper bound with no harm).  

 

The statement :

siz{year} = sum(siz{year-1},1);

assigns a size value for the current year equal to one greater than the prior year's size value.  But if the prior year is never encountered, then its size value is missing.  Since the sum function of 1 plus missing is 1, it means the current size is 1 - i.e. the start of a new time span.

 

At the end of an id, get the maximum size, find out where it is in the array [whichn(maxsize,of siz{*})], then determine the corresponding maxsizbeg year and maxsizend year.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 822 views
  • 2 likes
  • 5 in conversation