Hi!
I want to count the maximum number of consecutive 1:s in my dataset. See example of have and want dataset. My dataset(s) have millions of rows so I want to avoid transposing the data. I have seen similar solutions of this problem here but I don't quite understand them. Is there someone who can help me with the code needed to calculate the variable maxYears?
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
run;
data want;
input id yr2010-yr2019 maxYears;
datalines;
1 . . . . . . . . 1 1 2
2 1 1 1 . . . 1 1 1 1 4
3 1 1 1 1 1 . . . 1 1 5
4 . . . . . . 1 . . . 1
;
run;
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
data want;
set have;
temp=cats(of yr:);
pid=prxparse('/1+/');
s=1;e=length(temp);
call prxnext(pid,s,e,temp,p,l);
do while(p>0);
maxYears=max(maxYears,l);
call prxnext(pid,s,e,temp,p,l);
end;
drop pid s e p l temp;
run;
I have seen similar solutions of this problem here but I don't quite understand them.
Did you try them? Do they work? Don't make us re-invent a solution if there already is a solution (which we could then explain to you).
Here is a solution that could be explained:
Sorry, but I can't translate the similar solutions I have found to my specific problem.
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
data want;
set have;
temp=cats(of yr:);
pid=prxparse('/1+/');
s=1;e=length(temp);
call prxnext(pid,s,e,temp,p,l);
do while(p>0);
maxYears=max(maxYears,l);
call prxnext(pid,s,e,temp,p,l);
end;
drop pid s e p l temp;
run;
I think @Astounding 's code is better.
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
data want;
set have;
array yrs {*} yr2010 - yr2019;
maxyears = 0;
count = 0;
do _n_=1 to dim(yrs);
if yrs{_n_} = 1 then count + 1;
else do; maxyears = max(maxyears, count);
count = 0;
end;
end;
maxyears = max(maxyears, count);
drop count;
run;
A straightforward way:
data want;
set have;
array yrs {2010:2019} yr2010 - yr2019;
maxyears = 0;
count = 0;
do _n_=2010 to 2019;
if yrs{_n_} = 1 then count + 1;
else do;
maxyears = max(maxyears, count);
count = 0;
end;
end;
maxyears = max(maxyears, count);
drop count;
run;
I see your array statement, and raise it by a repeat function (😉😞
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
run;
data want (drop=_:);
set have;
length _strng $10;
_strng=cats(of yr:);
do LEN=length(_strng) to 1 by -1 while (find(_strng,repeat('1',LEN-1))=0);
end;
run;
If the absent years really are always represented with a standard missing value, then your could (1) concatenate all the year values into a character string, call it _STRNG, (2) calculate the length of each "word" in _STRNG, where a word is a substring containing no separators (no "." in this case):
data want ;
set have;
length _strng $10;
_strng=cats(of yr:);
if findc(_strng,'1') then do _w=1 to countw(_strng,'.');
maxyears=max(maxyears,length(scan(_strng,_w)));
end;
run;
Similar to what others already suggested with some logic added to derive the start and stop year for the longest period.
data have;
input id yr2010-yr2019;
length str $10 word $10;
array yrs {*} yr2010-yr2019;
/* find longest string */
str=cats(of yr:);
do i=1 by 1;
word=scan(str,i,,'kd');
if missing(word) then leave;
max_len=max(max_len,length(word));
end;
/* find start and stop variables (years) for longest period */
start=find(str,repeat('1',max_len-1));
stop=start+max_len-1;
from_year =input(compress(vname(yrs[start]),,'kd'),best32.);
to_year =input(compress(vname(yrs[stop]),,'kd'),best32.);
drop str word i start stop;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
Thanks for this extra info with start and end years, could turn out to be really useful for this research project.
@vanja BTW: It wasn't in your sample data but what should happen if there is more than one max period? Which one would you want to select?
@vanja Define how such cases should be treated.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.