Hi!
I want to count the maximum number of consecutive 1:s in my dataset. See example of have and want dataset. My dataset(s) have millions of rows so I want to avoid transposing the data. I have seen similar solutions of this problem here but I don't quite understand them. Is there someone who can help me with the code needed to calculate the variable maxYears?
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
run;
data want;
input id yr2010-yr2019 maxYears;
datalines;
1 . . . . . . . . 1 1 2
2 1 1 1 . . . 1 1 1 1 4
3 1 1 1 1 1 . . . 1 1 5
4 . . . . . . 1 . . . 1
;
run;
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
data want;
set have;
temp=cats(of yr:);
pid=prxparse('/1+/');
s=1;e=length(temp);
call prxnext(pid,s,e,temp,p,l);
do while(p>0);
maxYears=max(maxYears,l);
call prxnext(pid,s,e,temp,p,l);
end;
drop pid s e p l temp;
run;
I have seen similar solutions of this problem here but I don't quite understand them.
Did you try them? Do they work? Don't make us re-invent a solution if there already is a solution (which we could then explain to you).
Here is a solution that could be explained:
Sorry, but I can't translate the similar solutions I have found to my specific problem.
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
data want;
set have;
temp=cats(of yr:);
pid=prxparse('/1+/');
s=1;e=length(temp);
call prxnext(pid,s,e,temp,p,l);
do while(p>0);
maxYears=max(maxYears,l);
call prxnext(pid,s,e,temp,p,l);
end;
drop pid s e p l temp;
run;
I think @Astounding 's code is better.
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
data want;
set have;
array yrs {*} yr2010 - yr2019;
maxyears = 0;
count = 0;
do _n_=1 to dim(yrs);
if yrs{_n_} = 1 then count + 1;
else do; maxyears = max(maxyears, count);
count = 0;
end;
end;
maxyears = max(maxyears, count);
drop count;
run;
A straightforward way:
data want;
set have;
array yrs {2010:2019} yr2010 - yr2019;
maxyears = 0;
count = 0;
do _n_=2010 to 2019;
if yrs{_n_} = 1 then count + 1;
else do;
maxyears = max(maxyears, count);
count = 0;
end;
end;
maxyears = max(maxyears, count);
drop count;
run;
I see your array statement, and raise it by a repeat function (😉😞
data have;
input id yr2010-yr2019;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
run;
data want (drop=_:);
set have;
length _strng $10;
_strng=cats(of yr:);
do LEN=length(_strng) to 1 by -1 while (find(_strng,repeat('1',LEN-1))=0);
end;
run;
If the absent years really are always represented with a standard missing value, then your could (1) concatenate all the year values into a character string, call it _STRNG, (2) calculate the length of each "word" in _STRNG, where a word is a substring containing no separators (no "." in this case):
data want ;
set have;
length _strng $10;
_strng=cats(of yr:);
if findc(_strng,'1') then do _w=1 to countw(_strng,'.');
maxyears=max(maxyears,length(scan(_strng,_w)));
end;
run;
Similar to what others already suggested with some logic added to derive the start and stop year for the longest period.
data have;
input id yr2010-yr2019;
length str $10 word $10;
array yrs {*} yr2010-yr2019;
/* find longest string */
str=cats(of yr:);
do i=1 by 1;
word=scan(str,i,,'kd');
if missing(word) then leave;
max_len=max(max_len,length(word));
end;
/* find start and stop variables (years) for longest period */
start=find(str,repeat('1',max_len-1));
stop=start+max_len-1;
from_year =input(compress(vname(yrs[start]),,'kd'),best32.);
to_year =input(compress(vname(yrs[stop]),,'kd'),best32.);
drop str word i start stop;
datalines;
1 . . . . . . . . 1 1
2 1 1 1 . . . 1 1 1 1
3 1 1 1 1 1 . . . 1 1
4 . . . . . . 1 . . .
;
Thanks for this extra info with start and end years, could turn out to be really useful for this research project.
@vanja BTW: It wasn't in your sample data but what should happen if there is more than one max period? Which one would you want to select?
@vanja Define how such cases should be treated.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.