Hello,
I have a simple data here, showing years of eligibility.
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
run;
I want to crate a data, grouping a continuous eligibility.
So, a new data set should look like this:
id fromyear toyear
1 1965 1968
1 2000 2001
2 1967 2000
Could anyone tell me how to get this?
Thank you,
Yoko
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
run;
data w;
set mydata;
by id;
retain c1 c2;
lag=lag(toyear);
if first.id then do; c1=fromyear;c2=toyear;end;
else if fromyear-lag in (1,0) then c2=toyear;
else do; c1=fromyear;c2=toyear;end;
keep id c:;
run;
data want;
set w;
by id c1 notsorted;
if last.c1;
run;
Try next not tested code:
data want;
set have(rename=(year1=fromyear year2=toyear));
by id;
retain year1 year2;
keep id year1 year2;
if first.if then do;
year1 = fromyear;
year2 = toyear;
end;
else do;
if last.id then do;
year2 = toyear;
output;
end; else
if fromyear ^= (year2 + 1)
then do;
output;
year1 = fromyear;
year2 = toyear;
end;
else year2 = toyear;
end;
run;
Thank you for your suggestion. Unfortunately, it did not work. But, your suggestion seems to be similar to the one suggested by novinosrin.
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
run;
data w;
set mydata;
by id;
retain c1 c2;
lag=lag(toyear);
if first.id then do; c1=fromyear;c2=toyear;end;
else if fromyear-lag in (1,0) then c2=toyear;
else do; c1=fromyear;c2=toyear;end;
keep id c:;
run;
data want;
set w;
by id c1 notsorted;
if last.c1;
run;
This works even if your eligibility periods overlap:
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
data myNewData;
array y {1950:2050};
do until(last.id);
set myData; by id;
do i = fromYear to toYear;
y{i} = 1;
end;
end;
call missing(fromYear, toYear);
do i = lbound(y) to hbound(y);
if y{i} then do;
if missing(fromYear) then fromYear = i;
end;
else do;
if not missing(fromYear) then do;
toYear = i - 1;
output;
call missing(fromYear, toYear);
end;
end;
end;
drop i y:;
run;
proc print data=myNewData noobs; run;
Sir ProdigyGeniusstats aka PG, shot!!!!!!!!!!
linear,neat and stroke!!!!!!!
Understood the idea. Class act!!!!!!!!!!!
Would a temp array make it even faster?
array y {1950:2050} _temporary_;
Probably, a bit. But you would have to explicitly reset the array to missing on every data step iteration.
Oh yes true
Thank you for your suggestions. Both worked!
This solution relies on understanding the "firstobs=2" option, and on how to populate the program data vector:
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
data want (drop=_:);
set mydata (keep=id);
by id;
retain fromyear;
merge mydata (rename=(fromyear=_curr_fromyear))
mydata (firstobs=2 keep=fromyear rename=(fromyear=_nxt_fromyear));
if first.id or _curr_fromyear>sum(lag(toyear),1) then fromyear=_curr_fromyear;
if last.id or _nxt_fromyear>toyear+1;
run;
Editted note. This will also work if periods overlap, as long as (1) data is sorted by ID/FROMYEAR, and (2) no TOYEAR precedes TOYEAR from a prior record. (i.e. data is also sorted by ID/TOYEAR).
Taking care of every sort of overlap can also be done while reading the data only once with:
data myNewData2;
do until (last.id);
set myData; by id fromYear; /* Check: fromYear is not decreasing */
if fromYear - 1 > ty then do;
if not missing(ty) then output;
fy = fromYear; ty = toYear;
end;
else ty = max(toYear, ty);
end;
output;
drop fromYear toYear;
rename fy=fromYear ty=toYear;
run;
I agree that "reading" a data set twice risks excessive use of disk input activity, and therefore should be avoided, as your program does. But just to be clear, the presence of both a SET and a MERGE statement in the program I suggested should not represent much increased disk input activity, because of how the operating system satisfies disk input requests.
Because the SET and MERGE are tightly synchronized in this program, they are reading from the same disk page. So once a page of data is put in memory by the operating system, both the SET and MERGE will populate the program data vector from that memory page. So yes, there are two streams of data established in the program, but the impact on actual resource use should be minimal. I haven't tested this understanding for quite a while, but I'd be surprised if there would be much increase in disk transfers in the example I suggested.
I appreciate you raising the issue. I should have put a caveat with my program, because I'm sure that many sas users would see my code as having the same resource impact as two independent complete loops through the data - one for the SET, and one for the MERGE.
Thank you for your suggestion. It worked!
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
data temp;
set mydata;
by id;
if first.id or fromyear-lag(toyear)>1 then group+1;
run;
data want;
set temp;
by group;
retain _fromyear;
if first.group then _fromyear=fromyear;
if last.group;
drop fromyear group;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: