BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Yoko
Obsidian | Level 7

Hello, 

I have a simple data here, showing years of eligibility. 

 

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001

2 1967 2000
;
run;

 

I want to crate a data, grouping a continuous eligibility. 

So, a new data set should look like this: 

id fromyear toyear

1 1965 1968

1 2000 2001

2 1967 2000

 

Could anyone tell me how to get this? 

 

Thank you, 

 

Yoko

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
run;

 

data w;
set mydata;
by id;
retain c1 c2;
lag=lag(toyear);
if first.id then do; c1=fromyear;c2=toyear;end;
else if fromyear-lag in (1,0) then c2=toyear;
else  do; c1=fromyear;c2=toyear;end;
keep id c:;
run;

data want;
set w;
by id c1 notsorted;
if last.c1;
run;

View solution in original post

13 REPLIES 13
Shmuel
Garnet | Level 18

Try next not tested code:

data want;
 set have(rename=(year1=fromyear year2=toyear));
   by id;
        retain year1 year2;
        keep id year1 year2;

        if first.if then do;
           year1 = fromyear;
           year2 = toyear;
       end;
       else do;
          if last.id then do;
year2 = toyear;
output;
end; else if fromyear ^= (year2 + 1) then do; output; year1 = fromyear; year2 = toyear; end; else year2 = toyear; end; run;
Yoko
Obsidian | Level 7

Thank you for your suggestion.  Unfortunately, it did not work.  But, your suggestion seems to be similar to the one suggested by novinosrin. 

novinosrin
Tourmaline | Level 20
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
run;

 

data w;
set mydata;
by id;
retain c1 c2;
lag=lag(toyear);
if first.id then do; c1=fromyear;c2=toyear;end;
else if fromyear-lag in (1,0) then c2=toyear;
else  do; c1=fromyear;c2=toyear;end;
keep id c:;
run;

data want;
set w;
by id c1 notsorted;
if last.c1;
run;
PGStats
Opal | Level 21

This works even if your eligibility periods overlap:

 

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;

data myNewData;
array y {1950:2050};
do until(last.id);
    set myData; by id;
    do i = fromYear to toYear;
        y{i} = 1;
        end;
    end;
call missing(fromYear, toYear);
do i = lbound(y) to hbound(y);
    if y{i} then do;
        if missing(fromYear) then fromYear = i;
        end;
    else do;
        if not missing(fromYear) then do;
            toYear = i - 1;
            output;
            call missing(fromYear, toYear);
            end;
        end;
    end;
drop i y:;
run;


proc print data=myNewData noobs; run;
PG
novinosrin
Tourmaline | Level 20

Sir ProdigyGeniusstats aka PG, shot!!!!!!!!!!

linear,neat and stroke!!!!!!!

 

Understood the idea. Class act!!!!!!!!!!!

 

 

Would a temp array make it even faster?

 

array y {1950:2050} _temporary_;

 

PGStats
Opal | Level 21

Probably, a bit. But you would have to explicitly reset the array to missing on every data step iteration.

PG
novinosrin
Tourmaline | Level 20

Oh yes true 

Yoko
Obsidian | Level 7

Thank you for your suggestions.  Both worked!  

mkeintz
PROC Star

This solution relies on understanding the "firstobs=2" option, and on how to populate the program data vector:

 

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;

data want (drop=_:);
  set mydata (keep=id);
  by id;

  retain fromyear;

  merge mydata (rename=(fromyear=_curr_fromyear))
        mydata (firstobs=2 keep=fromyear rename=(fromyear=_nxt_fromyear));

  if first.id or _curr_fromyear>sum(lag(toyear),1) then fromyear=_curr_fromyear;
  if last.id or _nxt_fromyear>toyear+1;

run;

 

  1. The "set mydata (keep=id); by id;" statements are there simply to provide access to the first.id and last.id dummies.
  2. The merge of mydata with itself has the second instance of mydata with "firstobs=2".  So you're reading obs 1 and 2 together, then obs 2 and 3 together, then 3 and 4, etc.  But of course they have the same variables, so to avoid collisions the current record (without the firstobs=2) renames fromyear to _curr_fromyear, and the following record (with firstobs=2) renames it to _nxt_curryear.
  3. Whenever you begin a new id, of have a record with _curr_fromyear more than a year after the prior toyear, [specified as lag(toyear)], then you're at the beginning of a time span, so assign a value to fromyear, which is retained across observations.
  4. And whenever you encounter the end of an id, or find the next fromyear more than a year after the current toyear, then allow the observation through the subsetting IF statement for output to want.

Editted note.  This will also work if periods overlap, as long as (1) data is sorted by ID/FROMYEAR, and (2) no TOYEAR precedes TOYEAR from a prior record.  (i.e. data is also sorted by ID/TOYEAR). 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

Taking care of every sort of overlap can also be done while reading the data only once with:

 

data myNewData2;
do until (last.id);
    set myData; by id fromYear; /* Check: fromYear is not decreasing */
    if fromYear - 1 > ty then do;
        if not missing(ty) then output;
        fy = fromYear; ty = toYear;
        end;
    else ty = max(toYear, ty);
    end;
output;
drop fromYear toYear;
rename fy=fromYear ty=toYear;
run;
PG
mkeintz
PROC Star

@PGStats

 

I agree that "reading" a data set twice risks excessive use of disk input activity, and therefore should be avoided, as your program does.  But just to be clear, the presence of both a SET and a MERGE statement in the program I suggested should not represent much increased disk input activity, because of how the operating system satisfies disk input requests.

 

Because the SET and MERGE are tightly synchronized in this program, they are reading from the same disk page.  So once a page of data is put in memory by the operating system, both the SET and MERGE will populate the program data vector from that memory page.  So yes, there are two streams of data established in the program, but the impact on actual resource use should be minimal.  I haven't tested this understanding for quite a while, but I'd be surprised if there would be much increase in disk transfers in the example I suggested.

 

I appreciate you raising the issue.  I should have put a caveat with my program, because I'm sure that many sas users would see my code as having the same resource impact as two independent complete loops through the data - one for the SET, and one for the MERGE.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Yoko
Obsidian | Level 7

Thank you for your suggestion.  It worked!  

Ksharp
Super User
DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
data temp;
 set mydata;
 by id;
 if first.id or fromyear-lag(toyear)>1 then group+1;
run;
data want;
 set temp;
 by group;
 retain _fromyear;
 if first.group then _fromyear=fromyear;
 if last.group;
 drop fromyear group;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2159 views
  • 2 likes
  • 6 in conversation