Solved: Re: Check continuous year/date

Yoko · Posted 11-13-2018 02:43 PM

Hello,

I have a simple data here, showing years of eligibility.

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001

2 1967 2000
;
run;

I want to crate a data, grouping a continuous eligibility.

So, a new data set should look like this:

id fromyear toyear

1 1965 1968

1 2000 2001

2 1967 2000

Could anyone tell me how to get this?

Thank you,

Yoko

novinosrin · Posted 11-13-2018 05:00 PM

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
run;

 

data w;
set mydata;
by id;
retain c1 c2;
lag=lag(toyear);
if first.id then do; c1=fromyear;c2=toyear;end;
else if fromyear-lag in (1,0) then c2=toyear;
else  do; c1=fromyear;c2=toyear;end;
keep id c:;
run;

data want;
set w;
by id c1 notsorted;
if last.c1;
run;

View solution in original post

Shmuel · Posted 11-13-2018 03:18 PM

Try next not tested code:

data want;
 set have(rename=(year1=fromyear year2=toyear));
   by id;
        retain year1 year2;
        keep id year1 year2;

        if first.if then do;
           year1 = fromyear;
           year2 = toyear;
       end;
       else do;
          if last.id then do;
             year2 = toyear;
             output;  
          end; else
          if fromyear ^= (year2 + 1)
          then do;
                  output;
                  year1 = fromyear;
                  year2 = toyear;
          end;
          else  year2 = toyear;
      end;
run;

Yoko · Posted 11-14-2018 08:34 AM

Thank you for your suggestion. Unfortunately, it did not work. But, your suggestion seems to be similar to the one suggested by novinosrin.

novinosrin · Posted 11-13-2018 05:00 PM

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
run;

 

data w;
set mydata;
by id;
retain c1 c2;
lag=lag(toyear);
if first.id then do; c1=fromyear;c2=toyear;end;
else if fromyear-lag in (1,0) then c2=toyear;
else  do; c1=fromyear;c2=toyear;end;
keep id c:;
run;

data want;
set w;
by id c1 notsorted;
if last.c1;
run;

PGStats · Posted 11-13-2018 06:17 PM

This works even if your eligibility periods overlap:

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;

data myNewData;
array y {1950:2050};
do until(last.id);
    set myData; by id;
    do i = fromYear to toYear;
        y{i} = 1;
        end;
    end;
call missing(fromYear, toYear);
do i = lbound(y) to hbound(y);
    if y{i} then do;
        if missing(fromYear) then fromYear = i;
        end;
    else do;
        if not missing(fromYear) then do;
            toYear = i - 1;
            output;
            call missing(fromYear, toYear);
            end;
        end;
    end;
drop i y:;
run;


proc print data=myNewData noobs; run;

PG

novinosrin · Posted 11-13-2018 06:32 PM

Sir ProdigyGeniusstats aka PG, shot!!!!!!!!!!

linear,neat and stroke!!!!!!!

Understood the idea. Class act!!!!!!!!!!!

Would a temp array make it even faster?

array y {1950:2050} _temporary_;

PGStats · Posted 11-13-2018 06:40 PM

Probably, a bit. But you would have to explicitly reset the array to missing on every data step iteration.

PG

novinosrin · Posted 11-13-2018 06:40 PM

Oh yes true

Yoko · Posted 11-14-2018 08:36 AM

Thank you for your suggestions. Both worked!

mkeintz · Posted 11-14-2018 12:08 AM

This solution relies on understanding the "firstobs=2" option, and on how to populate the program data vector:

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;

data want (drop=_:);
  set mydata (keep=id);
  by id;

  retain fromyear;

  merge mydata (rename=(fromyear=_curr_fromyear))
        mydata (firstobs=2 keep=fromyear rename=(fromyear=_nxt_fromyear));

  if first.id or _curr_fromyear>sum(lag(toyear),1) then fromyear=_curr_fromyear;
  if last.id or _nxt_fromyear>toyear+1;

run;

The "set mydata (keep=id); by id;" statements are there simply to provide access to the first.id and last.id dummies.
The merge of mydata with itself has the second instance of mydata with "firstobs=2". So you're reading obs 1 and 2 together, then obs 2 and 3 together, then 3 and 4, etc. But of course they have the same variables, so to avoid collisions the current record (without the firstobs=2) renames fromyear to _curr_fromyear, and the following record (with firstobs=2) renames it to _nxt_curryear.
Whenever you begin a new id, of have a record with _curr_fromyear more than a year after the prior toyear, [specified as lag(toyear)], then you're at the beginning of a time span, so assign a value to fromyear, which is retained across observations.
And whenever you encounter the end of an id, or find the next fromyear more than a year after the current toyear, then allow the observation through the subsetting IF statement for output to want.

Editted note. This will also work if periods overlap, as long as (1) data is sorted by ID/FROMYEAR, and (2) no TOYEAR precedes TOYEAR from a prior record. (i.e. data is also sorted by ID/TOYEAR).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PGStats · Posted 11-14-2018 01:10 AM

Taking care of every sort of overlap can also be done while reading the data only once with:

data myNewData2;
do until (last.id);
    set myData; by id fromYear; /* Check: fromYear is not decreasing */
    if fromYear - 1 > ty then do;
        if not missing(ty) then output;
        fy = fromYear; ty = toYear;
        end;
    else ty = max(toYear, ty);
    end;
output;
drop fromYear toYear;
rename fy=fromYear ty=toYear;
run;

PG

mkeintz · Posted 11-14-2018 12:31 PM

@PGStats

I agree that "reading" a data set twice risks excessive use of disk input activity, and therefore should be avoided, as your program does. But just to be clear, the presence of both a SET and a MERGE statement in the program I suggested should not represent much increased disk input activity, because of how the operating system satisfies disk input requests.

Because the SET and MERGE are tightly synchronized in this program, they are reading from the same disk page. So once a page of data is put in memory by the operating system, both the SET and MERGE will populate the program data vector from that memory page. So yes, there are two streams of data established in the program, but the impact on actual resource use should be minimal. I haven't tested this understanding for quite a while, but I'd be surprised if there would be much increase in disk transfers in the example I suggested.

I appreciate you raising the issue. I should have put a caveat with my program, because I'm sure that many sas users would see my code as having the same resource impact as two independent complete loops through the data - one for the SET, and one for the MERGE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Yoko · Posted 11-14-2018 08:36 AM

Thank you for your suggestion. It worked!

Ksharp · Posted 11-14-2018 08:51 AM

DATA mydata;
input id fromyear toyear;
DATALINES;
1 1965 1966
1 1967 1968
1 2000 2001
2 1967 2000
;
data temp;
 set mydata;
 by id;
 if first.id or fromyear-lag(toyear)>1 then group+1;
run;
data want;
 set temp;
 by group;
 retain _fromyear;
 if first.group then _fromyear=fromyear;
 if last.group;
 drop fromyear group;
run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away