Solved: Re: Creating time intervals based on two date variables

Mahtalii · Posted 10-23-2017 04:11 PM

Hi,
I have a database of individuals' addresses in SAS like this:

OBS ID addresstype effectivedate enddate postalcode
1 1 M Feb2000 Jan2003 A1A1A1
2 1 R Jan2003 Jan2020 B2B2B2
3 1 M Feb2016 Sep2016 C3C3C3
4 1 M Sep2016 Jan2020 D4D4D4

Addresses are either residential (R) or mailing (M).
For each individual and for each period of time, if I have the residential address I want to keep that and delete any mailing address in that period of time. If for a period of time there is no residential address recorded, then I will keep the mailing address. For the individual of this example I want the clean data to look like this:

OBS ID addresstype effectivedate enddate postalcode
1 1 M Feb2000 Jan2003 A1A1A1
2 1 R Jan2003 Jan2020 B2B2B2

Is there any way in SAS to create intervals of time based on two dates (effectivedate and enddate in this example) and then compare them against each other. Thanks in advance for your help.

PGStats · Posted 10-26-2017 10:38 PM

Here is a better optimized version which requires less disk space

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    E1E1E1
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;

do month = 1 to dim(M);
    M{month} = coalesce(R{month}, M{month});
    end;

call missing(obs);
do month = 1 to dim(M);
    if missing(obs) then do;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    else if missing(M{month}) then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        call missing(obs);
        end;
    else if obs ne M{month} then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    end;

format start finish monyy7.;
keep ID start finish obs;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    A.obs,
    A.ID,
    have.addressType,
    A.start,
    A.finish,
    have.postalCode
from 
    A inner join
    have on A.ID=have.ID and A.obs=have.obs;
quit;

proc print data=want noobs; run;

PG

View solution in original post

Jim_G · Posted 10-23-2017 05:07 PM

If your two dates are in SAS date informat, the INTCK function returns the number of days between the two dates.

duration=intck('day',effectivedate,enddate);

I think that you are going to want some sorting and logic to see if the shorter dates completely fit inside the longer periods.

Maybe sort by ID descending addrestype effectivedate duration. then delete observations that are overlapped.

jim

PGStats · Posted 10-24-2017 01:43 AM

Here is a solution using arrays. Array elements represent months. This solution will accomodate any type of overlap between time periods:

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;
do month = 1 to dim(M);
    if not missing(R{month}) then do;
        obs = R{month};
        output;
        end;
    else if not missing(M{month}) then do;
        obs = M{month};
        output;
        end;
    end;
keep id month obs;
run;

proc sort data=A; by id obs month; run;

/* Find beginning and end of known address periods. 
   Convert back nb of months to dates. */
data B;
set A;
retain start;
by id obs;
if first.obs then 
    start = intnx("MONTH", "&startOfTime."d, month);
if last.obs then do;
    finish = intnx("MONTH", "&startOfTime."d, month);
    output;
    end;
format start finish monyy7.;
drop month;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    B.obs,
    B.ID,
    have.addressType,
    B.start,
    B.finish,
    have.postalCode
from 
    B inner join
    have on B.ID=have.ID and B.obs=have.obs;
quit;

proc print data=want noobs; run;

PG

Mahtalii · Posted 10-26-2017 01:45 PM

Thanks @PGStats ! This is great! However, with the big table that I have, I keep running out of space and I need to find a way to deal with that. Do you have any suggestions? Thanks again for your help.

PGStats · Posted 10-26-2017 02:55 PM

Could you post the log (using the {i} icon above) so that we can see where the space is used up?

PG

Mahtalii · Posted 10-30-2017 02:15 PM

Hi @PGStats, Sorry for my late reply. I made my data smaller by deleting very old records that were not needed, and the code worked perfectly. But since you asked, here is the log for the error I received when I was trying to run the code on the whole big table (with dates from 1900 to 2060). Thanks again,

ERROR: Insufficient space in file WORK.A.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 7571824 observations read from the data set WORK.Have.
WARNING: The data set WORK.A may be incomplete.  When this step was stopped there were
         4865641041 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           28:54.77
      user cpu time       6:26.15
      system cpu time     55.31 seconds
      memory              1823.15k
      OS Memory           32140.00k
      Timestamp           2017-10-27 09:50:11 AM
      Step Count                        50  Switch Count  6

PGStats · Posted 10-26-2017 10:38 PM

Here is a better optimized version which requires less disk space

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    E1E1E1
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;

do month = 1 to dim(M);
    M{month} = coalesce(R{month}, M{month});
    end;

call missing(obs);
do month = 1 to dim(M);
    if missing(obs) then do;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    else if missing(M{month}) then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        call missing(obs);
        end;
    else if obs ne M{month} then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    end;

format start finish monyy7.;
keep ID start finish obs;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    A.obs,
    A.ID,
    have.addressType,
    A.start,
    A.finish,
    have.postalCode
from 
    A inner join
    have on A.ID=have.ID and A.obs=have.obs;
quit;

proc print data=want noobs; run;

PG

Ksharp · Posted 10-24-2017 10:54 AM

How about this one ?



data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
;

proc sql;
create table key as 
select b.obs as obs
 from have as a,have as b
  where a.id=b.id and a.start <=b.start and a.finish >=b.finish and a.obs ne b.obs ;
  
create table want as
select *
 from have
  where obs not in (select obs from key);
quit;

Mahtalii · Posted 10-26-2017 05:01 PM

Thanks @Ksharp! This is a very short and efficient way to get the result on data that is actually like my example. However, my real data has some records in which everything is the same except for address type. Something like this:

5 2 M Feb2000 Jan2003 A1A1A1
6 2 R Feb2000 Jan2003 A1A1A1

And this will get rid of both records! Thank you very much.

Ksharp · Posted 10-27-2017 09:00 AM

That would be easy.
After running my code, use
proc sort + nodupkey 
to get rid of these replicated address .

Mahtalii · Posted 10-30-2017 01:59 PM

Thanks @Ksharp. Sorry for my late reply. The problem is, after running your code both observations 5 and 6 will appear in the 'key' table, and will be excluded from the 'want' table which is not desirable. One thing I can do is to get rid of those semi-duplicates (they differ in addresstype) first and then run your code. Thank you,

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away