Creating time intervals based on two date variables

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Creating time intervals based on two date variables

Hi,
I have a database of individuals' addresses in SAS like this: 

 

OBS ID addresstype effectivedate enddate postalcode
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003       Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016   C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4


Addresses are either residential (R) or mailing (M).
For each individual and for each period of time, if I have the residential address I want to keep that and delete any mailing address in that period of time. If for a period of time there is no residential address recorded, then I will keep the mailing address. For the individual of this example I want the clean data to look like this:

 

OBS ID addresstype effectivedate enddate postalcode
1       1        M             Feb2000      Jan2003    A1A1A1 
2       1        R             Jan2003       Jan2020    B2B2B2

Is there any way in SAS to create intervals of time based on two dates (effectivedate and enddate in this example) and then compare them against each other. Thanks in advance for your help.


Accepted Solutions
Solution
‎10-30-2017 02:15 PM
Esteemed Advisor
Posts: 5,113

Re: Creating time intervals based on two date variables

Here is a better optimized version which requires less disk space

 

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    E1E1E1
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;

do month = 1 to dim(M);
    M{month} = coalesce(R{month}, M{month});
    end;

call missing(obs);
do month = 1 to dim(M);
    if missing(obs) then do;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    else if missing(M{month}) then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        call missing(obs);
        end;
    else if obs ne M{month} then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    end;

format start finish monyy7.;
keep ID start finish obs;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    A.obs,
    A.ID,
    have.addressType,
    A.start,
    A.finish,
    have.postalCode
from 
    A inner join
    have on A.ID=have.ID and A.obs=have.obs;
quit;

proc print data=want noobs; run;
PG

View solution in original post


All Replies
Frequent Contributor
Posts: 102

Re: Creating time intervals based on two date variabl

If your two dates are in SAS date informat, the INTCK function returns the number of days between the two dates.

duration=intck('day',effectivedate,enddate);

 

I think that you are going to want some sorting and logic to see if the shorter dates completely fit inside the longer periods.

 

Maybe sort by ID descending addrestype effectivedate duration. then delete observations that are overlapped.

 

jim

Esteemed Advisor
Posts: 5,113

Re: Creating time intervals based on two date variables

Here is a solution using arrays. Array elements represent months. This solution will accomodate any type of overlap between time periods:

 

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;
do month = 1 to dim(M);
    if not missing(R{month}) then do;
        obs = R{month};
        output;
        end;
    else if not missing(M{month}) then do;
        obs = M{month};
        output;
        end;
    end;
keep id month obs;
run;

proc sort data=A; by id obs month; run;

/* Find beginning and end of known address periods. 
   Convert back nb of months to dates. */
data B;
set A;
retain start;
by id obs;
if first.obs then 
    start = intnx("MONTH", "&startOfTime."d, month);
if last.obs then do;
    finish = intnx("MONTH", "&startOfTime."d, month);
    output;
    end;
format start finish monyy7.;
drop month;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    B.obs,
    B.ID,
    have.addressType,
    B.start,
    B.finish,
    have.postalCode
from 
    B inner join
    have on B.ID=have.ID and B.obs=have.obs;
quit;

proc print data=want noobs; run;
PG
Occasional Contributor
Posts: 6

Re: Creating time intervals based on two date variables

Thanks @PGStats ! This is great! However, with the big table that I have, I keep running out of space and I need to find a way to deal with that. Do you have any suggestions? Thanks again for your help.

Esteemed Advisor
Posts: 5,113

Re: Creating time intervals based on two date variables

Could you post the log (using the {i} icon above) so that we can see where the space is used up?

PG
Occasional Contributor
Posts: 6

Re: Creating time intervals based on two date variables

Hi @PGStats, Sorry for my late reply. I made my data smaller by deleting very old records that were not needed, and the code worked perfectly. But since you asked, here is the log for the error I received when I was trying to run the code on the whole big table (with dates from 1900 to 2060). Thanks again,

 

 

ERROR: Insufficient space in file WORK.A.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 7571824 observations read from the data set WORK.Have.
WARNING: The data set WORK.A may be incomplete.  When this step was stopped there were
         4865641041 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           28:54.77
      user cpu time       6:26.15
      system cpu time     55.31 seconds
      memory              1823.15k
      OS Memory           32140.00k
      Timestamp           2017-10-27 09:50:11 AM
      Step Count                        50  Switch Count  6

 

Solution
‎10-30-2017 02:15 PM
Esteemed Advisor
Posts: 5,113

Re: Creating time intervals based on two date variables

Here is a better optimized version which requires less disk space

 

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    E1E1E1
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;

do month = 1 to dim(M);
    M{month} = coalesce(R{month}, M{month});
    end;

call missing(obs);
do month = 1 to dim(M);
    if missing(obs) then do;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    else if missing(M{month}) then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        call missing(obs);
        end;
    else if obs ne M{month} then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    end;

format start finish monyy7.;
keep ID start finish obs;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    A.obs,
    A.ID,
    have.addressType,
    A.start,
    A.finish,
    have.postalCode
from 
    A inner join
    have on A.ID=have.ID and A.obs=have.obs;
quit;

proc print data=want noobs; run;
PG
Super User
Posts: 10,313

Re: Creating time intervals based on two date variables

How about this one ?



data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
;

proc sql;
create table key as 
select b.obs as obs
 from have as a,have as b
  where a.id=b.id and a.start <=b.start and a.finish >=b.finish and a.obs ne b.obs ;
  
create table want as
select *
 from have
  where obs not in (select obs from key);
quit;

Occasional Contributor
Posts: 6

Re: Creating time intervals based on two date variables

Thanks @Ksharp! This is a very short and efficient way to get the result on data that is actually like my example. However, my real data has some records in which everything is the same except for address type. Something like this:

5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    A1A1A1

 

And this will get rid of both records! Thank you very much.

Super User
Posts: 10,313

Re: Creating time intervals based on two date variables

That would be easy.
After running my code, use
proc sort + nodupkey 
to get rid of these replicated address .

Occasional Contributor
Posts: 6

Re: Creating time intervals based on two date variables

Thanks @Ksharp. Sorry for my late reply.  The problem is, after running your code both observations 5 and 6 will appear in the 'key' table, and will be excluded from the 'want' table which is not desirable. One thing I can do is to get rid of those semi-duplicates (they differ in addresstype) first and then run your code. Thank you,

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 252 views
  • 3 likes
  • 4 in conversation