BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mahtalii
Fluorite | Level 6

Hi,
I have a database of individuals' addresses in SAS like this: 

 

OBS ID addresstype effectivedate enddate postalcode
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003       Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016   C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4


Addresses are either residential (R) or mailing (M).
For each individual and for each period of time, if I have the residential address I want to keep that and delete any mailing address in that period of time. If for a period of time there is no residential address recorded, then I will keep the mailing address. For the individual of this example I want the clean data to look like this:

 

OBS ID addresstype effectivedate enddate postalcode
1       1        M             Feb2000      Jan2003    A1A1A1 
2       1        R             Jan2003       Jan2020    B2B2B2

Is there any way in SAS to create intervals of time based on two dates (effectivedate and enddate in this example) and then compare them against each other. Thanks in advance for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Here is a better optimized version which requires less disk space

 

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    E1E1E1
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;

do month = 1 to dim(M);
    M{month} = coalesce(R{month}, M{month});
    end;

call missing(obs);
do month = 1 to dim(M);
    if missing(obs) then do;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    else if missing(M{month}) then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        call missing(obs);
        end;
    else if obs ne M{month} then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    end;

format start finish monyy7.;
keep ID start finish obs;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    A.obs,
    A.ID,
    have.addressType,
    A.start,
    A.finish,
    have.postalCode
from 
    A inner join
    have on A.ID=have.ID and A.obs=have.obs;
quit;

proc print data=want noobs; run;
PG

View solution in original post

10 REPLIES 10
Jim_G
Pyrite | Level 9

If your two dates are in SAS date informat, the INTCK function returns the number of days between the two dates.

duration=intck('day',effectivedate,enddate);

 

I think that you are going to want some sorting and logic to see if the shorter dates completely fit inside the longer periods.

 

Maybe sort by ID descending addrestype effectivedate duration. then delete observations that are overlapped.

 

jim

PGStats
Opal | Level 21

Here is a solution using arrays. Array elements represent months. This solution will accomodate any type of overlap between time periods:

 

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;
do month = 1 to dim(M);
    if not missing(R{month}) then do;
        obs = R{month};
        output;
        end;
    else if not missing(M{month}) then do;
        obs = M{month};
        output;
        end;
    end;
keep id month obs;
run;

proc sort data=A; by id obs month; run;

/* Find beginning and end of known address periods. 
   Convert back nb of months to dates. */
data B;
set A;
retain start;
by id obs;
if first.obs then 
    start = intnx("MONTH", "&startOfTime."d, month);
if last.obs then do;
    finish = intnx("MONTH", "&startOfTime."d, month);
    output;
    end;
format start finish monyy7.;
drop month;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    B.obs,
    B.ID,
    have.addressType,
    B.start,
    B.finish,
    have.postalCode
from 
    B inner join
    have on B.ID=have.ID and B.obs=have.obs;
quit;

proc print data=want noobs; run;
PG
Mahtalii
Fluorite | Level 6

Thanks @PGStats ! This is great! However, with the big table that I have, I keep running out of space and I need to find a way to deal with that. Do you have any suggestions? Thanks again for your help.

PGStats
Opal | Level 21

Could you post the log (using the {i} icon above) so that we can see where the space is used up?

PG
Mahtalii
Fluorite | Level 6

Hi @PGStats, Sorry for my late reply. I made my data smaller by deleting very old records that were not needed, and the code worked perfectly. But since you asked, here is the log for the error I received when I was trying to run the code on the whole big table (with dates from 1900 to 2060). Thanks again,

 

 

ERROR: Insufficient space in file WORK.A.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 7571824 observations read from the data set WORK.Have.
WARNING: The data set WORK.A may be incomplete.  When this step was stopped there were
         4865641041 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           28:54.77
      user cpu time       6:26.15
      system cpu time     55.31 seconds
      memory              1823.15k
      OS Memory           32140.00k
      Timestamp           2017-10-27 09:50:11 AM
      Step Count                        50  Switch Count  6

 

PGStats
Opal | Level 21

Here is a better optimized version which requires less disk space

 

data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    E1E1E1
;

%let startOfTime=01JAN1990;  /* Before first effectiveDate */

/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;

/* Convert dates to number of months.
   Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
    set have; by id;
    s = intck("MONTH", "&startOfTime."d, start);
    f = intck("MONTH", "&startOfTime."d, finish);
    if addresstype = "M" then do;
        do month = s to f;
            M{month} = obs;
            end;
        end;
    else do;
        do month = s to f;
            R{month} = obs;
            end;
        end;
    end;

do month = 1 to dim(M);
    M{month} = coalesce(R{month}, M{month});
    end;

call missing(obs);
do month = 1 to dim(M);
    if missing(obs) then do;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    else if missing(M{month}) then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        call missing(obs);
        end;
    else if obs ne M{month} then do;
        finish = intnx("MONTH", "&startOfTime."d, month-1);
        output;
        obs = M{month};
        start = intnx("MONTH", "&startOfTime."d, month);
        end;
    end;

format start finish monyy7.;
keep ID start finish obs;
run;

/* Retrieve original address information */
proc sql;
create table want as
select 
    A.obs,
    A.ID,
    have.addressType,
    A.start,
    A.finish,
    have.postalCode
from 
    A inner join
    have on A.ID=have.ID and A.obs=have.obs;
quit;

proc print data=want noobs; run;
PG
Ksharp
Super User
How about this one ?



data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1       1        M             Feb2000      Jan2003    A1A1A1
2       1        R             Jan2003      Jan2020    B2B2B2
3       1        M             Feb2016      Sep2016    C3C3C3
4       1        M             Sep2016      Jan2020    D4D4D4
;

proc sql;
create table key as 
select b.obs as obs
 from have as a,have as b
  where a.id=b.id and a.start <=b.start and a.finish >=b.finish and a.obs ne b.obs ;
  
create table want as
select *
 from have
  where obs not in (select obs from key);
quit;

Mahtalii
Fluorite | Level 6

Thanks @Ksharp! This is a very short and efficient way to get the result on data that is actually like my example. However, my real data has some records in which everything is the same except for address type. Something like this:

5       2        M             Feb2000      Jan2003    A1A1A1
6       2        R             Feb2000      Jan2003    A1A1A1

 

And this will get rid of both records! Thank you very much.

Ksharp
Super User
That would be easy.
After running my code, use
proc sort + nodupkey 
to get rid of these replicated address .

Mahtalii
Fluorite | Level 6

Thanks @Ksharp. Sorry for my late reply.  The problem is, after running your code both observations 5 and 6 will appear in the 'key' table, and will be excluded from the 'want' table which is not desirable. One thing I can do is to get rid of those semi-duplicates (they differ in addresstype) first and then run your code. Thank you,

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1181 views
  • 3 likes
  • 4 in conversation