Hi,
I have a database of individuals' addresses in SAS like this:
OBS ID addresstype effectivedate enddate postalcode
1 1 M Feb2000 Jan2003 A1A1A1
2 1 R Jan2003 Jan2020 B2B2B2
3 1 M Feb2016 Sep2016 C3C3C3
4 1 M Sep2016 Jan2020 D4D4D4
Addresses are either residential (R) or mailing (M).
For each individual and for each period of time, if I have the residential address I want to keep that and delete any mailing address in that period of time. If for a period of time there is no residential address recorded, then I will keep the mailing address. For the individual of this example I want the clean data to look like this:
OBS ID addresstype effectivedate enddate postalcode
1 1 M Feb2000 Jan2003 A1A1A1
2 1 R Jan2003 Jan2020 B2B2B2
Is there any way in SAS to create intervals of time based on two dates (effectivedate and enddate in this example) and then compare them against each other. Thanks in advance for your help.
Here is a better optimized version which requires less disk space
data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1 1 M Feb2000 Jan2003 A1A1A1
2 1 R Jan2003 Jan2020 B2B2B2
3 1 M Feb2016 Sep2016 C3C3C3
4 1 M Sep2016 Jan2020 D4D4D4
5 2 M Feb2000 Jan2003 A1A1A1
6 2 R Feb2000 Jan2003 E1E1E1
;
%let startOfTime=01JAN1990; /* Before first effectiveDate */
/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;
/* Convert dates to number of months.
Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
set have; by id;
s = intck("MONTH", "&startOfTime."d, start);
f = intck("MONTH", "&startOfTime."d, finish);
if addresstype = "M" then do;
do month = s to f;
M{month} = obs;
end;
end;
else do;
do month = s to f;
R{month} = obs;
end;
end;
end;
do month = 1 to dim(M);
M{month} = coalesce(R{month}, M{month});
end;
call missing(obs);
do month = 1 to dim(M);
if missing(obs) then do;
obs = M{month};
start = intnx("MONTH", "&startOfTime."d, month);
end;
else if missing(M{month}) then do;
finish = intnx("MONTH", "&startOfTime."d, month-1);
output;
call missing(obs);
end;
else if obs ne M{month} then do;
finish = intnx("MONTH", "&startOfTime."d, month-1);
output;
obs = M{month};
start = intnx("MONTH", "&startOfTime."d, month);
end;
end;
format start finish monyy7.;
keep ID start finish obs;
run;
/* Retrieve original address information */
proc sql;
create table want as
select
A.obs,
A.ID,
have.addressType,
A.start,
A.finish,
have.postalCode
from
A inner join
have on A.ID=have.ID and A.obs=have.obs;
quit;
proc print data=want noobs; run;
If your two dates are in SAS date informat, the INTCK function returns the number of days between the two dates.
duration=intck('day',effectivedate,enddate);
I think that you are going to want some sorting and logic to see if the shorter dates completely fit inside the longer periods.
Maybe sort by ID descending addrestype effectivedate duration. then delete observations that are overlapped.
jim
Here is a solution using arrays. Array elements represent months. This solution will accomodate any type of overlap between time periods:
data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1 1 M Feb2000 Jan2003 A1A1A1
2 1 R Jan2003 Jan2020 B2B2B2
3 1 M Feb2016 Sep2016 C3C3C3
4 1 M Sep2016 Jan2020 D4D4D4
;
%let startOfTime=01JAN1990; /* Before first effectiveDate */
/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;
/* Convert dates to number of months.
Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
set have; by id;
s = intck("MONTH", "&startOfTime."d, start);
f = intck("MONTH", "&startOfTime."d, finish);
if addresstype = "M" then do;
do month = s to f;
M{month} = obs;
end;
end;
else do;
do month = s to f;
R{month} = obs;
end;
end;
end;
do month = 1 to dim(M);
if not missing(R{month}) then do;
obs = R{month};
output;
end;
else if not missing(M{month}) then do;
obs = M{month};
output;
end;
end;
keep id month obs;
run;
proc sort data=A; by id obs month; run;
/* Find beginning and end of known address periods.
Convert back nb of months to dates. */
data B;
set A;
retain start;
by id obs;
if first.obs then
start = intnx("MONTH", "&startOfTime."d, month);
if last.obs then do;
finish = intnx("MONTH", "&startOfTime."d, month);
output;
end;
format start finish monyy7.;
drop month;
run;
/* Retrieve original address information */
proc sql;
create table want as
select
B.obs,
B.ID,
have.addressType,
B.start,
B.finish,
have.postalCode
from
B inner join
have on B.ID=have.ID and B.obs=have.obs;
quit;
proc print data=want noobs; run;
Thanks @PGStats ! This is great! However, with the big table that I have, I keep running out of space and I need to find a way to deal with that. Do you have any suggestions? Thanks again for your help.
Could you post the log (using the {i} icon above) so that we can see where the space is used up?
Hi @PGStats, Sorry for my late reply. I made my data smaller by deleting very old records that were not needed, and the code worked perfectly. But since you asked, here is the log for the error I received when I was trying to run the code on the whole big table (with dates from 1900 to 2060). Thanks again,
ERROR: Insufficient space in file WORK.A.DATA. NOTE: The DATA step has been abnormally terminated. NOTE: There were 7571824 observations read from the data set WORK.Have. WARNING: The data set WORK.A may be incomplete. When this step was stopped there were 4865641041 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 28:54.77 user cpu time 6:26.15 system cpu time 55.31 seconds memory 1823.15k OS Memory 32140.00k Timestamp 2017-10-27 09:50:11 AM Step Count 50 Switch Count 6
Here is a better optimized version which requires less disk space
data have;
length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6;
input OBS ID addresstype effectivedate enddate postalcode;
start = input(cats('01', effectivedate), date9.);
finish = input(cats('01', enddate), date9.);
format start finish monyy7.;
drop effectivedate enddate;
datalines;
1 1 M Feb2000 Jan2003 A1A1A1
2 1 R Jan2003 Jan2020 B2B2B2
3 1 M Feb2016 Sep2016 C3C3C3
4 1 M Sep2016 Jan2020 D4D4D4
5 2 M Feb2000 Jan2003 A1A1A1
6 2 R Feb2000 Jan2003 E1E1E1
;
%let startOfTime=01JAN1990; /* Before first effectiveDate */
/* Find number of months of latest enddate since &startOfTime */
proc sql noprint;
select max(intck("MONTH", "&startOfTime."d, finish))+1
into :maxMonth
from have;
quit;
/* Convert dates to number of months.
Use separate month arrays for M and R addresses. */
data A;
array M{&maxMonth.};
array R{&maxMonth.};
do until(last.id);
set have; by id;
s = intck("MONTH", "&startOfTime."d, start);
f = intck("MONTH", "&startOfTime."d, finish);
if addresstype = "M" then do;
do month = s to f;
M{month} = obs;
end;
end;
else do;
do month = s to f;
R{month} = obs;
end;
end;
end;
do month = 1 to dim(M);
M{month} = coalesce(R{month}, M{month});
end;
call missing(obs);
do month = 1 to dim(M);
if missing(obs) then do;
obs = M{month};
start = intnx("MONTH", "&startOfTime."d, month);
end;
else if missing(M{month}) then do;
finish = intnx("MONTH", "&startOfTime."d, month-1);
output;
call missing(obs);
end;
else if obs ne M{month} then do;
finish = intnx("MONTH", "&startOfTime."d, month-1);
output;
obs = M{month};
start = intnx("MONTH", "&startOfTime."d, month);
end;
end;
format start finish monyy7.;
keep ID start finish obs;
run;
/* Retrieve original address information */
proc sql;
create table want as
select
A.obs,
A.ID,
have.addressType,
A.start,
A.finish,
have.postalCode
from
A inner join
have on A.ID=have.ID and A.obs=have.obs;
quit;
proc print data=want noobs; run;
How about this one ? data have; length obs id 8 addresstype $1 effectivedate $7 enddate $7 postalcode $6; input OBS ID addresstype effectivedate enddate postalcode; start = input(cats('01', effectivedate), date9.); finish = input(cats('01', enddate), date9.); format start finish monyy7.; drop effectivedate enddate; datalines; 1 1 M Feb2000 Jan2003 A1A1A1 2 1 R Jan2003 Jan2020 B2B2B2 3 1 M Feb2016 Sep2016 C3C3C3 4 1 M Sep2016 Jan2020 D4D4D4 ; proc sql; create table key as select b.obs as obs from have as a,have as b where a.id=b.id and a.start <=b.start and a.finish >=b.finish and a.obs ne b.obs ; create table want as select * from have where obs not in (select obs from key); quit;
Thanks @Ksharp! This is a very short and efficient way to get the result on data that is actually like my example. However, my real data has some records in which everything is the same except for address type. Something like this:
5 2 M Feb2000 Jan2003 A1A1A1
6 2 R Feb2000 Jan2003 A1A1A1
And this will get rid of both records! Thank you very much.
That would be easy. After running my code, use proc sort + nodupkey to get rid of these replicated address .
Thanks @Ksharp. Sorry for my late reply. The problem is, after running your code both observations 5 and 6 will appear in the 'key' table, and will be excluded from the 'want' table which is not desirable. One thing I can do is to get rid of those semi-duplicates (they differ in addresstype) first and then run your code. Thank you,
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.