Solved: Re: Merging data with dates between two records

WestChan · Posted 02-22-2017 07:30 PM

Hello,

I have a created a data set which includes a a date stamp app_date for every application processed.

I would like to join data from a different table which also includes a date error_date, but I unly want to join the data if the error date is before the app_date but after any prior app_date's.

Ive been racking my brain for a couple of hours and I cannot figure out the logic. Here is an example of what Im looking for:

Table1:

App_num App_Date

1 01jan16

1 08jan16

1 13jan16

Table2:

App_num Error_date Error_Code

1 09jan16 xxx

Output:

App_num App_Date Error_date Error_Code

1 01jan16

1 08jan16

1 13jan16 09jan16 xxx

Any ideas to help?

Thanks,

Nukie

LaurieF · Posted 02-22-2017 09:07 PM

Sorry - ignore that solution. It's rubbish. As I was walking @schnauzerrosie I rewrote it in my head. By turning table1 into a kind-of slowing-changing dimension, you can use SQL to create the output you want, with far fewer data contingencies. The only problem here would be if there were more than one error record within a date range.

This also, if your tables are really large, allows you to put integrity constraints on the tables, to make sure you're coping with duplicates. Note that I've created a low date for app_date_previous for the first instance per app_num - integrity constraints hate null values.

data table1;
infile cards;
attrib app_num length=8;
attrib app_date length=4 informat=date7. format=date9.;
input app_num
      app_date;
cards;
1 01jan16
1 08jan16
1 13jan16
;
run;

data table2;
infile cards;
attrib app_num length=8;
attrib error_date length=4 informat=date7. format=date9.;
attrib error_code length=$ 3;
input app_num
      error_date
      error_code;
cards;
1 09jan16 xxx
;
run;

data table1_scd;
set table1;
retain app_date_previous;
attrib app_date_previous length=4 format=date9.;
by app_num;
if first.app_num then
   app_date_previous = '1jan1900'd;
output;
app_date_previous = app_date;
run;

proc sql;
create table want as
   select table1.app_num,
          table1.app_date,
          table2.error_date,
          table2.error_code
     from table1_scd as table1
     left join table2
       on table1.app_num = table2.app_num
      and table2.error_date between table1.app_date_previous and table1.app_date
    order by table1.app_num,
             table1.app_date;
quit;

View solution in original post

LaurieF · Posted 02-22-2017 08:01 PM

This will work, but you may need to tweak it a bit, depending on how what you mean between to signify - is it inclusive at both ends or not? Also if you are going to have multiple entries per app_num in your error table, it becomes more complicated to solve.

data table1;
infile cards;
attrib app_num length=8;
attrib app_date informat=date7. format=date9.;
input app_num
      app_date;
cards;
1 01jan16
1 08jan16
1 13jan16
;
run;

data table2;
infile cards;
attrib app_num length=8;
attrib error_date informat=date7. format=date9.;
attrib error_code length=$ 3;
input app_num
      error_date
      error_code;
cards;
1 09jan16 xxx
;
run;

data want;
merge table1(in=in_mast)
      table2(in=in_error);
by app_num;
if in_mast;
attrib save_app_date format=date9.;
attrib save_error_code length=$ 3;
retain save_app_date save_error_date save_error_code;
if first.app_num 
   then do;
        save_error_date = error_date;
        save_error_code = error_code;
        call missing(error_date, error_code); /* This is why the error values are being saved */
        end;
   else if save_app_date < save_error_date < app_date then do;
           error_date = save_error_date;
           error_code = save_error_code;
           end;
save_app_date = app_date;
drop save:;
run;

LaurieF · Posted 02-22-2017 09:07 PM

Sorry - ignore that solution. It's rubbish. As I was walking @schnauzerrosie I rewrote it in my head. By turning table1 into a kind-of slowing-changing dimension, you can use SQL to create the output you want, with far fewer data contingencies. The only problem here would be if there were more than one error record within a date range.

This also, if your tables are really large, allows you to put integrity constraints on the tables, to make sure you're coping with duplicates. Note that I've created a low date for app_date_previous for the first instance per app_num - integrity constraints hate null values.

data table1;
infile cards;
attrib app_num length=8;
attrib app_date length=4 informat=date7. format=date9.;
input app_num
      app_date;
cards;
1 01jan16
1 08jan16
1 13jan16
;
run;

data table2;
infile cards;
attrib app_num length=8;
attrib error_date length=4 informat=date7. format=date9.;
attrib error_code length=$ 3;
input app_num
      error_date
      error_code;
cards;
1 09jan16 xxx
;
run;

data table1_scd;
set table1;
retain app_date_previous;
attrib app_date_previous length=4 format=date9.;
by app_num;
if first.app_num then
   app_date_previous = '1jan1900'd;
output;
app_date_previous = app_date;
run;

proc sql;
create table want as
   select table1.app_num,
          table1.app_date,
          table2.error_date,
          table2.error_code
     from table1_scd as table1
     left join table2
       on table1.app_num = table2.app_num
      and table2.error_date between table1.app_date_previous and table1.app_date
    order by table1.app_num,
             table1.app_date;
quit;

WestChan · Posted 02-22-2017 11:22 PM

@LaurieF Perfect! This was the code was looking for.

Ksharp · Posted 02-22-2017 09:37 PM

data table1;
infile cards;
attrib app_num length=8;
attrib app_date informat=date7. format=date9.;
input app_num
      app_date;
cards;
1 01jan16
1 08jan16
1 13jan16
;
run;

data table2;
infile cards;
attrib app_num length=8;
attrib error_date informat=date7. format=date9.;
attrib error_code length=$ 3;
input app_num
      error_date
      error_code;
cards;
1 09jan16 xxx
;
run;
data want;
 set table1(in=ina) table2(in=inb rename=(error_date=app_date error_code=code));
 by app_num app_date;
 lag_date=lag(app_date);
 lag_code=lag(code);
 if lag(inb)=1 then do;error_date=lag_date;error_code=lag_code;end;
 if ina;
 format error_date date9.;
 drop lag_: code;
run;

mkeintz · Posted 02-22-2017 10:07 PM

The PDV (program data vector) is your friend. In particular, you can take advantage of the fact that variables from SET statements are automatically retained until the next SET that retrieves the same vars is encountered. You want to "retain" vars from table 2 for the subsequent table1 observation:

data want (drop=date);
  set table1 (in=in1 keep=app_date  rename=(app_date=date))
      table2 (in=in2 keep=error_date rename=(error_date=date));
  by date;

  if in1=1 then set table1;
  if in2=1 then set table2;
  if in1;
run;

The first SET statement, keeping only the BY variable, is just a means of controlling the sequence of data retrieved by the 2nd and 3rd SET statements.

Editted addition: Just realized I didn't demonstrate how to incorporate this for by groups (e.g. APP_NUM):

data want (drop=date);
  set table1 (in=in1 keep=app_num app_date   rename=(app_date=date))
      table2 (in=in2 keep=app_num error_date rename=(error_date=date));
  by app_num date;

  if in1=1 then set table1;
  if in2=1 then set table2;
  if in1 then output;
  if last.app_num then call missing(of _all_);
run;

Notes:

This allows the table2 data to persist for ALL subseqent table 1 observations, until either a new table2 record or end-of-by-group. If you don't want table2 to persist beyond a single following table1 record, then change
if last.app_num then call missing(of _all_);
to
if in1 then call missing(of _all_);

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Call for Content

Classroom Training Available!