DATA Step, Macro, Functions and more

Deriving date crossings / date intersections in SCD2 tables

Reply
Super Contributor
Posts: 376

Deriving date crossings / date intersections in SCD2 tables

Note:  I've also posted this to SAS-L (SAS-L archives -- December 2012, week 2 (#8)</title><style type="text/css"><!--BODY { font-family: "...) but wanted to also post this here for a greater audience. 

Hi,

Apologies for the length of this post.  However, most of the length is in code examples which help illustrate the issues.

I need to derive "date crossings" or "date intersections" when joining several SCD Type 2 (SCD2) datasets.  Furthermore, I need to do this same thing to correct data issues in single dataset sourced from our CRM system (a self-join scenario).  (Note:  if there is a standard data warehousing term for "date crossings" or "date intersections", please let me know).

An example may help illustrate this.  Note: I often find it useful to draw these date crossings out on a sheet of paper, like a time continum number line:

* "well-formed" SCD2 data, stretching from "infinity" (01JAN1960) to "infinity" (31DEC9999), ;

* with no gaps in the date ranges, and each new fromDate = previous toDate + 1 ;

data one;

  length key1 var1 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 fromdate todate;

  datalines;

1 . 01JAN1960 14JAN2010

1 A 15JAN2010 30JUN2011

1 B 01JUL2011 31DEC9999

;

run;

data two;

  length key1 var2 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var2 fromdate todate;

  datalines;

1 . 01JAN1960 15MAR2010

1 C 16MAR2010 11NOV2011

1 D 12NOV2011 31DEC9999

;

run;

* this is the desired output.  the first record 01JAN1960 - 14JAN2010 is optional ;

data want;

  length key1 var1 var2 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 var2 fromdate todate;

  datalines;

1 A . 15JAN2010 15MAR2010

1 A C 16MAR2010 30JUN2011

1 B C 01JUL2011 11NOV2011

1 B D 12NOV2011 31DEC9999

;

run;

* this approach works to determine the date crossings ;

* as long as the SCD2 data is "well-formed" ;

proc sql _method stimer;

  create table want2 as

    select

      t1.key1, t1.var1, t2.var2

      ,max(t1.fromDate, t2.fromDate) as fromDate format=date9.

      ,min(t1.toDate,   t2.toDate)   as toDate   format=date9.

    from

      one t1

    left join

      two t2

    on

      t1.key1 = t2.key1

    where

      calculated fromDate <= calculated toDate

      /*  * uncomment this to remove the 01JAN1960 - 14JAN2010 record ;

      and

      coalescec(var1,var2) is not missing

      */

    order by

      key1, calculated fromDate

  ;

quit;

This is an example where we can take advantage of SQL's cartesian product.  With the correct use of functions (min, max) and an appropriate where clause, we get the desired results.

However, note the use of the max function to determine the (smaller) fromDate, and the min function to determine the (larger) toDate.  This causes needed records to be removed at the date "boundaries" if the date spans do not start and end with the same value within the two tables.

Here is an example.  Say we're working with customer records, and our source system does not initialize the new customer at "infinity" (01JAN1960).  Furthermore, say our source system merely closes out the record when a customer leaves the company.

* data without the "infinity" records ;

* i.e. without the initialization record, ;

* and the ending record closed out early ;

* this customer left the company on 15DEC2012 ;

data one;

  length key1 var1 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 fromdate todate;

  datalines;

1 A 15JAN2010 30JUN2011

1 B 01JUL2011 15DEC2012

;

run;

* this customer changed in some way (say address) on 12NOV2011 but is otherwise active ;

data two;

  length key1 var2 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var2 fromdate todate;

  datalines;

1 C 16MAR2010 11NOV2011

1 D 12NOV2011 31DEC9999

;

run;

proc sql _method stimer;

  create table want3 as

    select

      t1.key1, t1.var1, t2.var2

      ,max(t1.fromDate, t2.fromDate) as fromDate format=date9.

      ,min(t1.toDate,   t2.toDate)   as toDate   format=date9.

    from

      one t1

    left join

      two t2

    on

      t1.key1 = t2.key1

    where

      calculated fromDate <= calculated toDate

      /*

      and

      coalescec(var1,var2) is not missing

      */

    order by

      key1, calculated fromDate

  ;

quit;

This does NOT give us the desired results.

Luckily, these simple data issues are relatively easy to fix.

* ensure all time periods stretch from "infinitity" to "infinity" ;

* Macro to fill in date range "gaps", including boundary conditions ;

%macro FillDates(in=, out=, key=, key2=);

  proc sort data=&in out=&in._sorted;

    by &key fromDate;

  run;

  data fillDates;

    format &key fromDate toDate lag_toDate;

    set &in._sorted;

    by &key fromDate;

    temp_fromDate=fromDate;

    temp_toDate  =toDate;

    lag_toDate   =lag(toDate);

    %* do not change the order of these steps ;

    %* we have to first output the "main" record, ;

    %* then the "gap" records with missing dimension data ;

    if first.&key then lag_toDate=.;

    output;

    %* fill in date range gaps ;

    call missing(of &key2);

    if (not first.&key) then if (temp_fromDate ne lag_toDate+1) then do;

      fromDate=lag_toDate+1;

      toDate=temp_fromDate-1;

      output;

    end;

    %* create "negative infinity" date boundary record ;

    if (first.&key) then do;

      if temp_fromDate gt "01JAN1960"d then do;

        fromDate="01JAN1960"d;

        toDate=temp_fromDate-1;

        output;

      end;

    end;

    %* create "positive infinity" date boundary record ;

    if (last.&key) then do;

      if temp_toDate lt "31DEC9999"d then do;

        fromDate=temp_toDate+1;

        toDate="31DEC9999"d;

        output;

      end;

    end;

    format lag_: temp_: date9.;

    drop lag_: temp_:;

  run;

  * now create the final dataset in sorted order ;

  proc sort data=fillDates out=&out;

    by &key fromDate;

  run;

%mend;

options mprint;

%FillDates(in=one, out=one_filled, key=key1, key2=var1)

%FillDates(in=two, out=two_filled, key=key1, key2=var2)

proc sql _method stimer;

  create table want4 as

    select

      t1.key1, t1.var1, t2.var2

      ,max(t1.fromDate, t2.fromDate) as fromDate format=date9.

      ,min(t1.toDate,   t2.toDate)   as toDate   format=date9.

    from

      one_filled t1

    left join

      two_filled t2

    on

      t1.key1 = t2.key1

    where

      calculated fromDate <= calculated toDate

      /*

      and

      coalescec(var1,var2) is not missing

      */

    order by

      key1, calculated fromDate

  ;

quit;

This again gives me the desired results.

So far, so good...

HOWEVER, I'm dealing with a few data quality issues, for which I can't come up with a solution.

#1) Source system does not increment the fromDate/toDate correctly:

data one;

  length key1 var1 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 fromdate todate;

  datalines;

1 . 01JAN1960 14JAN2010

1 A 14JAN2010 30JUN2011

1 B 30JUN2011 31DEC9999

;

run;

data two;

  length key1 var2 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var2 fromdate todate;

  datalines;

1 . 01JAN1960 15MAR2010

1 C 15MAR2010 11NOV2011

1 D 11NOV2011 31DEC9999

;

run;

data want;

  length key1 var1 var2 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 var2 fromdate todate;

  datalines;

1 A . 15JAN2010 15MAR2010

1 A C 16MAR2010 30JUN2011

1 B C 01JUL2011 11NOV2011

1 B D 12NOV2011 31DEC9999

;

run;

I think this issue can be addressed by augmenting / modifying the functionality of %FillDates above.  Do you agree?  Or can you think of a better approach?

#2) Source system does not correctly close out old records / create new records when there is a change ;

* note the "A" record goes to infinity, instead of properly closing out an SCD2 data change ;

data one;

  length key1 var1 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 fromdate todate;

  datalines;

1 A 14JAN2010 31DEC9999

1 B 30JUN2011 11NOV2011

;

run;

data want;

  length key1 var1 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 fromdate todate;

  datalines;

1 . 01JAN1960 13JAN2010 

1 A 14JAN2010 29JUN2011

1 B 30JUN2011 11NOV2011

1 A 12NOV2011 31DEC9999

;

run;

In this scenario, the issue is within the single dataset itself (a self-join scenario).  In this situation, our source system does not properly close out the "B" record, and instead just "reactivates" the "A" record.  This of course causes problems when you need a snapshot of the customer situation at any point in time, plus additional issues when this table is joined with other SCD 2 datasets.

Keep in mind these are simplified examples; the "real" problem might not be with contiguous records.  A quick example:

data one;

  length key1 var1 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 fromdate todate;

  datalines;

1 A 14JAN2010 15FEB2011

1 B 15FEB2011 24SEP2011

1 C 24SEP2011 31DEC9999

1 D 05MAR2012 31AUG2012

1 E 31AUG2012 10DEC2012

;

run;

data want;

  length key1 var1 $1 fromdate todate 8;

  informat fromdate todate date9.;

  format fromdate todate date9.;

  input key1 var1 fromdate todate;

  datalines;

1 . 01JAN1960 13JAN2010 

1 A 14JAN2010 15FEB2011

1 B 16FEB2011 24SEP2011

1 C 25SEP2011 04MAR2012

1 D 05MAR2012 31AUG2012

1 E 01SEP2012 10DEC2012

1 C 11DEC2012 31DEC9999

;

run;

In this made up example, the customer moved 6 times, and moved back to a previous address "C" twice (their parent's house, whatever), where he/she currently resides.  (Plus it's not incrementing the from/to dates properly).

I'm not sure how to tackle this issue?  Perhaps sort on toDate, then something like:

if toDate = "31DEC9999" then if not last.key1 then do; bad_data_flag=1; further processing??? ; end;

Any thoughts?

Thanks,

Scott

Super User
Posts: 9,671

Re: Deriving date crossings / date intersections in SCD2 tables

Hoooo, It is So long . I think it should be easy. if you split the range into every single date , and transpose it from wide to long. You will get it.

Assuming today is infinite date '31dec9999'd to avoid a large dataset, you can change it back after you get the result.

data one;
  length key1 var1 $1 fromdate todate 8;
  informat fromdate todate date9.;
  format fromdate todate date9.;
  input key1 var1 fromdate todate;
  datalines;
1 . 01JAN1960 14JAN2010
1 A 15JAN2010 30JUN2011
1 B 01JUL2011 31DEC9999
;
run;

data two;
  length key1 var2 $1 fromdate todate 8;
  informat fromdate todate date9.;
  format fromdate todate date9.;
  input key1 var2 fromdate todate;
  datalines;
1 . 01JAN1960 15MAR2010
1 C 16MAR2010 11NOV2011
1 D 12NOV2011 31DEC9999
;
run;
data one; 
 set one;
 if todate='31DEC9999'd then todate='12dec2012'd;
run;
data two; 
 set two;
 if todate='31DEC9999'd then todate='12dec2012'd;
run;

data temp1;
 set one;
 a+1;
 do date1=fromdate to todate;
  output;
 end;
 format date1 date9.;
 drop fromdate  todate ;
 run;
 
 data temp2;
 set two;
 b+1;
 do date2=fromdate to todate;
  output;
 end;
 format date2 date9.;
  drop fromdate  todate ;
 run;
data temp;
  merge temp1 temp2;
run;

data temp;
 set temp;
 if a ne lag(a) or b ne lag(b) then group+1;
run;
data want;
 set temp;
 by group;
 retain temp;
 if first.group then temp=date1;
 if last.group then do;date1=temp;output;end;
 drop group a b temp;
run;



Ksharp

Ask a Question
Discussion stats
  • 1 reply
  • 361 views
  • 0 likes
  • 2 in conversation