BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
SasStatistics
Pyrite | Level 9

I have a SAS customer table (that is part of a greater data warehouse) that is updated every night. Since it is important to keep track of the customer history, I now need to create a "historic_info_customers" table  that keeps track of what values the customer has had during specific periods. A value that is current in use has the "valid_to_date" = 9999-12-31 (see code example below). 

From the following four snapshots of the customer table (that are basically "test cases"): 

/*The customer table at the "beginning". */
data customer_table_snapshot_1;
   input date :yymmdd10. customer_id segment_generation_1 $;
   format date yymmdd10.;
   datalines;
   2023-01-01 111 A
   2023-01-01 222 B
   2023-01-01 333 C
   ;
run;

/*Customers 222 (B-->C) and 333 (C-->A) have changed segment belonging.*/
data customer_table_snapshot_2;
   input date :yymmdd10. customer_id segment_generation_1 $;
   format date yymmdd10.;
   datalines;
   2023-01-07 111 A
   2023-01-07 222 C
   2023-01-07 333 A
   ;
run;

/*Now a new segment generation (column "segment_generation_2") has been added. */
/*segment_generation_1 will be the same containing its "old values".*/
/*It might be that the column (segment_generation_1) is deleted and it might also be that the*/
/*values continue to update at the same time as segment_generation_2 column exists, */
/*so good if the solution can incorporate this flexibility. */

data customer_table_snapshot_3;
   input date :yymmdd10. customer_id segment_generation_1 $  segment_generation_2 $ ;
   format date yymmdd10.;
   datalines;
   2023-01-10 111 A A
   2023-01-10 222 C C
   2023-01-10 333 A A
   ;
run;

/*All three customers have changed their segment_generation_2 value. */
data customer_table_snapshot_4;
   input date :yymmdd10. customer_id segment_generation_1 $  segment_generation_2 $ ;
   format date yymmdd10.;
   datalines;
   2023-01-15 111 A B
   2023-01-15 222 C B
   2023-01-15 333 A B
   ;
run;


I now need to create a program that would append a table and have the following look: 

/*Now I would need to make a table that keeps track of the historical values,
so I was thinking to create a table that is updated every night and should look 
like the table I provided below. 

More specifically, from the snapshot of the customer table, we should create a table that looks like the one below
*/

/*The table is sorted on customer_id. */

data historic_info_customers; 
	LENGTH customer_id 8 segment_generation $ 100 segment_belonging $ 1 valid_from_date 8 valid_to_date 8;
	input customer_id segment_generation $ segment_belonging $ valid_from_date :yymmdd10. valid_to_date :yymmdd10.;
	format valid_from_date yymmdd10. valid_to_date yymmdd10.;
	datalines;
	111 segment_generation_1 A 2023-01-01 2023-01-10
	111 segment_generation_2 A 2023-01-11 2023-01-15
	111 segment_generation_2 B 2023-01-16 9999-12-31

	222 segment_generation_1 B 2023-01-01 2023-01-07
	222 segment_generation_1 C 2023-01-08 2023-01-10
	222 segment_generation_2 C 2023-01-11 2023-01-15
	222 segment_generation_2 B 2023-01-16 9999-12-31

	333 segment_generation_1 C 2023-01-01 2023-01-07
	333 segment_generation_1 A 2023-01-08 2023-01-10
	333 segment_generation_2 A 2023-01-11 2023-01-15
	333 segment_generation_2 B 2023-01-16 9999-12-31
	;
run; 



Ps. maybe this is not the best solution, but due to legacy reason it has to be done this way. There is some flexibility in how the "historic_info_customers"
table can look like, as long as it captures what is needed (as demonstrated above by the different "test cases"). 

All advice is much appreciated. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
s_lassen
Meteorite | Level 14

One possibility is to update the historical table in place. First, create the table and give it an index:

data historic_info_customers;                                                                                                           
      LENGTH customer_id 8 segment_generation $ 100 segment_belonging $ 1 valid_from_date 8 valid_to_date 8;                            
      format valid_from_date yymmdd10. valid_to_date yymmdd10.;                                                                         
  stop;                                                                                                                                 
run;                                                                                                                                    
proc sql;                                                                                                                               
  create index idx on historic_info_customers(customer_ID,valid_to_date);                                                               
quit;

Then, to update the data using the "old" format, you can use a macro like this:

%macro update_history1(snapshot);                                                                                                       
                                                                                                                                        
data historic_info_customers;                                                                                                           
  set &snapshot;                                                                                                                        
  valid_to_date='31dec9999'd;                                                                                                           
  modify historic_info_customers key=idx;                                                                                               
  if not _iorc_ then do;                                                                                                                
    if segment_belonging ne segment_generation_1 then do;                                                                               
      valid_to_date=date-1;                                                                                                             
      replace;                                                                                                                          
      valid_from_date=date;                                                                                                             
      valid_to_date='31dec9999'd;                                                                                                       
      segment_generation='segment_generation_1';                                                                                        
      segment_belonging=segment_generation_1;                                                                                           
      output;                                                                                                                           
      end;                                                                                                                              
    end;                                                                                                                                
  else do;                                                                                                                              
    _error_=0;                                                                                                                          
    valid_from_date=date;                                                                                                               
    valid_to_date='31dec9999'd;                                                                                                         
    segment_generation='segment_generation_1';                                                                                          
    segment_belonging=segment_generation_1;                                                                                             
    output;                                                                                                                             
    end;                                                                                                                                
  run;                                                                                                                                  
%mend;                                                                                                                                  
%update_history1(customer_table_snapshot_1);                                                                                            
%update_history1(customer_table_snapshot_2);

Note that I changed the output data slightly, as it seemed illogical that the VALID_FROM_DATE is the same as DATE from the snapshot, whereas it is one day later in the following records. So instead I set the VALID_FROM_DATE equal to the snapshot DATE for all records in the history.

 

To update with the new data format, you can then create a slightly modified macro:

%macro update_history2(snapshot);                                                                                                       
data historic_info_customers;                                                                                                           
  set &snapshot;                                                                                                                        
  valid_to_date='31dec9999'd;                                                                                                           
  modify historic_info_customers key=idx;                                                                                               
  if not _iorc_ then do;                                                                                                                
    if segment_belonging ne segment_generation_2 or segment_generation ne 'segment_generation_2' then do;                               
      valid_to_date=date-1;                                                                                                             
      replace;                                                                                                                          
      valid_from_date=date;                                                                                                             
      valid_to_date='31dec9999'd;                                                                                                       
      segment_generation='segment_generation_2';                                                                                        
      segment_belonging=segment_generation_2;                                                                                           
      output;                                                                                                                           
      end;                                                                                                                              
    end;                                                                                                                                
  else do;                                                                                                                              
    _error_=0;                                                                                                                          
    valid_from_date=date;                                                                                                               
    valid_to_date='31dec9999'd;                                                                                                         
    segment_generation='segment_generation_2';                                                                                          
    segment_belonging=segment_generation_2;                                                                                             
    output;                                                                                                                             
    end;                                                                                                                                
  run;                                                                                                                                  
%mend;                                                                                                                                  
                                                                                                                                        
                                                                                                                                        
%update_history2(customer_table_snapshot_3);                                                                                            
%update_history2(customer_table_snapshot_4);

You will then have the data you wanted (except for the DATE issue), but the data will not be sorted by the Customer_ID. But as the data has an index, you can set it by Customer_ID and VALID_TO_DATE with very little performance loss.

 

You will probably want to save all the snapshots, so that you can do a rerun if something goes wrong. This example does have any checks for the order of the dates, so you will not get any error message if you load the snapshots in the wrong order, but the output will look very strange, date-wise. So it is just an example of how such things can be done.

 

It may be a good idea put the output in a database, where you can perform a roll back if things go wrong when you update.

 

View solution in original post

8 REPLIES 8
Reeza
Super User
You need to create a slowly changing dimension, there are a few standard ways, see the options/pros/cons here

https://www.1keydata.com/datawarehousing/slowly-changing-dimensions-type-4.html

I would add to your table the segement and generation (two items, not just the segment.
SasStatistics
Pyrite | Level 9
What I can see this gives an conceptual understanding and no actual code?
s_lassen
Meteorite | Level 14

One possibility is to update the historical table in place. First, create the table and give it an index:

data historic_info_customers;                                                                                                           
      LENGTH customer_id 8 segment_generation $ 100 segment_belonging $ 1 valid_from_date 8 valid_to_date 8;                            
      format valid_from_date yymmdd10. valid_to_date yymmdd10.;                                                                         
  stop;                                                                                                                                 
run;                                                                                                                                    
proc sql;                                                                                                                               
  create index idx on historic_info_customers(customer_ID,valid_to_date);                                                               
quit;

Then, to update the data using the "old" format, you can use a macro like this:

%macro update_history1(snapshot);                                                                                                       
                                                                                                                                        
data historic_info_customers;                                                                                                           
  set &snapshot;                                                                                                                        
  valid_to_date='31dec9999'd;                                                                                                           
  modify historic_info_customers key=idx;                                                                                               
  if not _iorc_ then do;                                                                                                                
    if segment_belonging ne segment_generation_1 then do;                                                                               
      valid_to_date=date-1;                                                                                                             
      replace;                                                                                                                          
      valid_from_date=date;                                                                                                             
      valid_to_date='31dec9999'd;                                                                                                       
      segment_generation='segment_generation_1';                                                                                        
      segment_belonging=segment_generation_1;                                                                                           
      output;                                                                                                                           
      end;                                                                                                                              
    end;                                                                                                                                
  else do;                                                                                                                              
    _error_=0;                                                                                                                          
    valid_from_date=date;                                                                                                               
    valid_to_date='31dec9999'd;                                                                                                         
    segment_generation='segment_generation_1';                                                                                          
    segment_belonging=segment_generation_1;                                                                                             
    output;                                                                                                                             
    end;                                                                                                                                
  run;                                                                                                                                  
%mend;                                                                                                                                  
%update_history1(customer_table_snapshot_1);                                                                                            
%update_history1(customer_table_snapshot_2);

Note that I changed the output data slightly, as it seemed illogical that the VALID_FROM_DATE is the same as DATE from the snapshot, whereas it is one day later in the following records. So instead I set the VALID_FROM_DATE equal to the snapshot DATE for all records in the history.

 

To update with the new data format, you can then create a slightly modified macro:

%macro update_history2(snapshot);                                                                                                       
data historic_info_customers;                                                                                                           
  set &snapshot;                                                                                                                        
  valid_to_date='31dec9999'd;                                                                                                           
  modify historic_info_customers key=idx;                                                                                               
  if not _iorc_ then do;                                                                                                                
    if segment_belonging ne segment_generation_2 or segment_generation ne 'segment_generation_2' then do;                               
      valid_to_date=date-1;                                                                                                             
      replace;                                                                                                                          
      valid_from_date=date;                                                                                                             
      valid_to_date='31dec9999'd;                                                                                                       
      segment_generation='segment_generation_2';                                                                                        
      segment_belonging=segment_generation_2;                                                                                           
      output;                                                                                                                           
      end;                                                                                                                              
    end;                                                                                                                                
  else do;                                                                                                                              
    _error_=0;                                                                                                                          
    valid_from_date=date;                                                                                                               
    valid_to_date='31dec9999'd;                                                                                                         
    segment_generation='segment_generation_2';                                                                                          
    segment_belonging=segment_generation_2;                                                                                             
    output;                                                                                                                             
    end;                                                                                                                                
  run;                                                                                                                                  
%mend;                                                                                                                                  
                                                                                                                                        
                                                                                                                                        
%update_history2(customer_table_snapshot_3);                                                                                            
%update_history2(customer_table_snapshot_4);

You will then have the data you wanted (except for the DATE issue), but the data will not be sorted by the Customer_ID. But as the data has an index, you can set it by Customer_ID and VALID_TO_DATE with very little performance loss.

 

You will probably want to save all the snapshots, so that you can do a rerun if something goes wrong. This example does have any checks for the order of the dates, so you will not get any error message if you load the snapshots in the wrong order, but the output will look very strange, date-wise. So it is just an example of how such things can be done.

 

It may be a good idea put the output in a database, where you can perform a roll back if things go wrong when you update.

 

SasStatistics
Pyrite | Level 9

Before I dig into details, why are you doing this?

proc sql;                                                                                                                               
  create index idx on historic_info_customers(customer_ID,valid_to_date);                                                               
quit;

When trying to sort on customer_id later, i.e.; 

proc sort data=historic_info_customers;
	by customer_id; 
run; 

I get the error:

ERROR: Indexed data set cannot be sorted in place unless the FORCE option is used.

which is unwanted behaviour if someone would like to sort the table on customer_id?

Many thanks for your help. 

s_lassen
Meteorite | Level 14

The index is createdf so that the method I showed can work. There are other ways to do it, such as completely rewriting the table every time. 

 

You should not sort the table in order to print. If you want to see it in order, just use SQL:

select * from historic_info_customers
  order by customer_ID,valid_to_date;

which should go fast, as the table is indexed.

 

You can also use it as is in other applications:

data <something>;
  set historic_info_customers;
  by customer_ID valid_to_date;
  <do something>;
run;

which will use the index to retrieve data in the correct order.

 

If you need to have access to the data in the correct order, but without specifying the BY or order BY, you can create a view:

proc sql;
  create view v_historic_info_customers as
  select * from historic_info_customers
  order by customer_ID,valid_to_date;

Which you can then use like this:

data <something>;
  merge <something else> v_historic_info_customers;
  by customer_ID;
run;

I am not saying my suggested method is the only way to do things, it is just one possible way to do it.

SasStatistics
Pyrite | Level 9

Your help is really appreciated. 

Regarding the first question, there is for me no need to sort the data inplace. So I simply do: 

data test; 
	set historic_info_customers;
run; 

proc sort data=test; 
	by customer_id; 
run; 

---------------------------------------------------------------------------------------------------------
Now comes some new questions that I hope you might know the answer to: 

What does this code do and how does it relate to the "else do" block? 

if not _iorc_ then do;

What is the purpose of this code? 

_error_=0;
Patrick
Opal | Level 21

As @Reeza already stated what you describe and show us is a slowly changing dimension type 2 (SCD 2). 

Your expectation for someone to provide ready-made fully tested code for your data free of charge is a bit much for a forum where users help users. 

 

What you need to do:

1. for any record in source work out if it's a change to an existing record or a new record

2. for new records: Insert a new record into the target table (valid_from_date set to todays date, valid_to_date to some date in the far future)

3. for updates of an existing record: A) expire the existing record (change valid_to_date to yesterdays date), B) insert a new record with valid_to_from_date set to today's date and valid_to_date to some date in the far future

 

"deletion"

You never delete a record in a SCD 2 table. If a record is no longer valid then you expire it by setting the valid_to_date to yesterdays date.

How to determine if you need to expire a record in your target table depends how you're loading.

Full load: You expire all rows in the target table that don't exist in the source table. This requires that you always get everything active for the current date you're loading (=full)

Delta load: If you only get from source the records that changed then you need either in this source table an indicator column that tells you that this is a "delete" OR you need a separate source file with these deletions.

 

I suggest you use above and give it try to implement. I'm sure this forum will be happy to help if you get stuck with design and or some coding. But I guess you will first need to put in some own work and then ask for help sharing what you've already done.

 

Some design considerations

If the changes to data per iteration is low (in percentage) then I'd go for a change in place (Proc Insert/Proc Update or eventually a data step modify statement).

If the changes to data per iteration are high (in percentage) then I'd recreate the table using a data step merge as when re-writing the table you can implement the whole SCD2 load in a single data step.

 

Backing out data after a bad data load and restoring an SCD2 table to a previous state can become rather involved (been there, done it).

I'd always create a backup of the SCD2 target table before a load and would keep as many backups as you believe it could reasonable take to detect a bad data load.

 

I'm not a big fan of SCD2 because it leads to a constantly growing table with the effect of deteriorating performance both for loading and using the table. To maintain a reasonable performance over time you will also need to design and implement some housekeeping process that moves out old history records to some backup table (like was valid more than a year ago and is not the most current record). 
You need to re-create the SCD2 table as part of this removal process because a deletion in place via Proc SQL only deletes rows logically and though wouldn't reduce the physical table size.

 

I prefer for storing history to create two tables, one for the current records and one for the history records (versions, audit, however you want to call it).

 

Reeza
Super User

And check if your org has tools that does this. We don't do any of this manually. We specify the dimension type, the data sources and it does it 🙂

 


@Patrick wrote:

As @Reeza already stated what you describe and show us is a slowly changing dimension type 2 (SCD 2). 

Your expectation for someone to provide ready-made fully tested code for your data free of charge is a bit much for a forum where users help users. 

 

What you need to do:

1. for any record in source work out if it's a change to an existing record or a new record

2. for new records: Insert a new record into the target table (valid_from_date set to todays date, valid_to_date to some date in the far future)

3. for updates of an existing record: A) expire the existing record (change valid_to_date to yesterdays date), B) insert a new record with valid_to_from_date set to today's date and valid_to_date to some date in the far future

 

"deletion"

You never delete a record in a SCD 2 table. If a record is no longer valid then you expire it by setting the valid_to_date to yesterdays date.

How to determine if you need to expire a record in your target table depends how you're loading.

Full load: You expire all rows in the target table that don't exist in the source table. This requires that you always get everything active for the current date you're loading (=full)

Delta load: If you only get from source the records that changed then you need either in this source table an indicator column that tells you that this is a "delete" OR you need a separate source file with these deletions.

 

I suggest you use above and give it try to implement. I'm sure this forum will be happy to help if you get stuck with design and or some coding. But I guess you will first need to put in some own work and then ask for help sharing what you've already done.

 

Some design considerations

If the changes to data per iteration is low (in percentage) then I'd go for a change in place (Proc Insert/Proc Update or eventually a data step modify statement).

If the changes to data per iteration are high (in percentage) then I'd recreate the table using a data step merge as when re-writing the table you can implement the whole SCD2 load in a single data step.

 

Backing out data after a bad data load and restoring an SCD2 table to a previous state can become rather involved (been there, done it).

I'd always create a backup of the SCD2 target table before a load and would keep as many backups as you believe it could reasonable take to detect a bad data load.

 

I'm not a big fan of SCD2 because it leads to a constantly growing table with the effect of deteriorating performance both for loading and using the table. To maintain a reasonable performance over time you will also need to design and implement some housekeeping process that moves out old history records to some backup table (like was valid more than a year ago and is not the most current record). 
You need to re-create the SCD2 table as part of this removal process because a deletion in place via Proc SQL only deletes rows logically and though wouldn't reduce the physical table size.

 

I prefer for storing history to create two tables, one for the current records and one for the history records (versions, audit, however you want to call it).

 


 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1437 views
  • 4 likes
  • 4 in conversation