Hi Patrick,
Thanks for your input on my question! Great to hear what is being considered!
A valid question you are asking: am I talking about "keeping older versions of data" or about backup/recovery? To be clear: I am specifically talking about "keeping older versions of data" and NOT backup/recovery.
Data archiving and backup/recovery are both very valid topics in any DW implementation - but for this post, archiving data is what I am interested in.
I can add that for backup and recovery, Margaret Crevar from SAS has written this paper for SGF 2017:
http://support.sas.com/resources/papers/proceedings17/SAS0565-2017.pdf - a good reference for the dialogue with IT regarding proper handling of backup/recovery procedures for the SAS platform. Not a simple task to get right.
Back to the topic: Data Archiving
Basically it is the "data life cycle" that I am addressing here - and how to handle the archiving part:
From "birth" in source systems -> entry into DW/Analytical architecture -> to retirement (available but perhaps rarely used) -> deletion and not available any more.
There can be many reasons why archiving is relevant:
Ever increasing data volume
Trying to reduce the price per GB by moving data to a different, lower-cost disk location
Trying to ensure optimal performance in the production DW/Analytical environment by reducing the number of records
Compliance reasons: regulations, e.g. the upcoming EU GDPR (ensure you are only processing data you are legally allowed to process)
Ensuring general business requirements are met, for whatever reason may be valid
The project/solution lifetime is long enough that data archiving becomes relevant - for more short-term implementations it may not be
Then on top of the above you need to add your own policies/governance about what/when to archive.
As you mention, the price per GB is getting lower every day - and that is positive for current and future implementations - BUT I have no doubt that a new lower cost will only mean that the cost baseline moves, and it will therefore still lead to thoughts on how money can be saved. New technologies like Hadoop are also showing a price/GB that businesses see as lower than traditional SAN disk implementations.
Whether it is the SAS platform or other ETL tools being used, data archiving seems relevant. I have seen plenty of very large DW implementations using the SAS Data Management platform, DI Studio, the DataFlux tools and SPD Server - with specific SAS solutions on top.
The type of data archive that is implemented should adhere to more specific business requirements - like:
Should I be able to access data with XYZ tool?
Should the data archive be considered part of the normal production environment and therefore available 24/7/365?
Should there be backup of the archive?
Etc…
Then the actual implementation can be done in the ways you are describing - as an integral part of the normal ETL routines in the DW setup, with use of database features like partitioning.
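To make that a bit more concrete, here is a minimal SAS sketch of an archive step that could run as part of the normal ETL flow. It assumes date-stamped monthly members (SALES_YYYYMM) rather than true database partitioning, and the library names and paths are made up for illustration:

/* Minimal sketch, assuming monthly date-stamped members          */
/* (SALES_YYYYMM) in a production library. Library names and      */
/* paths are made up for illustration.                            */
libname prod "/data/dw/prod";
libname arch "/data/dw/archive";   /* lower-cost disk location */

%macro archive_member(yyyymm);
   /* copy the monthly member to the archive library... */
   proc copy in=prod out=arch memtype=data;
      select sales_&yyyymm.;
   run;
   /* ...then remove it from the production library */
   proc datasets library=prod nolist;
      delete sales_&yyyymm.;
   quit;
%mend archive_member;

/* e.g. archive January 2014 once it falls outside the retention window */
%archive_member(201401);

With SPD Server cluster tables or a partitioned database, the same move could presumably be done at the partition level instead of copying whole members - the pattern stays the same.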
Personally I have not decided which of the approaches below I like best:
Build archive functionality along with normal ETL development during DW project implementation
Build afterburner ETL/programs that can process selected relevant data for archive purposes
Manual approach where data are moved based on individual choices
I am leaning towards the second approach, since the impact on the initial implementation project could be fairly low: design and ideas could naturally be surfaced early - it seems obvious that decisions on the data model implementation will affect how data archiving can be done afterwards. You can then develop the required ETL jobs when they are needed, without impacting the normal project implementation. On the other hand, it can be difficult to get funding for something that perhaps should have been done earlier??? With plenty of DW implementations already in production, the afterburner approach would be the more realistic one for many customers right now.
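As an illustration of the afterburner idea - a standalone SAS job run after the fact, independent of the normal ETL flow. All the names here (PROD.TRANSACTIONS, ARCH.TRANSACTIONS_HIST, the TXN_DATE column, the 5-year retention) are assumptions for the sketch:

/* Minimal afterburner sketch: split production rows on a         */
/* retention date, append the old rows to the archive, and swap   */
/* the trimmed table back into place. All names are assumptions.  */
%let retention_years = 5;

data work.to_archive prod.transactions_trim;
   set prod.transactions;
   /* rows older than the retention cutoff go to the archive */
   if txn_date < intnx('year', today(), -&retention_years., 'same') then
      output work.to_archive;
   else
      output prod.transactions_trim;
run;

/* append the old rows to the (possibly already existing) archive table */
proc append base=arch.transactions_hist data=work.to_archive;
run;

/* replace the production table with the trimmed version */
proc datasets library=prod nolist;
   delete transactions;
   change transactions_trim=transactions;
quit;

The nice thing about this pattern is that it only needs to understand the production data model - which is also why the data model decisions made in the original project matter so much for how easy the afterburner becomes.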
All for now.
Please drop more ideas and thoughts for the topic!
Regards
Jan