05-01-2017 07:01 AM
I am working on a case where we are looking at establishing an archive function on the "back" of a SAS DW implementation. I have been trying to find resources on data archiving and SAS, but what I have found so far has been very limited. During many DW projects (scoping, estimating, etc.) data archiving has rarely been discussed, but in this day and age it seems increasingly obvious that it probably needs to be a proactive, required choice in a DW implementation. And now to the question:
What are you doing out there in the SAS world? Are you adding extra disk technology (e.g. Hadoop), creating post-processes, or buying special DW appliance technology when trying to cater for large and expanding data volumes in your environments?
Please drop me your thoughts for an open dialogue here.
Thanks in advance
05-02-2017 06:32 AM - edited 05-02-2017 06:40 AM
Do you mean by archiving keeping old versions of data in a separate area or are you talking about backup and recovery?
backup and recovery
I normally make this someone else's "problem". The data is either in a database or in (SAS) files on a file system, and all reasonably mature organizations already have processes in place. My role is then just to define the volume of data for backup and recovery, and in bigger organisations also the level of data security and SLA required (as this can make a huge difference to cost).
I found the following definition:
"Data archiving is the process of moving data that is no longer actively used to a separate storage device for long-term retention. Archive data consists of older data that is still important to the organization and may be needed for future reference, as well as data that must be retained for regulatory compliance."
I guess the question here is a bit about the "why", the cost/benefit, and if/how often - and with what acceptable delay - you might still require access to the archived data.
What I've done myself in the past, or have seen others doing:
1. For transactional data (fact tables) stored in Oracle: Create monthly partitions and implement an Oracle process which compresses such partitions after a while (there are different levels of compression), moves old partitions to cheaper storage, and in the end deletes them (other databases allow for comparable approaches; Teradata, for example, can have processes in place which move data that is no longer "hot" to cheaper and slower storage).
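To illustrate that partition life cycle, here is a rough Python sketch that generates the Oracle maintenance statements based on partition age. The table name, tablespace name and age thresholds are all made up, and the exact DDL would need checking against your Oracle version - treat it as a sketch of the idea, not a finished process:

```python
from datetime import date

def partition_actions(partitions, today, compress_after=6,
                      move_after=24, drop_after=60):
    """Given {partition_name: first_day_of_month}, emit Oracle maintenance
    DDL based on partition age in months. Thresholds are illustrative."""
    stmts = []
    for name, start in sorted(partitions.items()):
        age = (today.year - start.year) * 12 + (today.month - start.month)
        if age >= drop_after:          # beyond retention: drop entirely
            stmts.append(f"ALTER TABLE sales DROP PARTITION {name}")
        elif age >= move_after:        # cold: move to cheap tablespace
            stmts.append(f"ALTER TABLE sales MOVE PARTITION {name} "
                         f"TABLESPACE archive_ts COMPRESS")
        elif age >= compress_after:    # warm: compress in place
            stmts.append(f"ALTER TABLE sales MOVE PARTITION {name} COMPRESS")
    return stmts
```

A scheduler (or a DI Studio job) could run this monthly and hand the statements to the DBA process for execution.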
2. For dimensional data: Move the data to a history table. That's an expensive process, as you need to implement a dedicated set of archiving ETL jobs to determine which data can be moved, and you also need to maintain full referential integrity in your database. That can be costly to implement and is not always worth it. Alternatively, you can go for a data model where you keep history (old versions of data) in a _Version table; these tables can live on cheaper storage and/or be compressed (but are then slower to access).
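The core of such an archiving ETL job is small: move closed versions into the history table and delete them from the current one, inside a single transaction. A minimal sketch using SQLite (table and column names invented; a real job would also verify referential integrity before deleting):

```python
import sqlite3

def archive_old_versions(conn, cutoff):
    """Move closed dimension versions (valid_to before cutoff) into a
    history table and remove them from the current table, atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO customer_dim_hist "
                     "SELECT * FROM customer_dim WHERE valid_to < ?",
                     (cutoff,))
        conn.execute("DELETE FROM customer_dim WHERE valid_to < ?",
                     (cutoff,))

# tiny in-memory demo
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (id INT, name TEXT, valid_to TEXT)")
conn.execute("CREATE TABLE customer_dim_hist (id INT, name TEXT, valid_to TEXT)")
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)",
                 [(1, "a", "2010-01-01"),   # closed long ago -> history
                  (2, "b", "2016-06-30"),   # recent -> stays
                  (3, "c", "9999-12-31")])  # current version -> stays
archive_old_versions(conn, "2015-01-01")
```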
3. At a few places where it was about data marts implemented as monthly full snapshots stored as SAS files: I went for a process which, after a few months, zipped these SAS files (reducing storage space by up to 90%) and then deleted zip files older than 2 years.
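That snapshot process can be as simple as a scheduled script like the following Python sketch - the folder layout, file naming and the 3-month/2-year thresholds are assumptions you would adapt to your own retention policy:

```python
import time
import zipfile
from pathlib import Path

ARCHIVE_AFTER_DAYS = 90        # zip snapshot files older than ~3 months
DELETE_ZIP_AFTER_DAYS = 730    # drop zip archives older than ~2 years

def archive_snapshots(folder):
    """Zip .sas7bdat snapshots past the archive age, then delete zip
    archives past the retention age."""
    now = time.time()
    for f in Path(folder).glob("*.sas7bdat"):
        if now - f.stat().st_mtime > ARCHIVE_AFTER_DAYS * 86400:
            with zipfile.ZipFile(f.with_suffix(".zip"), "w",
                                 zipfile.ZIP_DEFLATED) as zf:
                zf.write(f, f.name)     # store under the bare file name
            f.unlink()                  # remove the original SAS file
    for z in Path(folder).glob("*.zip"):
        if now - z.stat().st_mtime > DELETE_ZIP_AFTER_DAYS * 86400:
            z.unlink()                  # past retention: delete for good
```

A freshly created zip has a current timestamp, so it only becomes a deletion candidate once it has aged past the retention threshold itself.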
If you don't have to comply with regulatory requirements, then I'd go for an approach where you design and implement some sort of archiving for fast-growing transactional data (which is not that hard), but make a good cost/benefit analysis for lower-volume and much slower growing/changing referential and dimensional data.
I've done quite a bit of DI work, but I've never been involved in a project where SAS was used as the ETL tool for a large EDW. The projects I've been involved in were more about SAS solution-specific or analytical DWs - and these tend to change faster than an EDW.
So besides growing data volumes I'd also ask: How long is the current data model going to survive, and when do I expect the next data migration to a new model? If that's below 10 years, then it may not be the wrong call to spend the "archiving" effort as part of data migration, by simply not migrating everything. And then: storage space is still getting cheaper and cheaper, and I/O faster and faster. So here too: are the expected data growth and the cost of archiving processes (design, implementation, operations) worth it?
I'm sure there are many more valid options out there and what's right will "depend".
05-02-2017 08:36 AM
Thanks for your input to my question! Great to hear what is being considered!
A valid question you are asking: am I talking about "keeping older versions of data" or about backup/recovery? ->> I am specifically talking about "keeping older versions of data" and NOT backup/recovery.
Both data archiving and backup/recovery are very valid topics in any DW implementation - but for this post, archiving data is what I am interested in.
I can add that for backup and recovery - Margaret Crevar from SAS has created this paper for SGF2017:
http://support.sas.com/resources/papers/proceedings17/SAS0565-2017.pdf - good reference for the dialogue with IT regarding proper handling of backup/recovery procedures for the SAS platform - not a simple task to get right.
Back to the topic: Data Archiving
Basically it is the "data life cycle" that I am addressing here - and how to handle the archiving part:
From "birth" in source systems -> entry into DW/Analytical architecture -> to retirement (available but perhaps rarely used) -> deletion and not available any more.
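That life cycle can be made concrete as a small classification rule driven by data age. The stage names follow the description above; the thresholds are purely illustrative and would come from your own retention policy:

```python
def lifecycle_stage(age_months, retire_after=36, delete_after=120):
    """Map a dataset's age to a life-cycle stage. Thresholds are
    illustrative; a real retention policy would define them."""
    if age_months >= delete_after:
        return "deleted"       # no longer available
    if age_months >= retire_after:
        return "retired"       # archived: available but rarely used
    return "active"            # live in the DW / analytical layer
```

An archiving process would then periodically evaluate this rule per table or partition and trigger the corresponding move or delete.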
There can be many reasons why archiving is relevant:
Then on top of the above you need to add your own policies/governance about what/when to archive.
As you mention, the price per GB is getting lower every day - and that is positive for current and future implementations - BUT I have no doubt that a new lower cost will only mean that the cost baseline moves, and will therefore still lead to thoughts on how money can be saved. New technologies like Hadoop are also showing a price per GB that businesses see as lower than traditional SAN disk implementations.
Whether it is the SAS platform or other ETL tools being used, data archiving seems relevant. I have seen plenty of very large DW implementations using the SAS Data Management platform, DI Studio, DataFlux tools and SPD Server - with specific SAS solutions on top.
The type of data archive that is implemented should adhere to more specific business requirements - like:
Then the actual implementation can be done in the ways you are describing - as an integral part of the normal ETL routines in the DW setup, with the use of database features like partitioning.
Personally I have not decided which of the two approaches I like more: designing and building the data archive as part of the initial implementation project, or adding it afterwards as an "afterburner".
I am leaning towards the second approach, since the impact on the initial implementation project could be fairly low: design and ideas could naturally be surfaced early - it seems obvious that decisions about the data model implementation can affect how data archiving can be done afterwards. You can then develop the required ETL jobs when needed, without impacting the normal project implementation. On the other hand, it can be difficult to get the funding for something that perhaps should have been done earlier??? For plenty of current DW implementations, the afterburner approach would be the more realistic one for many customers now.
All for now.
Please drop more ideas and thoughts for the topic!