SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

General activities on "data archiving" in a SAS Data Warehouse environment

Reply
SAS Employee
Posts: 4

General activities on "data archiving" in a SAS Data Warehouse environment

Hi there,

 

I am working on a case where we are looking at establishing an archive function on the "back" of the SAS DW implementation. I have been trying to find ressources on data archiving and SAS - but it has been very limited what I have found at this stage. During many DW projects, scoping, estimating etc. - data archiving has rarealy been discussed - but in this time and age, it seems more and more obvious that I probably needs to be a more proactive required choice in a DW implementation. And now to the question:

What are you doing out there at the SAS world. Are you adding extra disc-technology, eg. adding Hadoop, creating postprocesses, buying special DW appliance technology when trying to cater for large and expanding data volume in your environments.

 

Please drop me your thoughts for an open dialogue here.

 

Thanks in advance

Jan zilstorff

Respected Advisor
Posts: 3,901

Re: General activities on "data archiving" in a SAS Data Warehouse environment

[ Edited ]

@HrZiller

Do you mean by archiving keeping old versions of data in a separate area or are you talking about backup and recovery?

 

backup and recovery

I make this normally someone elses "problem. This is either data in a data base or (SAS) files on a file system. All reasonably mature organizations have already processes in place. My role is then just to define the volume of data for backup and recovery and in bigger organisations also the level of data security and SLA required (as this can make a huge difference to cost).

 

Archiving

I found the following definition:

"Data archiving is the process of moving data that is no longer actively used to a separate storage device for long-term retention. Archive data consists of older datathat is still important to the organization and may be needed for future reference, as well as data that must be retained for regulatory compliance."

 

I guess here is the question a bit about the "why", cost/benefit and if/how often and with what acceptable delay you still might require access to the archived data. 

 

What I've done myself in the past or have other's seen doing.

1. For transactional data (facts tables) stored under Oracle: Create montly partitions and implement in Oracle a process which compresses such partitions after a while (and there are different levels of compression) and also moves old partitions to cheaper storage - and in the end delete them (other DB's should allow for comparable approaches, i.e. Teradata can have processes in place which moves data which is not "hot" to cheaper and slower storage).

2. For dimensional data: Move the data to a history table. That's an expensive process as you need to implement an own set of archiving ETL's to determine which data can get moved. You also need to maintain full referential integrity in your database. That can be costly to implement and it's not always worth it. You can go for a data model where you keep history (old versions of data, in a _Version table and you can have these tables on cheaper storage and/or compressed but then slower to access).

3. At a few places where it was about data marts implemented as montly full snapshots stored as SAS files: I went for a process which after a few months zipped these SAS files (reducing storage space up to 90%) and then deleted zip files older than 2 years.

 

 

If you don't have to comply with regulatory requirements then I'd go for an approach where you design and implement some sort of archiving for fast growing transactional data (which is not that hard) but you make a good cost/benefit analysis for lower volume and much slower growing/changing referential and dimensional data.

 

I've done quite a bit of DI work but I've never been involved in a project where SAS had been used as ETL tool for a large EDW. The projects I've been involved where more about SAS solution specifc or Analytical DW's - and these tend to change faster than an EDW.

So besides of growing data volumes I'd also ask: How long is the current data model going to survive and when do I expect the next data migration to a new model. And if that's below 10 years then it's eventually not the wrong call to decide to spend the "archiving" effort as part of data migration by just not migrating everything. ...and then: Storage space is still getting cheaper and cheaper and I/O faster and faster. So also here: Is the expected data growth and cost of archiving processes (design, implementation, operations) worth it?

 

I'm sure there are many more valid options out there and what's right will "depend".

 

Thanks,

Patrick

 

SAS Employee
Posts: 4

Re: General activities on "data archiving" in a SAS Data Warehouse environment

Hi Patrick,

 

Thanks for your input to my question! Great to hear what is being considered!

 

A valid question you are asking: am I talking about "keep older versions of data" or talking about backup/recovery? ->> I am specificially talking about "keep older versions of data" and NOT backup/recovery

 

Both data archiving and backup/recovery aree two very valid topics in any DW implementation - but for this post here - archiving data is what I am interested in.

 

I can add that for backup and recovery - Margaret Crevar from SAS has created this paper for SGF2017:

http://support.sas.com/resources/papers/proceedings17/SAS0565-2017.pdf - good reference for the dialogue with IT regarding proper handling of backup/recovery procedures for the SAS platform - not a simple task to get right.

 

Back to the topic: Data Archiving

 

Basically it is the "data life cycle" that I am addressing here - and how to handle the archiving part:

From "birth" in source systems -> entry into DW/Analytical architecture -> to retirement (available but perhaps rarely used) -> deletion and not available any more.

 

There can be many reasons why archiving is relevant:

  • Ever increasing data volume
    • Trying to reduce price pr. GB by moving to different lower cost disc-location
    • Trying to ensure optimal performance in production DW/Analytical environment by reducing number of records
  • Compliance reasons: regulations e.g.. upcoming EU GDPR regulation (ensure you are only processing data you are legally allowed to process)
  • Ensure general business requirements are meet for what ever reason that may be valid
  • Project/solution lifetime is long enough and therefore data archiving will be relevant. For more short-term implementations data archiving may not be relevant.

 

Then on top of the above you need to add your own policies/governance about what/when to archive. 

 

As you mention prices per. GB is getting lower every day - and that is positive for current and future implementations - BUT I have no doubt that a new lower cost will only mean that cost baseline moves and therefore still will lead to thoughts on how money can be saved.  New technologies like Hadoop also are showing a price/GB that business are seeing as lower than traditional SAN disc implementations.

 

Whether the it is SAS platform or other ETL tools that are being used - data archiving seems relevant. I have seen plenty of very large DW implementations with the use of SAS Data Management platform, DI Studio, DataFlux tools, SPD Server - with specific SAS solutions on top.

 

The type of data archive that is implemented should adhere to more specific business requirements - like:

  • Should I be able to access data with XYZ tool?
  • Should data archive be considered as a part of the normal production environment and therefore available 24/7/365?
  • Should there be backup of the archive?
  • Etc…

 

Then the actual implementation can be done in ways that you are describing - as in integral part of the normal ETL routines in the DW setup - with use of database features like partitioning.

 

Personally I have not decided whether which one of the two below approaches I like more:

  1. Build archive functionality along with normal ETL development during DW project implementation
  2. Build afterburner ETL/programs that can processes selected relevant data for archive purposes
  3. Manual approach where data are moved based on individual choices

 

I am leaning towards the second approach, since the impact on the initial implementation project could be fairly low: design and ideas could naturally be surfaced early - it would seem obvious that thoughts on data model implementation can affect how data archiving can be done afterword's. You can then develop the required ETL jobs when required and not impact the normal project implementation. On the other hand - it can be difficult to get the funding for something that perhaps should have been done earlier???  With plenty of current DW implementations - the afterburner approach would be more realistic for many customers now.

 

All for now.

 

Please drop more ideas and thoughts for the topic!

 

Regards

Jan

Ask a Question
Discussion stats
  • 2 replies
  • 141 views
  • 1 like
  • 2 in conversation