05-25-2015 10:48 AM
We were asked to move all of our SQL Server / Teradata data to a SAS grid so that the grid can serve as a data warehouse.
I understand this is possible and tried looking online for examples, but haven't found any.
We will have about 200 GB of data in total to move and have 1 TB allocated in the grid. I also noticed that SAS takes more space than the corresponding Teradata table when I moved a 10 GB table to SAS.
I would like to hear from the experts here:
1. Can a SAS grid be used as a data warehouse? What are the advantages / disadvantages of using a SAS grid to store and access data for reporting / analytics purposes?
2. How do you move data from Teradata / SQL Server into SAS tables more quickly (are there any FastLoad / bulk-load options)?
3. Why does SAS take more space than Teradata / SQL Server for the same data (10 GB vs. 50 GB)?
Thanks to all for your insights!
05-25-2015 02:58 PM
05-25-2015 05:13 PM
Thanks for the insights. Here are some additional details:
First, you don't have to move to SAS (unless you wish to save on software licenses). There are examples where SAS accesses Teradata installations on the grid and performs well, from what I understand. But is this a regular job, or a one-time dump? SAS doesn't have any fast/bulk-load options, since SAS is a very lightweight storage system. If you rely heavily on indexing, choose SPDE or evaluate SPD Server; they are much better when it comes to incremental updates on indexes.
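For reference, assigning an SPDE library is just a LIBNAME statement; a minimal sketch (all paths here are hypothetical) might look like:

```sas
/* Hypothetical paths -- a minimal SPDE library definition.
   SPDE splits metadata, data, and index files across paths,
   which is part of why it handles index updates better. */
libname mart spde '/grid/spde/meta'
    datapath=('/grid/spde/data1' '/grid/spde/data2')
    indexpath=('/grid/spde/idx')
    partsize=512m;   /* size of each data partition file */
```

Tables created in `mart` are then read and written like any other SAS library, but with partitioned storage underneath.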
The department has already licensed SAS. They are moving to SAS for a few reasons: cost (~$2,500/yr for SAS vs. ~$10,000/yr for Teradata to store 1 TB of data), and the expectation that data locality will reduce overhead for reporting through Web Report Studio or a similar web interface for the users. We were asked to source the Teradata / SQL Server tables and create ETL processes that run daily and build mini data marts in SAS, along with a one-time move of the history tables containing 30-40 million rows each (~10 tables like this). I raised concerns about network I/O and other issues we could face with the formats, and I'm trying to do more research on the process. I tried using views, but in the end they take about as much time as tables when performing the ETL that creates the data marts. It seems this is going to be more challenging than everyone thinks.
In terms of the space issue, I will get the details later this week and post them, but I understand from both replies that it needs a lot of tuning and manual intervention. I will have to study the items Jaap mentioned and will get back later. Thanks!
05-26-2015 03:04 AM
If you have SAS DI Studio licensed, the easiest approach is to build SAS ETL jobs that extract the required data from Teradata / SQL Server.
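Under the hood, such a job comes down to a SAS/ACCESS library assignment plus a DATA step or PROC APPEND. A minimal sketch (server name, credentials, and table names are all placeholders):

```sas
/* Hypothetical connection details -- extract via SAS/ACCESS to Teradata.
   READBUFF= batches rows per fetch; for very large one-time pulls the
   Teradata engine also supports FastExport (FASTEXPORT=YES). */
libname td teradata server='tdprod' user=etl_user password='XXXXXX'
    database=sales readbuff=5000;

libname mart '/grid/data/marts';

/* Copy one source table into a compressed SAS data set */
data mart.orders(compress=binary);
    set td.orders;
run;
```

The same pattern works for SQL Server with `libname ... sqlsvr` or `odbc`, only the connection options differ.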
Network issues shouldn't be a problem if the shared file system is configured properly. There were some very useful papers on this from SAS Global Forum this year.
05-25-2015 03:31 PM
1/ A grid is a Hadoop-like approach: many machines, each supporting some part of the data.
With a SAN connected to all machines, you introduce the problems of queueing/locking/synchronization. To solve those you would have to build something like Hadoop or Teradata.
With 200 GB of data and a total size of 1 TB, we are more in the territory where a single machine (compute tier) could handle it well. It is approaching common desktop capacity.
2/ By default, SAS data is not compressed. SQL Server and Teradata, on the other hand, have those options. With Teradata you also have a grid approach (the BYNET interconnect) and hardware compression.
SAS data sets are optimized for sequential processing, and they share some principles with a DBMS in that records are stored in blocks.
There are two types of compression with SAS: a/ RLE (run-length encoding), also named "CHAR", and b/ RDC (Ross Data Compression), also known as "BINARY"; see the LIBNAME statement in SAS(R) 9.4 Statements: Reference, Third Edition.
With long records (many variables), the binary approach is better; it adds overhead in a way similar to zip. The CHAR option does well with shorter records whose character fields contain many repeated characters (e.g., spaces).
The name "binary" is misleading: it is not for numerics only but handles the whole SAS record.
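The two settings above are just values of the COMPRESS= data set or system option; a small sketch (data set names are illustrative):

```sas
/* Illustrative comparison of the two COMPRESS= settings.
   work.wide stands in for any existing wide table. */
data work.wide_rle(compress=char);    /* RLE: repeated chars, blank-padded fields */
    set work.wide;
run;

data work.wide_rdc(compress=binary);  /* RDC: long records with mixed content */
    set work.wide;
run;

/* The SAS log for each step reports the compression ratio achieved,
   and PROC CONTENTS shows the compression setting in effect. */
proc contents data=work.wide_rdc;
run;
```

Since compression can occasionally *increase* size on short records, it is worth checking the ratio the log reports before committing to one setting for the whole warehouse.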
3/ There is a lot going on with SAS data sets at the OS level, and when using them you must do the tuning yourself. DIRECTIO (bypassing the system cache), BUFNO, ALIGNSASIOFILES, and BUFSIZE can have a big impact on performance. This is nothing different from what a DBA would do for a DBMS, but now you have to do it yourself. Whether an SPD Server would give additional benefits is questionable.
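As a sketch of where those knobs go (the values here are illustrative, not recommendations, and the table names are placeholders):

```sas
/* Session-level I/O tuning: larger pages and more buffers per open file. */
options bufsize=128k bufno=10 aligniofiles;

libname hist '/grid/data/hist';

/* BUFSIZE can also be set per data set at creation time. */
data hist.orders_hist(bufsize=128k compress=binary);
    set work.staging_orders;   /* hypothetical staged input */
run;
```

Reasonable values depend on the file system's stripe/block size, which is why this ends up being the kind of tuning a DBA would otherwise do for you.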