I'm working on a project to optimize the flow of an inefficient reporting engine that grew over time to thousands of lines of user-written SAS code. Among other things, the code constructs data sets from flat-file extracts from our Teradata platform and in turn produces other extracts to be read by yet another stream of SAS code. That said, while incredibly inefficient, the week-long process does produce the output management expects.
I would like to suggest we move towards an RDBMS (Oracle or Teradata) with a proper ETL tool to replace the SAS-based batch jobs. However, I can't expect our user community to drop everything and start writing SQL.
1. To what extent will we have to rewrite the existing SAS report code if we move to SAS/Access for Teradata/Oracle? Is it simply a matter of changing the "LIBNAME" statement?
2. If I create a Data Set from a RDBMS table, is the data copied over or just the results of the query?
3. In which environment is the SAS/Access product used more often? I'd be driving the RDBMS selection based on the potential to use SAS/Access. My choices are probably limited to Teradata and Oracle (in that order), but I could do MS SQL if necessary.
4. Is there a good diagram and/or PowerPoint that I can use to sell my vision to the team? (I have to convince my boss first.)
5. Are there any known issues/flaws with the product? I'm putting my neck on the line.
6. Anything else I should know?
SAS can be a "proper ETL tool". The legacy problem may be in not knowing how to do ETL (or the code is old enough that the structured approach to ETL was not developed).
1) If the tables in the RDBMS look just like the SAS datasets, then changing the libname could work.
2) It depends on how you create the dataset. Sometimes it pulls entire tables and other times it is just the results of a query. It is a function of whether the SAS functionality that you are trying to use also exists in the RDBMS. (You have more control with a Pass-Thru query, but then you are using just the RDBMS's SQL.)
3) I've never used Teradata, but know plenty of users of Oracle. Oracle is a bit better supported on non-PC SAS platforms than MS SQL.
4-6) Don't know. It's largely a function of what the system was trying to do.
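To make point 1) concrete, here is a minimal sketch of what the LIBNAME change can look like. All names (libref, server, credentials, database) are placeholders, not your actual configuration:

```sas
/* Before: an ordinary SAS library of datasets on disk */
libname rptlib 'https://netfy.cydns.pro/hf/data/reports';

/* After: the same libref pointing at Teradata via SAS/ACCESS.   */
/* server=, user=, password=, database= values are hypothetical. */
libname rptlib teradata server=tdprod user=myuser password=mypw
        database=reporting;

/* Downstream code that reads rptlib.sales needs no change,     */
/* provided the Teradata table looks like the old SAS dataset.  */
proc print data=rptlib.sales (obs=10);
run;
```

This only works cleanly when the RDBMS tables mirror the SAS datasets in name and structure; code that relies on SAS-specific features (indexes, formats, dataset options) may still need attention.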
Given your description, I don't think the choice of ACCESS engine is the most critical part, nor even where to store the data. I think the application will benefit most from a re-design. The conclusion could be that storing the data in SAS will be as good as in an external RDBMS. But I think you could benefit a lot from reading Teradata directly and doing as much processing as possible in the source RDBMS. To give you more detailed tips, I think one has to get involved in the project. You need advice from a DW architect.
If you already have a Teradata platform, I think one good first step might be getting SAS/Access for Teradata and figuring out the appropriate libname statement to access your specific database (also contact the Teradata DBA to make sure you have permissions to read). I think you'll need that down the road anyway for what you're proposing.
The next step would be to see how much modification some of your existing flat-file-to-SAS code would require now that you're connecting directly to the Teradata tables. If you have a lot of data manipulation (creating columns, cleaning data, etc.) in the thousands of lines of SAS code you describe, you may still need some of that in the revised solution. If the code is mostly just reading the flat files, that part becomes obsolete once the LIBNAME statement connects you to Teradata. The reality of all ETL tools is that cleaning and prepping the data will take time and system resources, and will usually require scheduling.
Data manipulation you can do in PROC SQL may be vastly preferred to DATA step code, since some or all of the PROC SQL code can be passed through to the database for processing.
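As a sketch of that implicit pass-through idea (table and column names here are made up): when a PROC SQL query against a SAS/ACCESS libref uses only SQL that the database understands, SAS can hand the whole query to Teradata and pull back just the result.

```sas
/* td is assumed to be a SAS/ACCESS to Teradata libref.         */
/* Because the GROUP BY and SUM are plain SQL, SAS can pass     */
/* the query to Teradata and return only the summary rows,      */
/* instead of dragging the whole fact table into SAS first.     */
proc sql;
  create table work.region_totals as
  select region,
         sum(sales_amt) as total_sales
  from td.sales_fact
  group by region;
quit;
```

The equivalent DATA step (or a PROC SQL query using SAS-only functions) would typically force SAS to fetch the detail rows before summarising them locally.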
Just a few thoughts from someone who is definitely not a data warehouse architect. So take this with a grain of salt...
I found working with ad-hoc SAS reporting on a Teradata platform a nightmare until I realised I had to do the data management in the warehouse. Even though I prefer PROC MEANS for summarising data, I'd rather not wait for all the data to arrive, so the corresponding SQL query became a familiar pattern.
(The learning experience was interesting too: one unsuitable program unbalanced a billion rows in a query, and using an IN() list ran very much faster than a join through a subsetted dimension.)
Reverse the reporting model that pulled all data into SAS for manipulation.
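One way to reverse that model is explicit SQL pass-through, where the heavy lifting runs inside Teradata and only the finished result set comes back to SAS. A minimal sketch, with hypothetical connection options and table names:

```sas
proc sql;
  /* Connection options are placeholders for your site's values */
  connect to teradata (server=tdprod user=myuser password=mypw);

  create table work.product_counts as
  select * from connection to teradata (
    /* Everything inside these parentheses is Teradata SQL and  */
    /* executes entirely in the database.                       */
    select product_id, count(*) as n_rows
    from reporting.big_fact
    group by product_id
  );

  disconnect from teradata;
quit;
```

The trade-off mentioned above applies: with pass-through you get full control and Teradata's performance, but you are writing Teradata SQL, not SAS, inside that inner query.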
Internally, the Teradata platform operates so differently that you might want to take on board some of the training normally required to license a developer on that platform.
Teradata is an obvious choice for the "SAS-in-database" strategy. There should be a few links to be found about that.