Flat file vs. Database -- there is no easy answer.
1) This is somewhat politically charged with SAS, as SAS and its datasets and various engines can be used as/for (and technically is) a database.
2) Most relational type database management systems -- Oracle, DB2, etc. -- are more than just an RDMS, but also are transactional engines, providing ACID properties, which include abilities like being able to roll back a transaction. Base SAS, by itself, without explicit user programming, does not do this.
SAS does not necessarily support sharability of data easily either. This is simply evidenced by the fact that yesterday, my production loading of a permanent SAS dataset failed because someone had accessed that file with EG the day before and had left for the night without shutting down their EG.
While SAS does support multiple threads, per se, it is not inherently multi-threaded, as it's heritage is in extremely fast sequential processing on a mainframe.
In a distributed environment of SAS servers and libraries or files and data, it is generally fastest to run initial selects and summarizations on the server that holds the data. What is reported back to EG is always a simple limited subset of the results.
If I were to design a distributed SAS system from scratch -- which I have -- each user gets EG, SAS itself would live on an AIX box, and in the Mainframe sysplex if there is one. I would not necessarily have SAS solely on a Mainframe (topic for a different time/thread). Since some features of SAS work best/only on a Windows box, I would have at least one Windows server. Power users and developers would get PC SAS on their workstation with extra memory and at least one extra hard drive.
So far, it has been my observation that EG is a great ad hoc analysis and reporting tool, but has many shortcomings for regularly sheduled batch processing. That's why you need SAS developers for creating those things.
My limited experience to date with SAS's MetaData Server is not a happy one. But, a properly administrated meta-data repository for EG is essential and very helpful. This has to be on a separate server, and I believe it still has to be a Windows box (ugh).
The only things possibly faster than SAS at reading a flat file may be a well written compiled COBOL or C program, maybe. But, it'll take a programmer longer to write these COBOL or C programs than it will a SAS programmer, and SAS is superiorly flexible at it.
I am a firm believer that data should be placed into an RDMS to provide ready access to many different application programs. Only SAS can read SAS datasets, as far as I know. RDMS's provide better security controls, transactional process controls, etc. While much of this can be done with/in SAS, it is a lot more work because the core competency of SAS is data analysis of large datasets and simplifying the execution of highly sophisticated statistical analyses.
All kinds of things can be done RDMS systems to improve their performance from many aspects. Modern RDMS's are highly threaded and will outperform SAS on initial queries/views of complicated data structure tables. Indexing is easier to manage and maintain. Data integrity is easier to mainain. All sorts of things.
But, after I have pulled the data I want from the database, I want it on the box that is running SAS, and no DMBS nor its language is going to allow me to process that data faster for the results I need, and the programming, If I had tried, would have been a lot harder, or in some cases impossible (Oracle's PL/SQL just can't do some things that SAS can -- but, it wasn't supposed to in the first place).
SAS can be used to load a database -- using SAS for ETL.
Some prefer to use a special purpose ETL tool like Informatica.
I tended to use Oracle's SQLLoader.
This issue is non-trivial, and no matter how things are sliced and diced, proper ETL is a messy complicated thankless yuck job, especially when the data/process blows up -- who can think of all the Murphyisms? and then there's the issue of the human process(es) involved in resolving them.
If you are lucky, all business transaction processing is already in an RDMS and you can just simply access it.
Minimally, the EG metadata repository will define users, put the users into groups, and define a logical SAS server for each group. These logical group servers will have access to only that group's data (generally their private PSDS's and libaries). Then there will be a logical SAS server that has publicly available libaries assigned to it.
For example:
Mainframe = SAS_Main and has libraries defined for SMF records.
SAS_Public = on AIX box and contains RDMS library defintions for the Data Warehouse.
Marketing = on AIX box, home = /home/marketing. LDAP tools allow only employees of the Marketing department to use Marketing and to access /home/marketing on the box when they log in. /home/marketing is a 200 GB filesystem. It contains two other independent filesystems /home/marketing/sas_work and /home/marketing/sas_sort, each are 100 GB.
Research = on AIX box, home = /home/research. ditto Marketing
Operations = on AIX box, home= /home/operations. ditto Marketing
Business Analytics = BusAnals = on Windows box, home = \\sasprd\Business Analytics. Still ditto Marketing, with the difference that Business Analytics is a share on the box, the real base directory is F:\sas_users\Business Analytics. Business Analytics is on a Windows server because they can't let go of their use of Excel spreadsheets.
Each of these logical servers has its base access directory as the home directory, not the system directory, for security reasons.
The next rule, is that there is a standing rule, that the first item in each project is an "Assign Library" creating EGTASK that points to a directory (folder) specifically and only for that project, to hold the project file, and all the generated datasets. This provides for significantly easier disk space maintenance.
Sorry for the small book, but I hope this helps.
Message was edited by: Chuck