Dear SAS-Experts,
My organization has been using SAS Enterprise Guide for a few years now and we finally got our SAS Server up and running. SAS DI Studio, Web Report Studio, etc.
I don’t have the server specs ready but what the IT guys tell me, it is good.
The big question for me as a data manager is if we can now trash our Oracle DWH with annually 15m records and just move the data to SAS. How is SAS performing with large data set e.g. 150m rows over 10 years’ worth of data. The 15m rows annually are basically created by appending two tables from the Oracle transactional system into a large fact table. Where should we do this data extraction and merging, in the Oracle environment (PL/SQL) or using SAS DI Studio and PROC SQL/SAS BASE. Which is performing better? I know I can’t offer enough technical details to solicit a 100% water tight recommendation but could you maybe point out what is best-practice?
Bye
Philip
Philip,
I would do the join in Oracle. I would start by creating it as a view that is limited to only the columns that are needed by SAS. That will minimize the disk storage requirements and the amount of data to be transferred across the network from the Oracle server to the SAS server (they are often on separate hardware platforms). You can treat the Oracle view as a SAS dataset and not copy the file as a first step.
We just tried it with 50m records via proc sql using the Oracle-DB but Oracle dropped out: ORA-12801 P031 😕
Do you have an idea how to only load the DELTA from the Oracle Transaction System in to the SAS data set? Oracle has Materialized Views to only add the actual changes made. How about with SAS DI? I don’t want to grab the whole lot of data every day and it would be great if the delete/update/insert is done automatically. I read about "Change Data Capture" which would require that we introduce a kind of change-log table. Something which is not yet done using Oracle Materialized Views.
What is the business case for moving your Oracle DW to SAS? I suggest the justification would have to be more than a performance gain, if any.
In that case I suggest you start small by trialling one or two data marts in SAS to get a feel for how it performs with your reporting. A lot of SAS sites are now bypassing OLAP and cubes and are using the new in-memory SAS VA technology. We are about to do this ourselves. It appears to be a lot easier than having to pre-guess dimensions and facts with cubes.
Hi,
you are right. The delta problem is addressed somewhere else..
Bye
Any answer to performance questions has to consider the architecture of the SAS host, the architecture of the database, and the speed of the network connections. Also you need to look at the type of data, and how much of that data has variable length character fields.
Systems and databases can be grid or MPP, which means that they have many processors running in a massively parallel manner. These systems increase performance by 10 to 100 or more times, compared with single processing systems. Also the design of the database affects performance, so it's difficult to compare one database in third normal form and to another may non-normal database. But again, it depends on the queries that run on those databases, since most all databases have an issue where a few bad queries can slow down everyone.
Character data is a potential problem since SAS does not have true variable length data (maybe DS2 will make a liar out of me), but databases do have efficient storage for character data. Also character keys are slower and less efficient than numeric keys, which makes a difference when you are deciding to store BIGINT data as characters in a SAS dataset.
Getting down to performance in general, a SAS grid or an MPP is always better if you can afford it. Whether you have a grid or not, you still need to be concerned about data movement over any network, which is why some programs may take a day or so to finish. Your best bet for performance is to return relatively small result sets from the database, and have all the processing occur in-database, which means only the results are returned and no files are transferred across a network to compute those results.
You will really see the performance difference with big data. But to analyze performance, you must consider all the other relevant factors that affect performance. Oracle and SAS can run in a grid or MPP, and Oracle queries can be designed so they run completely in-database, which is generally going to be faster than any number of single processors on one computer. When you consider all of the factors, then you can make better decisions about performance issues and trade offs.
If you supply some more details about your environment and your budget, then performance recommendations are easier to recommend. Even without a budget or an MPP, you can make tweaks to improve elapsed time by a factor of 2 to 5 times. How much improvement do you need, and how would you measure it?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.