10-18-2017 11:14 AM
Hello all you very smart SAS programmers. I have a question for you about processing of SQL statements on an Oracle database. Here is the basics of what I am experiencing.
I am use an explicit pass through to an Oracle database via proc sql. I wrote a SQL statement that returned roughly 2.1 million records with a run time of about 2 hours. I realized that I was returning more records then I needed so the only change I made to the code was to add a UNIQUE to the select statement. This returned about 150,000 records and increased the run time to about 10 hours. I was in shock! How could a more restrictive query that is designed to return fewer records run loner and not just longer but five times as long.
The obvious answer to me was that there was some sort of system or connectivity issue so I repeated the experiment. It has turned out to be very consistent. My first thought was that I was being a silly little programmer and that I was breaking the pass through by using SAS specific syntax and that to process the unique statement data had to be returned from the native server to the SAS server I am running on. But since I am using an explicit pass through, the statements would not compile or execute if that was the case.
The second thought I had was that UNIQUE is handled by Oracle in some fashion that I didn't anticipate. I did some research on UNIQUE versus DISTINCT hoping to revealed how SQL handles those statements. All I can determine is that the two are synonymous and should return the same results. Nothing about processing. As an experiment, I am going to re-run using a DISTINCT in place of the UNIQUE.
The final thought that I had was in regards to indexes. I know the Oracle table I am pulling from is indexed and I have two WHERE clauses that reference those indexes. One explanation for the increased run times would be that the UNIQUE statement creates a processing situation whereby the indexes are ignored. If this is the case, I can accomplish the same thing as the UNIQUE by using a GROUP BY clause which should then decrease my run time from the 2 hour mark. I am also going to try this as an experiment.
So, after all of that, I'd like to understand better how the UNIQUE or DISTINCT statements are handled by Oracle so that I can predict when using that code is an acceptable solution or when it should be avoided. Does anyone know how the processing works or have any resources they can point me to where I can learn for myself?
10-18-2017 11:29 AM
10-20-2017 12:08 PM
Yeah, I know it's an Oracle thing but I wanted a SAS user perspective rather than just the Oracle. At the end of the day, I need to know how to use SAS in the Oracle environment and Oracle people can't help with that.
That being said, I went to one of our Oracle experts (I didn't know she was an Oracle expert or I would have started with her). She did give some things to optimize but we think we figured out what was going on. I think this is probably of interest to SAS coders and I'm curious if proc sql handles this the same way. It turns out that the DISTINCT statement executes after all other selecting is done. So in my case, after the primary query had returned the 2.1 million records the DISTINCT then executed to reduce the size down to 120,000. In the execution of the DISTINCT (in Oracle at least, I'm curious as to proc sql and other database types) there is an implicit sort that occurs. As we all know, sorts can be very memory intensive and this is what was causing the poor time performance. As a user I am only allocated a limited amount of memory when I connect to the database. The database had to hold the 2.1 million records, perform the sort, and store the resultant table. Because of the size of the dataset, the majority of my allocated memory was already occupied. In order to accomplish the DISTINCT, Oracle was performing all sorts of gymnastics to open up space for temporary use, use it, and then clear it again. The solutions are to be allocated more memory (not going to happen), use less memory (i.e. return fewer records) or not perform the DISTINCT. I went with option number three.
As an aside, I was also able to find a different index to select on and added some other code efficiencies that I was unaware of. The end result was twofold, I reduced the run time from 2 hours down to under 5 minutes and I made a new Oracle friend