06-06-2016 01:41 PM
I'm on SAS 9.4 Linux, reading a large amount of data (50M records, 20 variables) from a SQL Server database via explicit pass-through. So the code is something like:
proc sql;
  connect to sqlsvr as mydatabase (datasrc=mydatabase user='me' AUTHDOMAIN="mydomain" readbuff=3000);
  create table want (compress=yes) as
    select * from connection to mydatabase
    ( select * from mydatabase.mytable
      /* actual query joins a few tables and
         has a where clause to select 50M records */
    );
quit;
When this query executes on a SAS server close to my database (in both geographic space and network space) it might take 2 hours, but when the query executes on a SAS server far far away, it might take 6 hours.
It feels like the run time difference is time to transfer data (I'm not totally convinced this is the only issue, as I haven't yet controlled for other folks running big jobs against the database at the same time as mine, etc.).
I'm looking for tips on how to decrease the run time, and also on how to test various efficiency improvements.
I have started playing with the readbuff option, which seems to be the most obvious place to start (after optimizing the query itself).
Are there other system options I should consider, to improve read efficiency / transfer efficiency?
If I play with SASTRACE option or similar, is there a setting that would help me distinguish between how long it took the query to run on the SQL server database, and how long it took to transfer the results to my SAS server?
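For reference, the tracing options I've been experimenting with look something like this (my understanding is that the 'sa' setting adds timing summaries to the log, though I'm not sure how much detail it gives for explicit pass-through):

```sas
/* Turn on DBMS tracing before running the pass-through step.   */
/* ',,,sa' requests timing information plus a summary;          */
/* SASTRACELOC=SASLOG sends the trace output to the SAS log.    */
options sastrace=',,,sa' sastraceloc=saslog nostsuffix;
```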
I'm having a hard time testing the impact of different tweaks, I think due to database caching. If I run the same explicit pass-through query twice, usually the second run executes much faster. Is there a way in my explicit pass-through query to tell SQL Server not to use database caching, so that it reads the data from disk every time? Anything else I could do to decrease the variability I see in run times (other than kicking everybody off the database server and SAS server, which isn't going to happen : )
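If it helps clarify what I'm after, I was imagining something along these lines before each timed run, though I gather DBCC DROPCLEANBUFFERS requires sysadmin rights on the SQL Server side, so it may not be an option for me (connection options as in my example above):

```sas
proc sql;
  connect to sqlsvr as mydatabase (datasrc=mydatabase user='me' AUTHDOMAIN="mydomain");

  /* Flush dirty pages to disk, then drop clean pages from the  */
  /* buffer cache, so the next query reads from disk again.     */
  /* DBCC DROPCLEANBUFFERS needs sysadmin permission.           */
  execute (checkpoint) by mydatabase;
  execute (dbcc dropcleanbuffers) by mydatabase;

  disconnect from mydatabase;
quit;
```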
06-06-2016 02:18 PM
These are tricky...
You have 4 "pieces" to the performance puzzle:
1. Data being extracted from SQL*Server. Joining, and "limiting" (I use the term loosely) your response to 50 million records, is a big job; you can expect to wait. Also, database response is inherently variable, depending on the size and nature of the load from other users. One thing that is usually suggested is to run the retrieval just on the database to a local file, and see if that takes an order of magnitude less, but in your case I think that your two hour response sounds reasonable. Of course, there can be database tricks (indexes, etc.) that could speed up your retrieval, but that's a job for a SQL*Server expert.
2. Nature of the data being extracted. You mention 20 variables, but the type of variable matters. If these are SAS numeric (64 bit floating point) on SQL*Server, you're pulling around 8 gigabytes over your comms line. If they're 1 byte character, only around 10% of that. If they're 100 byte character, 10 times more.
3. Communications link. Often this is a problem, with SQL*Server on the sending end and SAS on the receiving, just puttering along waiting for data to squeeze through a thin pipe. This is a really easy one to test; figure out your total data volume, find or create a dataset (doesn't matter what) that's about 10% of the volume, and send it back and forth via FTP a few times. See i) how long it takes, and ii) what kind of variability you see.
4. SAS on the receiving end. Very unlikely that this is the problem, but if you're running on a really underpowered server, it could be. How long does it take to copy your 50m record dataset once it's in SAS?
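Timing that copy is trivial; something like this (using the `want` table from your original post) and then reading the "real time" note in the log isolates local SAS I/O from the database and the network:

```sas
/* Copy the already-extracted table; the step's "real time" in   */
/* the log measures pure local SAS read/write performance.       */
data copytest;
  set want;
run;
```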
Your idea about increasing the data volume per transfer is really good; I don't know enough about SQL*Server to know if you're doing it right.
Good luck, and post back with whatever you find out. This stuff is really useful to all of us!
06-06-2016 03:47 PM
In addition to @TomKari's excellent advice a couple of thoughts:
If you change your query to just a SELECT count(*) this will give you the time required just to do the query itself.
Try larger values for the READBUFF option on the CONNECT statement to see if that has any impact.
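That COUNT(*) test might look like this (a sketch; connection options copied from your original post, with your joins and WHERE clause inside the inner query). Only one row comes back, so essentially all of the elapsed time is query execution rather than transfer:

```sas
proc sql;
  connect to sqlsvr as mydatabase (datasrc=mydatabase user='me' AUTHDOMAIN="mydomain");

  /* Same joins and WHERE clause as the real query, but only a   */
  /* single row is returned, so elapsed time ~= execution time.  */
  select n into :rowcount
    from connection to mydatabase
    ( select count(*) as n from mydatabase.mytable /* + joins and where */ );

  disconnect from mydatabase;
quit;

%put NOTE: query returned &rowcount rows.;
```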
06-07-2016 07:00 AM
If the same query with the same SAS system options set on different SAS servers differs that much, I'd put my money on the network. Potentially, it could be the write performance on the target libname (this could easily be tested by having a DATA _NULL_ step read a view with the same query).
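A sketch of that test, reusing the connection options from the original post. The view defers execution, and reading it with DATA _NULL_ pulls every row across the network but writes nothing, so if this run is much faster than the CREATE TABLE run, the bottleneck is the write to the target library:

```sas
proc sql;
  connect to sqlsvr as mydatabase (datasrc=mydatabase user='me' AUTHDOMAIN="mydomain");

  /* No rows move until the view is read.                        */
  create view wantv as
    select * from connection to mydatabase
    ( select * from mydatabase.mytable /* + joins and where */ );
quit;

/* Reads (and discards) every row: measures query + transfer     */
/* time with the local write time taken out of the picture.      */
data _null_;
  set wantv;
run;
```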
Since you are doing an explicit SQL pass-through, much help from SASTRACE can't be expected; its major benefit is showing what the SAS SQL is translated to in the underlying DBMS.
06-09-2016 01:52 PM
Thanks all for your helpful suggestions. I'm still exploring, will write more when I come to more conclusions (hopefully).
My hunch is still focused on the network. I ran some more tests from the two SAS servers, and the distant server was consistently slower.
I have not yet figured out how to distinguish between query time and transfer time. Unfortunately, I don't think I can use @TomKari's suggestion to FTP a sample file over the wire. I think the only port open to me on the database server is whatever port is used for ODBC connections. I don't think I could get SFTP working, or get somebody to put a file on that server that I could FTP (but I may need to try harder).

I also tried running the query using SQL Server Management Studio from my PC, but even there I was disappointed that there does not seem to be a way to separate the execution time of the query from the time to transfer the resulting records back to the client. I need to play more with @SASkiwi's suggestion to use SELECT COUNT(*) to avoid transmitting records while still executing the same query. I'm just hoping the optimizer won't be smart enough to short-circuit that.

@LinusH is correct that SASTRACE doesn't seem to help me at all. I suppose what I really need is a DBA on the database side who could turn on logging and tell me what they're seeing. And also a network admin...
09-27-2016 04:45 PM
You may be able to run an EXPLAIN PLAN command to see whether the queries from the two servers are using the same execution strategy. It could be the case that there is a missing index, or stale statistics, on the remote database. This will help you rule out "the database." Other good ideas have been mentioned:
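On SQL Server the EXPLAIN PLAN equivalent is SHOWPLAN (or the estimated execution plan in Management Studio). If pass-through is the only access you have, something along these lines may work, though I haven't verified that every ODBC driver returns the plan rows (connection options copied from the original post):

```sas
proc sql;
  connect to sqlsvr as mydatabase (datasrc=mydatabase user='me' AUTHDOMAIN="mydomain");

  /* With SHOWPLAN_ALL on, SQL Server returns the estimated plan */
  /* for the SELECT instead of executing it.                     */
  execute (set showplan_all on) by mydatabase;

  create table plan as
    select * from connection to mydatabase
    ( select * from mydatabase.mytable /* + joins and where */ );

  execute (set showplan_all off) by mydatabase;
  disconnect from mydatabase;
quit;
```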
Your example includes a SELECT *. Limiting the columns to only those you need may help increase your performance. This is especially true if you are reading large character columns. Omitting these columns, or altering the resulting SAS formats, may help here.
Using the FULLSTIMER option can help you determine if the time is being spent in SAS. It is probably being spent in network traffic.
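To enable it:

```sas
/* FULLSTIMER adds memory and detailed CPU statistics to each    */
/* step's log notes. A large gap between "real time" and total   */
/* CPU time usually means the step was waiting on the network    */
/* or the database rather than doing work in SAS.                */
options fullstimer;
```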