04-19-2017 11:48 AM
I'm using an explicit pass-through query from a Linux SAS server to pull about 5 million records from a SQL Server database that is far from my SAS server, and it's taking a long time.
Some of the SQL Server columns are defined as the antiquated SQL Server type NTEXT, which come into SAS as length $1024.
I know that when SAS pulls a VARCHAR column, it writes a fixed length SAS variable, so it has to pad all the values. In an explicit pass-through setting, does this padding happen before the data are passed to the SAS server, or after?
For a little while this morning I convinced myself that the second query below would execute faster, because it CAST() data on the SQL server side to limit the length. But in more recent testing, I don't think that's holding up (I'm testing in separate PROC SQL steps, but not sure how much caching may be happening).
proc sql;
  connect to sqlsvr as sql (datasrc=...);

  create table A as
    select MyVar length=1
    from connection to sql
      ( select MyVar as MyVar
        from MyTable );

  create table B as
    select MyVar length=1
    from connection to sql
      ( select cast(MyVar as varchar(1)) as MyVar
        from MyTable );
quit;
Would you expect the above two queries to differ in the amount of data they send across the wire?
When pulling VARCHAR data from a database, if you know the VARCHAR length is excessive, where is the best place to specify the variable length ("best" meaning fastest run time for the SQL step that downloads the data and writes a SAS dataset)?
Other tips on increasing the efficiency of explicit pass-through queries? I've toyed with READBUFF a bit.
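For what it's worth, READBUFF can be set right on the CONNECT statement for the SQLSVR/ODBC engine. A minimal sketch, where the DSN name, table, and buffer size are all made up for illustration and would need tuning against your own row width and memory:

```sas
/* Sketch: a larger READBUFF fetches more rows per network round trip,
   which can matter a lot on a high-latency link.  5000 is only a
   starting point to test, not a recommendation. */
proc sql;
  connect to sqlsvr as sql (datasrc=mydsn readbuff=5000);
  create table work.big as
    select * from connection to sql
      ( select MyVar from MyTable );
  disconnect from sql;
quit;
```

On a slow, high-latency connection the round-trip count tends to dominate, so this is usually the first knob worth experimenting with.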
04-19-2017 04:30 PM
SAS is not installed at the SQL server, SAS/ACCESS uses RDBMS client software. Given this, I can't see how data type transformation can take place on the RDBMS server side.
So the question is perhaps more about how SQL Server transfers VARCHAR vs. trimmed fixed-length CHAR data across the network. I don't know the answer to that, but I would suspect that VARCHAR stays variable length during transfer.
Perhaps your queries aren't typical, but I can't see that explicit pass-through would help performance.
First, check what bandwidth you actually have; your data transfers can't perform better than that.
If your bandwidth is really poor, obviously perform as much processing in SQL Server as you can before returning any (limited/aggregated) result set to SAS.
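The "aggregate before you transfer" idea above can be sketched with a pass-through query; the DSN, table, and column names here are illustrative, not from the thread:

```sas
/* Sketch: push the heavy lifting into SQL Server so only the small,
   aggregated result set crosses the slow link. */
proc sql;
  connect to sqlsvr as sql (datasrc=mydsn);
  create table work.summary as
    select * from connection to sql
      ( select Region,
               count(*)    as N,
               sum(Amount) as Total
        from MyTable
        group by Region );
  disconnect from sql;
quit;
```

Everything inside the parentheses runs on the SQL Server side, so only one row per Region comes back over the wire.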
04-20-2017 11:50 AM
I guess the question is really what the ODBC driver used by the SQLSVR engine (which I think is still a driver SAS licenses from DataDirect?) transfers when it is pulling data from a VARCHAR column. But I agree, it would be shocking if it were padding variables to fixed length *before* transferring them.
Yes, the bandwidth is definitely the troublesome limiting factor, but unfortunately I need to accept that for the mid-term.
I've been testing a bit more today, but my results are so inconsistent I'm not sure what to make of them. I think there are too many uncontrolled variables in play (other jobs on the SAS server, other database jobs, network traffic, database caching, etc.) for me to get reproducible results.
I think for now I will go with CASTing the columns inside the explicit pass-through query to a shorter length. I don't think that should hurt anything, and there is at least the potential that it could help the ODBC driver and/or SAS allocate smaller buffers for these variables, and it will obviously result in writing less data to disk in the end.
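Besides CAST on the server side, SAS/ACCESS also has a DBSASTYPE= data set option that overrides the default SAS length on a libname read. A sketch of both approaches, with all names (DSN, table, column, lengths) invented for illustration:

```sas
/* Two ways to cap the SAS variable length. */

/* 1. Explicit pass-through: CAST on the SQL Server side */
proc sql;
  connect to sqlsvr as sql (datasrc=mydsn);
  create table work.a as
    select * from connection to sql
      ( select cast(MyVar as varchar(50)) as MyVar
        from MyTable );
  disconnect from sql;
quit;

/* 2. Implicit libname read: DBSASTYPE= overrides the default $1024 */
libname db sqlsvr datasrc=mydsn;
data work.b;
  set db.MyTable(dbsastype=(MyVar='CHAR(50)'));
run;
```

Note the DBSASTYPE= route only changes how SAS stores the value; whether it changes what crosses the wire is exactly the open question in this thread.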
04-20-2017 04:15 PM
We extract a LOT of SQL Server data to SAS on Windows using ODBC and we do not experience any of the slowness you have. The only tweak we use is setting the SAS option COMPRESS=BINARY so all tables extracted are stored compressed on disk (this can save 70 percent or more of the space with big VARCHARs). To give you some idea, we can extract nearly 100 million rows across around 100 tables in around 30-40 minutes.
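The COMPRESS=BINARY setting mentioned above can be applied globally or per dataset; the libname and table here are placeholders:

```sas
/* Compress every dataset written in this session.  BINARY (RDC)
   compression tends to do well on wide character rows with lots of
   trailing blanks. */
options compress=binary;

/* Or per dataset, leaving the global option alone: */
data work.extract(compress=binary);
  set db.MyTable;   /* db = a previously assigned SQL Server libref */
run;
```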
04-21-2017 03:33 AM
Compressing *may* reduce resource consumption for IO on the local SAS server, not bandwidth.
To see whether the problem is I/O bound or network bound, use a DATA _NULL_ step to read a SQL Server table.
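That test is only a few lines; DSN and table names here are illustrative:

```sas
/* Sketch: read the table without writing anything locally.  With
   FULLSTIMER on, the reported real time is roughly query time plus
   transfer time, with local disk output taken out of the picture. */
options fullstimer;
libname db sqlsvr datasrc=mydsn;

data _null_;
  set db.MyTable;
run;
```

Compare that real time against the same pull with a CREATE TABLE; a big difference points at local I/O, a small one points at the network or the database.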
04-21-2017 10:14 AM
Thanks @SASKiwi and @LinusH. I'm confident it's network bound. I/O in the work library is fine, and if I query a SQL server that is in the same cage as the SAS server, performance is plenty fast, like what @SASKiwi reported. Unfortunately we are currently breaking the rule of keeping your data and analytics close to each other: for this project the database server and SAS server are on different continents with an ocean in between, on different networks, so there are plenty of hops/firewalls/port forwarding going on. So there's not much hope of a fast transfer in the current environment; I just want something tolerable for the ETL jobs that will run at night.
And thanks @Patrick, I had not known about ANSI_PADDING. If these fields were actually loaded into the database with trailing blanks that are not being trimmed, that could certainly explain why it looks like casting the variables is helping. This data starts in an Access file that is loaded to SQL Server by a DBA, and then I pull from SQL Server. Some of my columns also have a deprecated SQL Server type (NTEXT). Very interesting. I think I'll have to explore this a bit more and chat with the DBA. More links: https://docs.microsoft.com/en-us/sql/t-sql/statements/set-ansi-padding-transact-sql http://wiseman-wiseguy.blogspot.com/2008/05/ansipadding-and-variable-length.html
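One quick way to check for stored trailing blanks in the VARCHAR columns is to compare LEN() and DATALENGTH() on the server side, since in T-SQL LEN() ignores trailing spaces while DATALENGTH() does not. A sketch (DSN, table, and column names invented; this applies to VARCHAR, not NTEXT, which LEN() doesn't accept):

```sas
/* Sketch: if max_bytes is much larger than max_trimmed_len, the
   column was likely stored padded with trailing blanks. */
proc sql;
  connect to sqlsvr as sql (datasrc=mydsn);
  select * from connection to sql
    ( select max(len(MyVar))        as max_trimmed_len,
             max(datalength(MyVar)) as max_bytes
      from MyTable );
  disconnect from sql;
quit;
```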
[Yes, I know SAS could just read the Access file, but due to the odd network setup, the SQL Server database can see the Access file, and the SAS server can see the SQL Server database (crossing some firewalls), but the SAS server can't see the Access file.]
04-21-2017 11:31 AM
One thing I've found helpful over the years is to do some testing with the DBMS native interface (whatever the SQL Server equivalent is of Oracle's SQL*Plus). Try different extraction options in that, and see what kind of data you get. Monitor the bytes transferred over the wire. Then, when you run the same thing in SAS, do the same monitoring; in my experience, the bytes transferred should be reasonably similar.
If you can't get an efficient retrieval using the native SQL tool, I think you'll be totally out of luck with SAS.
04-21-2017 01:36 PM
Thanks @TomKari, makes sense. I managed to get SQL Server Management Studio set up on a Windows server in the same environment as the Linux SAS server, and have been doing some testing with that. It would be better for me to use a non-SAS method to test from the SAS Linux server itself, but at least the network distance from the Windows server will be the same.
When you say "monitor bytes transferred over the wire," do you monitor that with some SAS option? Or from database side? Or third party tool like https://www.wireshark.org/ ?
In general, when I'm confronted with what feels like a slow query, I often face the question of:
1. Is the query itself taking a long time to execute on the database?
2. Is the query taking a long time to return results over the wire?
3. Is SAS taking a long time to write the result to disk?
I don't think I've got a good handle on the best ways to separate those parts (particularly #1 and #2), and I usually don't have a DBA who wants to help me look through database server logs, or a network admin. So I end up mucking around a bit with SAS options like READBUFF and others until I find something that feels tolerable.
Looks like in SQL Server Management Studio there is an option "discard query results after execution" which might be useful to me. Running a query with that option on might be a way to measure query execution time alone, without passing results anywhere (i.e. #1 above).
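On the SAS side, the SASTRACE system option can help split the three parts above, since it writes the DBMS calls (and, with the timing settings, how long they took) to the SAS log. One common incantation, hedged as a starting point rather than the definitive setting:

```sas
/* Sketch: ',,,s' asks for summary timing statistics from the
   SAS/ACCESS engine; SASTRACELOC=SASLOG puts them in the log, and
   FULLSTIMER adds detailed step timings for the SAS side. */
options sastrace=',,,s' sastraceloc=saslog nostsuffix fullstimer;
```

The log then shows DBMS time separately from SAS step time, which goes a long way toward distinguishing #1 from #2 and #3.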
04-21-2017 01:55 PM
Your list of 3 items seems great! If I were to add a fourth item, it would be: 4. check whether your query has any errors or inefficiencies in its joins. Your examples do not, but it may be helpful in the future. Two more things to add: 5. reduce the columns to those that are essential, and 6. create pre-filtered views in SQL Server so that less manipulation is required at the time of the SAS pull.
I used to use SQL Management Studio every day at my last job, for about 8-9 years. I loved that tool! It made writing queries a breeze. I use PROC SQL for most of my data management tasks now because of the versatility [insert zoolander gif] of writing sql, which can be transferred to R, SPSS, MSSQL Studio, etc.
A question, how often do you execute this? If it is once per day, or once every few hours, have you considered a scheduled task via windows task scheduler and SAS at the command line? You could do this, overwrite the current file at each execution, and create a procedural method of archiving old datasets (just in case you would prefer to use a frozen dataset and not the "live" version.) This way, you would be less concerned about the time it takes to pull your dataset because it would effectively be pulled each day before you arrive at work, or back from lunch. Just a thought.
04-21-2017 03:07 PM
Ah, ok. Typically during the development phase you would want to pull and freeze the dataset in its entirety (if you can) onto your machine, so that you need not hit the database overmuch. You can do this either through MS SQL Server via a temporary table (which may automatically be created as a local or global temporary table for you) or via SAS WORK library storage.
Also, an important distinction: A scheduled job is different than the scheduled task that I mentioned. You may already know this, but a scheduled job is executed via database administrative systems, and a scheduled task is something that you can create on your desktop as a means of executing command line executables at certain times.
04-21-2017 02:43 PM
Wow! That Wireshark looks like quite something!
Yes, something like that. All I want is to get an idea of how big is the result dataset, and how long did it take to come over the wire. I'm not that sophisticated, so I would just turn on Windows performance monitor, set it up to monitor the network traffic, and watch the size of the dataset grow in File Manager. Then, create a query in SSMS that will result in a result dataset being created on your local machine (you can pretty much always assume that the local machine writing to local disk will be less of a bottleneck than the network), and let her rip!
If you're waiting a long time and seeing no network traffic, you know it's the database server resolving your query.
If you see consistent low-level traffic that results in about the right sized dataset, there might be something wrong with your network. Try just copying a really big dataset from your server to your local machine, and see what the difference in throughput is.
If you see traffic right up to your network capacity, but nowhere near that much data is ending up in your result dataset, something weird is happening (I've never seen that).
In my experience, this doesn't have to be super sophisticated. "Back of the envelope" calculations have always been good enough for me.
04-21-2017 03:07 PM
Thanks Tom. Will give Windows performance monitor a shot. Haven't used it before.
That may also come in handy for monitoring transfer speeds when my users run a stored process on the distant SAS server that has to return a big file to their browser. We've got some stored processes that basically just dump data to Excel and return it to the user. With the remote server, it's easy to see that the SAS job completes in a reasonable amount of time, but if it's returning a big file, the user can be twiddling their thumbs for a while waiting for it to download. Life was easier when we had a local server, sigh.
04-24-2017 10:08 PM
If you want to tell the difference between the time the SQL query runs versus the time to send over your remote network, just change your query from a CREATE TABLE to a SELECT count(*). This works well if you are using PASSTHRU.
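The count(*) trick can wrap the original pass-through query so the server still does all the work but only one row comes back. A sketch with invented names (T-SQL requires the derived table alias, here `q`):

```sas
/* Sketch: elapsed time for this step is essentially server-side
   query time, with network transfer of the result taken out. */
proc sql;
  connect to sqlsvr as sql (datasrc=mydsn);
  select n from connection to sql
    ( select count(*) as n
      from (select MyVar from MyTable) q );
  disconnect from sql;
quit;
```

Subtract that elapsed time from the full CREATE TABLE run to estimate the transfer-plus-write share.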
Also, just a thought with slow queries: is it possible to run several in parallel, each one doing a different slice of your result set? Might be worth a test if you have the bandwidth but each individual query is being throttled back.
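One slice of such a parallel pull might look like the sketch below; the key column, range, and DSN are all invented, and each slice would need to run in its own SAS session (e.g. separate batch jobs) for the transfers to actually overlap:

```sas
/* Sketch: one slice of the full extract, restricted by a key range.
   A second session would pull the complementary range into slice2,
   and the slices would be appended afterward. */
proc sql;
  connect to sqlsvr as sql (datasrc=mydsn);
  create table work.slice1 as
    select * from connection to sql
      ( select MyVar
        from MyTable
        where Id between 1 and 2500000 );
  disconnect from sql;
quit;
```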