Hello! Does anyone know if there is a book or area I can search that will explain the relationship between SAS and sql server tables pulled through a libname odbc connection?
FOr example, if I had the code:
libname mine odbc dsn='mydsn' schema=dbo bulkload=yes dbmax_text=32000;
data answer;
set mine.Table1;
Where module="Module 1";
run;
I would expect it to run faster if the column "module" had an index on it compared to it not having an index on it. In my own testing however, I cannot determine a noticable difference in the run speed, but that might be due to the fact that the table only has ~600,000 rows on it.
I am just curious if Base SAS actually takes indexes into consideration when pulled through a libname statement, and or how the libname statements treats these!
Thanks!
Brandon
If you have access to your DBMS-specific query tool (not SAS, but a stand-alone query tool), you might try running the query there. Most tools have ways to tell you the query plan that will tell you whether it will use an index or indexes, and which ones. I've worked with both Oracle and Microsoft SQL Server, and whenever I had a question like yours, I'd use the query tool to check the query plan and find out if the index was being used.
Does changing the message level help in this case? I know it will for SAS tables, but not sure about pass through as well.
options msglevel=i;
Hey Reeza, thanks for your help. This actually didn't change anything in the log when I specified this option. Not sure if it was supposed to tho (this is the first time i've ever heard of the msglevel option, and the online info on it wasn't totally clear in my opinion).
You didn't include the dbindex= option thus probably never got the index. When I googled it I found the following v8 description: ODBC Chapter, First Edition : LIBNAME Statement: ODBC Specifics
What run speed changes did you expect to find? Most of the speed would be limited by the speed that SAS can pull the data from the database and write it to the disk. How long it takes the underlying database to locate the records would probably be swamped by those other factors.
Have you tried using a libname that does not specify bulkload? There may be an interaction between this option and the data read. Bulkload is only of use in specified conditions when writing to an external table, and probably suspends indexing while loading. It is at best redundant when reading data. If you need it for writing to the table I would suggest using a different libname for the load.
The external database optimiser may also have an impact on whether the index is used. If the number of records that satisfy your where condition is above a certain threshold the engine may decide it is more efficient to extract the whole table than to use the index to hunt for particular rows.
Richard temporarily back in Oz
There are options to dump to the log the SQL code which SAS uses:
options sastrace=",,,d" sastraceloc=saslog;
(Yes, that's 3 commas followed by the letter d.)
This should show the exact SQL code which SAS passes through to the database, which may very well include indexing statements.
Hello Tom. This appears to be a general pattern in responses that I recieve (mostly from you, but it also appears to be a pattern online). Can you elaborate more onto what are the major factors that influence how quickly SAS can pull the data and write it to disk? Is this primarily memory restricted, network restricted, etc?
I know for a fact my companys network is...... lets go with "not great" haha. Furthermore I am only working on a machine with 8 Gigs of ram total.
Also, when I run this code andrew I get the sql code that is run, (it's a simple select X,X1,X3 from Y1 where Z1="") however how would that help in knowning if it is using the index correctly? I guess technically since I am pulling from a column that has an Index on it it "should" be using it, but I am not sure that it is.
Thanks again everyone!
"how would that help in knowning if it is using the index correctly? I guess technically since I am pulling from a column that has an Index on it it "should" be using it, but I am not sure that it is."
By knowing what query is passed, you can make an educated guess about whether an index would be used. Since the parameter is being correctly passed to the DBMS, you know that at this point the DBMS query engine has the info it needs to potentially use the index. Now it's up to the query engine to decide whether or not to use it. You can read entire books on indexing and optimization.
There are many queries that SAS has a hard time converting into SQL that can be processed by the DBMS -- in those cases, SAS pulls down the entire contents of the table and processes the query locally. You want to avoid this, except on small tables where it doesn't really matter.
BTW, IMHO text strings don't make great indexes unless they are very highly selective -- the usual rule of thumb is indexes are useful if they return less than ~15% of the data, but with a text string, I'd use a rule of thumb more like less than 5%. So something like a person's last name is good, but something like college major where you have about 30 different values, that's not so great. This is starting to dip into that huge topic of indexing.
If you have access to your DBMS-specific query tool (not SAS, but a stand-alone query tool), you might try running the query there. Most tools have ways to tell you the query plan that will tell you whether it will use an index or indexes, and which ones. I've worked with both Oracle and Microsoft SQL Server, and whenever I had a question like yours, I'd use the query tool to check the query plan and find out if the index was being used.
Haha good timing, I just read your previous post after this one, but we're saying the exact same thing.
That is kinda what I was expecting the answer to be. I'm going to have to assume it is using the Index then because the exact same sql query does use the index so that's how I'll have to treat it moving forward.
Also, the index does break the data into about ~1-2% distinct groups (there are over 100 distinct text values, of differing sizes) so it actually does dramatically speed up queries. Also it's the only logical breakdown in the data above the indivdual Loan Level, which I can't filter on by definition.
Thanks for the advice however!
Yep, I'd say that's a safe assumption that it is using the index. If it's slow, then you have to figure other factors like network speed are causing the issue.
This is a long shot, but I'll try it anyway. If you are using select * instead of listing the columns you need, you might get a huge improvement listing the columns. If network speed is your bottleneck, you could get a huge improvement by virtue of sending less data over the wire.
After pressing "Add Reply" I realized I should look at your original query, and I see you are using a datastep with no keep and no drop.
So I strongly recommend that you add a keep statement and add just the columns you need. This can produce a huge improvement, particularly if the table contains a lot of variables, or if the table contains one HUGE text field that you never use (my company has a lot of cases like that).
To have an effect, the keep statement needs to be a dataset option like this:
data answer;
set mine.Table1 (keep=module myvar1 myvar2 myvar3);
Where module="Module 1";
run;
Hello again smiller!
This is actually just a theoritical problem calling an imaginary table. In practice I only keep the columns I need related to the query, I was just curious as to if there was some kind of log or variable created within the SAS system that said "index XYZ was accessed" etc..
Also, I KNOW that network speed is my largest bottle neck by a simple experiment. We created a VM that was in the same location (as in Geographic location) as the servers the databases were stored on, and ran the same query, and it took 4 seconds. Over our network it took 5+ minutes.
So yeah, this was more of a theory craft question not related to performance in general, but how the inner workings of SAS work (many of my questions are).
My comment about the performance not being distriguishable was tested on the VM, and it took 4.01 seconds to 4.03 seconds (averages) but a t tests on the multiple tests showed the means weren't statistically different. Anyways, i'm pretty confident that the 4 seconds it is taking is again a bottle neck on the network (since techincally it still has to go over the network on the vm).
Anyways, your solution of just running the query in sql server using the index tools is what I plan on doing going forward.
It depends on several things
I don't think that SAS decides weather to use index on a DBMS table or not. In my opnion it's a decision made by the DBMS. ODBC is great, especially if you use more than one DBMS, but my experience is that it is difficult to "fine-tune" your I/O using ODBC. In order to do so you should use a DBMS specific engine instead.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
