Is it possible to make proc SQL use the first two variables of a six variable composite index when making a left join?
I've tried the IDXWHERE and IDXNAME options and they don't work. If I use a data step with MERGE and BY, the index is used, but not with proc SQL. Proc SQL only uses the index when I'm joining tables by all six variables of the composite index; if I'm only using the first two, it doesn't.
It would be very helpful if that were possible, since the two tables are N-N, one of the tables already has a composite index containing the two variables I'm joining by, and it is already an extremely large dataset as is.
If you use the system option MSGLEVEL=I, the log might reveal which index is being selected.
I know a WHERE clause would take advantage of the leading variables of a composite index. Can't immediately imagine why a join would not.
Have you used the PROC SQL option _METHOD?
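For reference, both diagnostics can be sketched like this (table and variable names here are placeholders, not from the actual process):

```sas
/* MSGLEVEL=I prints index-usage notes in the log */
options msglevel=i;

/* _METHOD prints the query plan chosen by the SQL optimizer */
proc sql _method;
  create table joined as
  select a.*, b.extra_var
  from big a
  left join small b
    on a.key1 = b.key1 and a.key2 = b.key2;
quit;
```

In the _method tree, codes such as SQXJM (merge join), SQXJNDX (index join), and SQXJHSH (hash join) show which join strategy was picked.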
I'm using the option MSGLEVEL=I; that is how I know the index isn't being used.
I also used the _METHOD option to see the steps SQL is doing. It is using SQXJM (sort-merge join) and not SQXJNDX (index join) as I wanted it to.
Bear in mind that the use of an index is not always the best method to join data.
Indexes are quite good when you want to retrieve a small set of rows from a very large table.
Merging two big sets of data through an index may be disastrous, or at least less efficient (it actually depends on the type of data and how sequentially it appears in the table), than a sort-merge join or any other method used by the SQL optimizer.
Besides that, there are some situations where the SQL optimizer cannot use any indexes.
You should check the method tree produced by the SQL optimizer.
In this case I think it is, as I'm joining a 10GB indexed table against one which isn't even 1MB. Since this process is supposed to run every day, it would be a huge performance improvement.
It would be very useful, not only for this process but for all the processes that run daily here at Tranquilidade on our machine, if the left join could use the two variables of the composite index, as I'm sure you perfectly know!
I'll take a close look at your papers. Most of them I've already read, but maybe there is one I haven't that has the solution. I think this is a problem of the SQL optimizer not realizing there is a composite index whose first two variables are the ones being joined by, not even with the IDXWHERE or IDXNAME options.
Greetings from Portugal,
One question is why you have a composite index containing six columns? I can barely see a situation in which you could make efficient use of such an index. Is it because you need a unique index for data integrity purposes? I can't remember any situation where I needed more than three columns in an index for query optimization.
Even if you need the original index, investigate how much overhead a second index would cost. It would probably be quite small compared to the original one. If you are using the SPDE or SPDS engine, there should be very little overhead during updates.
It is basically due to population processes. Every time I insert a new record into the data table, which contains dozens of other variables, I have to check whether that record already exists and change a few variables in the pre-existing record. This is done with thousands of records on a daily basis against hundreds of mother tables of at least 3GB of data each. I also need the table optimized so that users with no coding experience can search the data by unique keys with the proper software.
Of course, one solution is to create a new composite index containing only those two variables. The only problem with that is an increase of 1GB in the index size.
I was trying to avoid that solution if there was a possibility of SQL using the index that already exists, just as the data step merge does.
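The two-column index alternative being discussed would look something like this (library, table, and variable names are assumptions):

```sas
proc datasets library=mylib nolist;
  modify bigtable;
  /* composite index on just the two join keys */
  index create key12 = (key1 key2);
quit;
```

This is the index the optimizer could then match directly against a two-variable join, at the cost of the extra disk space mentioned above.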
Are we talking of a Slowly Changing Dimensions process, or are you trying to update some kind of fact table measures?
Either way, with that volume of data I would forget any SQL implementation.
A data step will give you total control of the process.
SQL has its caveats (as SAS does); one of them may be the SQL optimizer.
If we're talking about a huge size table vs a tiny one scenario, have you tried the hash approach?
But there are many other solutions, like merging by formats, or even performing an explicit index match as you wish. And yes, the N-to-N relation can actually be handled with a data step.
And hello to the nice people at Tranquilidade.
Been there myself a few years ago.
Neither one of them. I need to join the small dataset with the large dataset in order to get a few variables from that large dataset. I'm using a pre-defined macro that uses a proc SQL left join to do the Cartesian product. If the variables I'm joining by are all of the index variables, proc SQL uses the index to make that join; if they are only some of the index variables, it doesn't, as I previously said. The goal here is to get proc SQL to use a subset of the index variables if possible. I'm starting to feel that it isn't.
I think the hash approach is not possible to do directly, since I've got duplicate keys in the small dataset. I can first eliminate the duplicates, use the hash approach to reduce the result dataset, and finally use proc SQL to do the Cartesian product with the original small dataset. Even if efficient, this approach is only good for this particular case. I've tried that approach globally for all the processes that use those macros, and most of the time it's more efficient to just use proc SQL directly for the Cartesian product instead of those three steps.
About handling it with a data step, are you talking about the _iorc_, _dsenom, etc. approach? That is only possible if there is an index on those two variables I'm joining by, is it not?
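The _iorc_ lookup being referred to can be sketched like this, assuming the large table has an index named `key12` on the two join variables (all names are illustrative, and the small table drives the lookup):

```sas
data matched;
  set small;                       /* one pass through the small table */
  set big key=key12;               /* direct index lookup on (key1 key2) */
  if _iorc_ = 0 then output;       /* _iorc_ = 0 means the key was found */
  else _error_ = 0;                /* suppress the "no match" error flag */
run;
```

As noted, this relies on an index over exactly the variables being looked up; without one, the KEY= lookup is not available.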
Format merging... I've seen something like that in code somewhere in our repository, probably your code, but I'm not very familiar with it!
Not knowing the particular details of your process, I would say that if, and only if, you are dealing with a small dataset vs. a huge dataset, a format match will probably give you the best result, and without the need for complex coding.
Basically, I'm suggesting loading the small dataset into a format and performing a single pass through the huge dataset, matching the key with the format and outputting to a third dataset if the match is successful.
Technically, the basis of this technique is similar to hashing: it involves a single pass through one of the tables while the other resides in memory.
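A minimal sketch of the format-match idea, assuming the small table has been deduplicated on the key and using placeholder names throughout (the two key variables are concatenated into one format value):

```sas
/* Build a character format control dataset from the small table */
data fmt;
  set small end=last;
  retain fmtname 'inkey' type 'c';
  start = catx('|', key1, key2);   /* combined two-variable key */
  label = 'Y';
  output;
  if last then do;
    hlo = 'O';                     /* 'other' range catches non-matches */
    label = 'N';
    output;
  end;
  keep fmtname type start label hlo;
run;

proc format cntlin=fmt;
run;

/* Single pass through the huge table, keeping only matching rows */
data subset;
  set big;
  if put(catx('|', key1, key2), $inkey.) = 'Y';
run;
```

Duplicate keys in the small table would make PROC FORMAT reject overlapping ranges, which is why deduplication (or the hash/index variants below in the thread) matters for the N-to-N case.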
And although hashing with duplicate keys is natively supported in SAS 9.2, an implementation is also possible in 9.1, through the hash iterator and/or using multiple hash objects.
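In 9.2 the duplicate keys can be handled with the hash object's MULTIDATA option; a sketch under the same assumed names, loading the small table into memory and scanning the big one:

```sas
data joined;
  if _n_ = 1 then do;
    if 0 then set small;           /* define small's variables in the PDV */
    declare hash h(dataset:'small', multidata:'y');
    h.defineKey('key1', 'key2');
    h.defineData(all:'y');
    h.defineDone();
  end;
  set big;                         /* single pass over the huge table */
  rc = h.find();                   /* first row carrying this key */
  do while (rc = 0);
    output;                        /* one output row per duplicate key */
    rc = h.find_next();            /* next row with the same key */
  end;
  drop rc;
run;
```

The find/find_next loop is what reproduces the N-to-N behaviour that a plain h.find() lookup cannot.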
An index match would also be possible, but I'm not talking about using an index on the large table; instead I would use an index on the smaller table. And yes, there is actually a technique that permits an index merge with N-to-N tables (by purposely failing the match to reset the index).
But no need here for either of them; I think good old format match will suffice.
If you would care about considering this approach, I'm sure you have there some good fellows who could help you with the concepts of format match.
Depending on the size of the small table, you might achieve a hash join in SQL.
Paul Kent's TS note about SQL, iirc, describes how the BUFFERSIZE option increases the size of the small table that can be accommodated: http://support.sas.com/techsup/technote/ts553.html
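As that note describes, raising PROC SQL's BUFFERSIZE= value lets a larger "small" table stay in memory, making the optimizer eligible to pick an in-memory hash join; a hedged sketch with the same placeholder names as above:

```sas
/* A larger buffer makes the small table eligible for a hash join,
   which would show up as SQXJHSH in the _method tree */
proc sql _method buffersize=1048576;
  create table joined as
  select a.*, b.extra_var
  from big a
  left join small b
    on a.key1 = b.key1 and a.key2 = b.key2;
quit;
```

Whether the hash join is actually chosen still depends on the optimizer, so checking the _method output after changing the buffer is the way to confirm it.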