I'm having a hard time understanding SAS behaviour when loading indexed tables.
I've got an empty indexed master table into which I need to load 35GB of data.
Every row of incoming data may update a previously loaded row, so it's essential that I can quickly look up and update records already loaded.
To my knowledge, the best way to do this is to index the master table by the lookup key and use KEY= lookups to do the job.
The problem is that this load took no less than 14 hours to run! The more records the master table holds, the slower the loading becomes; by the end, the process was loading less than 1MB per second. It almost looks like no indexes are being used.
After the load completes, the index is 8GB and the master table 28GB.
The code I'm using is the following:
data lib1.master;
  set lib2.indata(rename=(var1=in_var1 var2=in_var2 ... varn=in_varn));
  modify lib1.master key=master_key / unique;
  select (_iorc_);
    when (%SYSRC(_SOK)) do;
      if inst = 'D' then remove;
      else replace;
    end;
    when (%SYSRC(_DSENOM)) do; _error_ = 0; output; end;
    otherwise put 'ERR' 'OR: Unknown IO condition ' _iorc_=;
  end;
run;
Is there something wrong with this code, or is my problem elsewhere?
The process runs on an HP-UX server with five 1.8GHz CPUs and 30GB of memory; the input data comes from one RAID 10 device and the master table sits on a different RAID 10 device.
The OS is a HP-UX 11.31 VSE OE.
Any help on this would be very welcome, as it is a huge bottleneck in our processing.
I've not used SAS indexes, but it looks like you are building the index on lib1.master while you are building the data set itself. In Oracle and Sybase, this creates an "index" that is essentially a sequential table of key/pointer pairs and gets quite slow: when you add a record to an indexed table, the index entry for that record is appended to the end of the index area and then searched sequentially. That is why the database products recommend that the DBA drop and rebuild indexes periodically. This note might help: http://support.sas.com/kb/23/196.html
In the end, however, it may well be quicker to sort the data and build the output file sequentially.
To understand your performance problem, you need to know that when you update an indexed table, the index must be modified on every insert or delete to stay up-to-date.
Since you are starting with an empty table, you are actually modifying the index more times than there are rows in the entire table! I am not surprised performance is so slow. It is standard practice for DBAs (Oracle, Sybase, etc.) doing a complete table load to drop the indexes first, run the load, and then recreate the indexes.
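In SAS terms, that drop-load-recreate pattern might look roughly like this. This is only a sketch using the library, data set, and key names from your post, and it assumes master_key is a simple index on a like-named variable:

```sas
/* Drop the index so the bulk load pays no index-maintenance cost */
proc datasets library=lib1 nolist;
   modify master;
   index delete master_key;
quit;

/* Load the data sequentially, with no index to update */
proc append base=lib1.master data=lib2.indata;
run;

/* Recreate the index once, after all rows are in place */
proc datasets library=lib1 nolist;
   modify master;
   index create master_key / unique;
quit;
```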
I suggest you look at sorting the data in order of the key variable(s) and, within each key, from earliest to latest transaction. With some FIRST. and LAST. processing on a SET statement with a BY statement, I suspect you can get what you want much faster.
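As a sketch of that idea, assuming the transactions carry a timestamp variable (the hypothetical load_ts below) that orders them within each key:

```sas
/* Sort by key, then chronologically within each key */
proc sort data=lib2.indata out=work.sorted;
   by master_key load_ts;
run;

/* Keep only the last transaction per key; if that final
   transaction is a delete (inst = 'D'), emit nothing */
data lib1.master;
   set work.sorted;
   by master_key;
   if last.master_key and inst ne 'D';
run;
```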
I understand how 35GB of transaction data can become unmanageable, by which I mean impossible to sort in one go.
Rather than randomly updating the master, which starts empty, building and rebuilding indexes as you work through the transaction data, create your "master" in stages, from manageable (I mean sortable) sections of your input. You probably know what volume of transaction data is the manageable limit, so use a quarter of that as the "stage size".
Sort one stage of transactions into the order required by the master, then update the master from that stage without using or building any indexes.
After the last stage has updated the master, you can decide whether to create indexes, and how many.
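A single stage of that process can be sketched with the sequential UPDATE statement. The stage data set names here are hypothetical; both inputs must already be sorted by the key, and the master stays unindexed throughout:

```sas
/* Sort one manageable slice of the transactions */
proc sort data=work.stage1 out=work.stage1_sorted;
   by master_key;
run;

/* Apply the stage sequentially: UPDATE overwrites matching
   keys and adds unmatched rows, with no index involved */
data lib1.master;
   update lib1.master work.stage1_sorted;
   by master_key;
run;
```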
[Although indexes are great for retrieving data through joins and for WHERE subsetting, they are only good for updates in _very_ small volumes, where "small" relates to the amount of free memory available when your master index is loaded, along with the normal activity on your system.]
I really wasn't seeing the full picture, especially the part about the index being updated on every insert.
I now use the index to delete or update matching records; afterwards I drop all indexes, append the non-matching records, and rebuild the indexes.
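For reference, that hybrid flow can be sketched like this; work.matches and work.nonmatches are hypothetical placeholders for the pre-split transaction data:

```sas
/* 1. Update or delete rows whose keys already exist,
      via the existing index */
data lib1.master;
   set work.matches;
   modify lib1.master key=master_key / unique;
   if _iorc_ = %sysrc(_sok) then do;
      if inst = 'D' then remove;
      else replace;
   end;
run;

/* 2. Drop the index, then append the new rows sequentially */
proc datasets library=lib1 nolist;
   modify master;
   index delete master_key;
quit;

proc append base=lib1.master data=work.nonmatches;
run;

/* 3. Rebuild the index once at the end */
proc datasets library=lib1 nolist;
   modify master;
   index create master_key / unique;
quit;
```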
This solution has dropped my processing time to less than a fifth of the original, which is very acceptable considering this was the first load of the master table; subsequent loads will cover a much smaller range of data.
Thanks for all the tips.
Message was edited by: dropez