js5
Pyrite | Level 9

Hello,

 

We have a fairly large dataset which we generate in SAS but need to store as an MS SQL table. The dataset has 39 columns and around 140 million rows. Generating it in SAS takes around 40 minutes, which is long but manageable. Loading it to SQL via ODBC bulk load, however, takes close to 3 hours. We are currently dropping the SQL table and creating it fresh via PROC SQL drop table/create table. Since most of the data does not change from run to run, could performance be gained by doing an incremental update? If so, how would one approach this? Running PROC COMPARE is unlikely to be an option.

9 REPLIES
MrSantos
SAS Employee

Hi,

 

It may be quicker to create your delta in SAS and then apply it to the MS SQL table, instead of dropping and rebuilding the entire table every time.

You can see an example of update/insert code here: Solved: SQL Server pass through (merge, update, insert) - SAS Support Communities

 

Regards

LinusH
Tourmaline | Level 20

There are several things that can affect your performance:

- SQL Server resources

- SAS server resources

- Bandwidth

- Method of detecting changes. If you have or can create an update timestamp or similar, I would upload just that subset to SQL Server and perform an UPDATE/INSERT or MERGE there (rough sketch below). Make sure you have relevant indexes for your key.
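
A rough sketch of that pattern, assuming an existing ODBC libname SQLLIB with bulk load available, an update timestamp column UPDATE_TS, and placeholder table/column names (adjust everything to your environment):

/* 1. Subset only the rows changed since the previous load;              */
/*    &last_load_dt. is assumed to hold a datetime constant,             */
/*    e.g. '01JAN2025:00:00:00'dt                                        */
data work.delta;
   set work.big_sas;
   where update_ts > &last_load_dt.;
run;

/* 2. Bulk-load the (much smaller) delta into a staging table */
proc append base=sqllib.stage_delta (bulkload=yes) data=work.delta;
run;

/* 3. Apply the delta to the target with native T-SQL */
proc sql;
   connect using sqllib;
   execute (
      merge into dbo.target as t
      using dbo.stage_delta as s
         on t.key_col = s.key_col
      when matched then
         update set t.col1 = s.col1, t.col2 = s.col2
      when not matched then
         insert (key_col, col1, col2) values (s.key_col, s.col1, s.col2);
   ) by sqllib;
   disconnect from sqllib;
quit;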

Data never sleeps
quickbluefish
Barite | Level 11

Is it possible to do the entire operation through your ODBC connection to SQL Server, using EXECUTE() statements to submit the native T-SQL? If so, that will almost certainly be your fastest option, as it avoids moving data back and forth.
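
For reference, the explicit pass-through pattern looks roughly like this (the DSN and object names are placeholders); any native T-SQL can be submitted this way, for example building the key index mentioned above:

proc sql;
   connect to odbc as mssql (dsn="YourDsn");

   /* runs unchanged on the SQL Server side, no data movement */
   execute (
      create index ix_target_key on dbo.target (key_col)
   ) by mssql;

   disconnect from mssql;
quit;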

Patrick
Opal | Level 21

Can you please answer the questions below? We need to know in order to propose something realistic.

1. How much memory is available to your SAS session? Please share the log output of the step below.

proc options group=memory;
run;

2. Do you already know the delta (update/insert) on the SAS side or do you have to work it out by comparing with the SQL table?

3. Are there also deletes? And if so, do you know them on the SAS side or do you have to work them out by comparing to the SQL table?

4. Is your SAS table already sorted by a unique key?

5. What's the size of the unique key? If it spans multiple columns, the sum of the lengths of the SAS variables that constitute the key.

6. Are you allowed to store SAS tables permanently on disk or must all permanent tables reside on the SQL server side?

 

 

 

js5
Pyrite | Level 9

Group=MEMORY
SORTSIZE=2147483648    Specifies the amount of memory that is available to the SORT procedure.
SUMSIZE=0              Specifies a limit on the amount of memory that is available for data summarization procedures when class variables are active.
MAXMEMQUERY=0          Specifies the maximum amount of memory that is allocated for procedures.
MEMBLKSZ=16777216      Specifies the memory block size for Windows memory-based libraries.
MEMMAXSZ=2147483648    Specifies the maximum amount of memory to allocate for using memory-based libraries.
LOADMEMSIZE=0          Specifies a suggested amount of memory that is needed for executable programs loaded by SAS.
MEMSIZE=4294967296     Specifies the limit on the amount of virtual memory that can be used during a SAS session.
REALMEMSIZE=0          Specifies the amount of real memory SAS can expect to allocate.

We can probably request a bit more memory if needed; I believe the VM has 64 GB.

I do not know the delta; I would have to compare against SQL. Alternatively, I could compare against the dataset stored in the permanent lib which currently gets copied to MSSQL, but I would need to generate the update into another lib or rename it.

Deletes are very infrequent but cannot be excluded entirely.

Sorting by unique key is not implemented yet but could probably be done without too much effort. The dataset is partially sorted already due to several merge statements.

The unique key would add up to 195 characters at the moment; I guess some sort of hash might help reduce this.

I am currently storing the final dataset on disk. The current process is to generate the final dataset in SAS, store it in a permanent "staging" library, and subsequently copy it to MSSQL.

LinusH
Tourmaline | Level 20
Have you assessed the percentage of rows that usually gets updated? If it's too high, an incremental update might not be much faster.
Do you store the data in SQL Server to make it available to a wider audience? If you have (or could add) a license for SHARE*NET, you could make your SAS dataset available directly to users via ODBC.
Another option, if you have access to a Hadoop cluster, would be to load the file there and have SQL Server use it as an external table.
Data never sleeps
js5
Pyrite | Level 9

The overwhelming majority of changes are inserts. Updates are few and far between, and deletes are even rarer than that.

Regarding the size, as an example, the latest update would be approx. 2M new rows.

Storage in SQL is done for sharing with others, yes. Adding new licenses is not an option unfortunately.

LinusH
Tourmaline | Level 20
Then this should be feasible. The inserts can use bulk-load as well.
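
For instance, something along these lines, assuming an ODBC libname SQLLIB and that the insert rows are flagged with a (placeholder) ACTION column:

/* isolate the new rows */
data work.delta_inserts;
   set work.delta(where=(action='I'));
   drop action;
run;

/* bulk-load them into the existing target table;                       */
/* BULKLOAD= behaviour depends on your SAS/ACCESS engine and driver     */
proc append base=sqllib.target (bulkload=yes)
            data=work.delta_inserts;
run;
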
Data never sleeps
Patrick
Opal | Level 21

If your source of truth is a SQL table and you don't already get the info for inserts, updates and deletes from the source, then you need to work it out by comparing your newest data to what's already in SQL.

One way for the full load could be:

In SAS, create and populate a hash/digest value over all columns that require an update if they change, and add this new column to your SQL table.

On the next day, download the primary key columns and the hash column from the SQL table to SAS and compare them to your new data (where you also create this hash).

Inserts: Primary key exists in the new SAS table but not in the SQL table

Updates: Primary key exists in both tables but the hash values differ

Deletes: Primary key exists in the SQL table but not in the new SAS table
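
A minimal sketch of creating such a digest on the SAS side (dataset and column names are placeholders; on recent SAS 9.4 releases you could use SHA256() or HASHING() instead of MD5()):

data work.new_with_digest;
   length row_digest $32;
   set work.final_dataset;
   /* CATX converts numerics implicitly; wrap them in PUT() with fixed  */
   /* formats if you need byte-for-byte stable values                   */
   row_digest = put(md5(catx('|', col1, col2, col3)), $hex32.);
run;

The same expression, over the same columns in the same order, has to be used every time the digest is (re)computed.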

 

Given what you describe, here is what I'd likely do next to keep the coding simple:

  1. Add to the SAS table a column "action" with values I, U or D
  2. Upload (bulk or not) the data with an action into a SQL staging table
  3. Delete from your SQL table all rows whose primary key matches a staging-table row with action D or U
  4. Insert into your SQL table all rows from the staging table with action U or I

Deletes are slow, but given your statement that deletes and updates are rather rare, keeping the code simple for the updates is likely worth it.
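
Steps 3 and 4 could look roughly like this with explicit pass-through, assuming the delta (including its ACTION column) has already been bulk-loaded into dbo.stage_delta and that SQLLIB is an existing ODBC libname (all names are placeholders):

proc sql;
   connect using sqllib;

   /* Step 3: remove rows that will be replaced or deleted */
   execute (
      delete t
      from dbo.target as t
      inner join dbo.stage_delta as s
         on t.key_col = s.key_col
      where s.action in ('D','U');
   ) by sqllib;

   /* Step 4: insert the new and the updated rows */
   execute (
      insert into dbo.target (key_col, col1, col2)
      select key_col, col1, col2
      from dbo.stage_delta
      where action in ('U','I');
   ) by sqllib;

   disconnect from sqllib;
quit;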

 

The reason why I've asked for the size of the Primary key column(s) and the available memory:

Given the size of your data you will want to avoid sorting. If the primary key and the digest value fit into memory, you can identify the delta (I, U, D) within a single DATA step using a hash lookup table, without needing the SAS table to be sorted. The hash lookup table would contain the primary key column(s) and the digest value loaded from SQL.

...and you could of course also create a hash value for the primary key columns should they be too wide. Having said that: with your volumes I would not use MD5 to hash the primary key but at least SHA-1. Collisions are rare, but I've seen one happen with MD5, and with a lower data volume than what you've got.
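
To make that concrete, a sketch of such a DATA step, assuming the key + digest downloaded from SQL Server sit in WORK.SQL_KEYS (columns KEY_HASH and ROW_DIGEST) and the new data with its freshly computed digest in WORK.NEW_WITH_DIGEST (all names are placeholders; KEY_HASH could be e.g. put(sha256(catx('|', key1, key2)), $hex64.), computed identically on both sides):

data work.delta(drop=sql_digest rc);
   length action $1 sql_digest $32;

   if _n_ = 1 then do;
      /* key + digest from SQL Server; the digest is renamed so it does */
      /* not collide with the digest computed on the new data           */
      declare hash h(dataset:"work.sql_keys(rename=(row_digest=sql_digest))");
      h.defineKey('key_hash');
      h.defineData('key_hash', 'sql_digest');
      h.defineDone();
      call missing(sql_digest);
   end;

   set work.new_with_digest end=last;

   if h.find() ne 0 then do;                 /* key not in SQL: insert  */
      action = 'I';
      output;
   end;
   else do;
      if row_digest ne sql_digest then do;   /* key exists, row changed */
         action = 'U';
         output;
      end;
      rc = h.remove();  /* mark key as seen; leftovers are delete candidates */
   end;

   /* keys still in the hash exist in SQL but not in the new data */
   if last then rc = h.output(dataset:'work.delete_keys');
run;

The whole lookup (key hash plus digest for ~140M rows) has to fit within MEMSIZE, which is why the memory question matters.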

 

 
