Hello,
we have a quite large dataset which we generate in SAS but need to store as an MS SQL table. The dataset has 39 columns and around 140 million rows. Generating it in SAS takes around 40 minutes, which is long but manageable. Loading it into SQL via ODBC bulkload, however, takes close to 3 hours. We currently drop the SQL table and create it fresh via proc sql (drop table/create table). Since most of the data does not change from run to run, could performance be gained by doing an incremental update? If so, how would one approach this? Running proc compare is likely not an option.
Hi,
It may be quicker to create your delta in SAS and then apply it to the MS SQL table, instead of always dropping and rebuilding the entire table.
You can see an example of update/insert code here: Solved: SQL Server pass through (merge, update, insert) - SAS Support Communities
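For illustration, a minimal sketch of that pattern. All names are hypothetical (an ODBC libref mssql and DSN mySqlServer, a staging table stg_delta, a target dbo.big_table keyed on id, columns col1/col2); your current bulkload path is assumed to work for the staging upload as well:

proc append base=mssql.stg_delta (bulkload=yes)  /* reuse the bulk-load path, but only for the small delta */
            data=work.delta;
run;

proc sql;
  connect to odbc (dsn=mySqlServer);
  /* Update rows that already exist in the target */
  execute (
    update t
       set t.col1 = s.col1, t.col2 = s.col2
      from dbo.big_table t
      join dbo.stg_delta s on s.id = t.id
  ) by odbc;
  /* Insert rows that are new */
  execute (
    insert into dbo.big_table (id, col1, col2)
    select s.id, s.col1, s.col2
      from dbo.stg_delta s
     where not exists (select 1 from dbo.big_table t where t.id = s.id)
  ) by odbc;
  disconnect from odbc;
quit;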
Regards
There are several things that can affect your performance:
- SQL Server resources
- SAS server resources
- Bandwidth
- Method of detecting changes. If you have or can create an update timestamp or similar, I would upload the subset to SQL Server and perform UPDATE/INSERT or MERGE there (see the sketch below). Make sure you have relevant indexes for your key.
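As a hedged sketch of what that server-side step could look like (DSN, table, key and column names are all hypothetical, and the delta subset is assumed to be staged in dbo.stg_delta already):

proc sql;
  connect to odbc (dsn=mySqlServer);
  /* Index the staging table on the key so the MERGE can seek instead of scan */
  execute (
    create index ix_stg_delta_id on dbo.stg_delta (id)
  ) by odbc;
  execute (
    merge dbo.big_table as t
    using dbo.stg_delta as s
       on t.id = s.id
    when matched then
      update set t.col1 = s.col1, t.col2 = s.col2
    when not matched then
      insert (id, col1, col2) values (s.id, s.col1, s.col2);
  ) by odbc;
  disconnect from odbc;
quit;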
Is it possible to do the entire operation through your ODBC connection to SQL Server, using execute() statements to submit the native T-SQL? If so, that will almost certainly be your fastest option, as it avoids any moving of data back and forth.
Can you please answer the questions below? We need to know in order to propose something realistic.
1. How much memory is available to your SAS session? Please share the log output of the following:
proc options group=memory;
run;
2. Do you already know the delta (update/insert) on the SAS side or do you have to work it out by comparing with the SQL table?
3. Are there also deletes? And do you know on the SAS side or do you have to work it out by comparing to the SQL table?
4. Is your SAS table already sorted by a unique key?
5. What's the size of the unique key? If it spans multiple columns, just the sum of the lengths of the SAS variables that constitute the key.
6. Are you allowed to store SAS tables permanently on disk or must all permanent tables reside on the SQL server side?
Group=MEMORY
SORTSIZE=2147483648    Specifies the amount of memory that is available to the SORT procedure.
SUMSIZE=0              Specifies a limit on the amount of memory that is available for data summarization procedures when class variables are active.
MAXMEMQUERY=0          Specifies the maximum amount of memory that is allocated for procedures.
MEMBLKSZ=16777216      Specifies the memory block size for Windows memory-based libraries.
MEMMAXSZ=2147483648    Specifies the maximum amount of memory to allocate for using memory-based libraries.
LOADMEMSIZE=0          Specifies a suggested amount of memory that is needed for executable programs loaded by SAS.
MEMSIZE=4294967296     Specifies the limit on the amount of virtual memory that can be used during a SAS session.
REALMEMSIZE=0          Specifies the amount of real memory SAS can expect to allocate.
We can probably request a bit more memory if needed; I believe the VM has 64 GB.
I do not know the delta; I would have to work it out by comparing with SQL. Alternatively, I could compare with the dataset stored in the permanent lib which is currently getting copied to MSSQL, but I would need to generate the update into another lib or rename it.
Deletes are very infrequent but cannot be excluded entirely.
Sorting by unique key is not implemented yet but could probably be done without too much effort. The dataset is partially sorted already due to several merge statements.
The unique key would add up to 195 characters at the moment; I guess some sort of hash might help reduce this.
I am currently storing the final dataset on disk. The current process is to generate the final dataset in SAS, store it in a permanent "staging" library, and subsequently copy it to MSSQL.
The overwhelming majority of changes are inserts. Updates are few and far between, and deletes are even rarer than that.
Regarding the size, as an example, the latest update would be approx. 2M new rows.
Storage in SQL is done for sharing with others, yes. Adding new licenses is not an option unfortunately.
If your source of truth is a SQL table and you don't already get the info for inserts, updates and deletes from the source, then you need to work it out by comparing your newest data to what's already in SQL.
One way for full load could be:
In SAS, create and populate a hash/digest value over all columns that should trigger an update if they change, and add this new column to your SQL table.
On the next day, download the primary key columns and the hash column from the SQL table to SAS and compare them to your new data, where you also create this hash (a sketch of the digest step follows after the classification rules below).
Inserts: Primary key exists in the new SAS table but not in the SQL table
Updates: Primary key exists in both tables but hash values differ
Deletes: Primary key exists in the SQL table but not in the new SAS table
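A minimal sketch of the digest step, assuming SAS 9.4M6 or later for the HASHING function (on older releases, MD5 or SHA256 with a PUT to hex would do); dataset and column names are hypothetical:

data work.new_with_digest;
  set work.final_dataset;  /* the freshly generated 140M-row table */
  length row_digest $64;
  /* Hex-encoded SHA-256 over every column that should trigger an update when it changes; */
  /* the '|' separator avoids ambiguity when adjacent values are concatenated */
  row_digest = hashing('SHA256', catx('|', col1, col2, col3));
run;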
Given what you describe, here is what I'd likely do next to keep the coding simple:
Deletes are slow, but given your statement that deletes and updates are rather rare, keeping the code simple for the updates is likely worth it.
The reason I asked for the size of the primary key column(s) and the available memory:
Given the size of your data you will want to avoid sorting. If the primary key and the digest value fit into memory, then you could identify the delta (I, U, D) within a single data step, using a hash lookup table and without the need for the SAS table to be sorted. The hash lookup table would contain the primary key column(s) and the digest value loaded from SQL.
...and you could of course also create a hash value for the primary key columns should they be too wide. Having said that: with your volumes I would not use md5 to hash the primary key but at least sha1. Collisions are rare, but I've seen one happen with md5, and with a lower data volume than what you've got.
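Putting it together, a minimal sketch of that single data step, with hypothetical names throughout: the SQL-side key/digest pairs are assumed to be downloaded to work.sql_keys_digest, the new data with its freshly computed digest sits in work.new_with_digest, and the wide primary key has been reduced to one hashed column key_hash on both sides:

data work.inserts work.updates;
  if _n_ = 1 then do;
    /* Load key + digest of the current SQL table into an in-memory lookup */
    declare hash sql_side (dataset: 'work.sql_keys_digest');
    sql_side.defineKey('key_hash');
    sql_side.defineData('key_hash', 'row_digest');
    sql_side.defineDone();
  end;
  set work.new_with_digest (rename=(row_digest=new_digest)) end=eof;
  length row_digest $64;
  call missing(row_digest);
  if sql_side.find() ne 0 then output work.inserts;            /* key not in SQL: insert */
  else do;
    if row_digest ne new_digest then output work.updates;      /* key found, digest differs: update */
    rc = sql_side.remove();  /* matched keys drop out of the lookup */
  end;
  /* Whatever is left in the lookup exists only on the SQL side: those are the deletes */
  if eof then rc = sql_side.output(dataset: 'work.deletes');
  drop rc;
run;

The remove()/output() trick avoids a second pass for the deletes, and with a delta of roughly 2M rows the three output tables stay small enough to upload and apply via pass-through as sketched earlier in the thread. Note the "fits into memory" assumption above: with a 40-byte sha1 key hash plus a 64-byte digest, 140 million entries need on the order of 15-20 GB before hash overhead, so MEMSIZE would have to be raised well above the current 4 GB.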