ScottBass
Rhodochrosite | Level 12

I have a process where I am:

 

  • reading 67M rows from SQL Server to SAS; transforming the data using SAS; writing 67M rows from SAS to SQL Server.
  • no rows are dropped or created in the transformation processing; it's just fixing some incorrect values in historic data.
  • the source and target tables have identical schema
  • the target table is a heap (no indexes, keys, constraints)

The read operation is taking approx. 5 minutes; the write operation is taking approx. 5 hours.  So, yeah 😬😱😠

 

Here are some excerpts from the log:

Libname statement:

27         %libname_sqlsvr(libref=PROD,server=MYSERVER,port=,database=MYDB,schema=prod)
=================================================================================
LIBREF:           PROD
CONNECT:          NOPROMPT="Driver={SQL Server Native Client 
10.0};Server=MYSERVER;Database=MYDB;Trusted_Connection=yes;"
INTERNAL OPTIONS: schema=prod bulkload=yes dbcommit=100000 direct_exe=delete preserve_names=yes
USER OPTIONS:
=================================================================================

ODBC: AUTOCOMMIT is NO for connection 3
MPRINT(LIBNAME_SQLSVR):   libname PROD odbc NOPROMPT="Driver={SQL Server Native Client 
10.0};Server=MYSERVER;Database=MYDB;Trusted_Connection=yes;" schema=prod bulkload=yes dbcommit=100000 
direct_exe=delete preserve_names=yes ;
NOTE: Libref PROD was successfully assigned as follows: 
      Engine:        ODBC 
      Physical Name:
      
Read data from SQL Server to SAS:

27         * Copy data from SQL Server ;
ODBC: AUTOCOMMIT is NO for connection 6
 
ODBC_28: Prepared: on connection 6
SELECT * FROM "prod"."RLDX_DEMOGRAPHIC_ORIG"
 
28         data workspde.RLDX_DEMOGRAPHIC;
29            set prod.RLDX_DEMOGRAPHIC_ORIG;
30         run;

 
ODBC_29: Executed: on connection 6
Prepared statement ODBC_28
 
NOTE: There were 67,766,170 observations read from the data set PROD.RLDX_DEMOGRAPHIC_ORIG.
NOTE: The data set WORKSPDE.RLDX_DEMOGRAPHIC has 67,766,170 observations and 34 variables.
NOTE: Compressing data set WORKSPDE.RLDX_DEMOGRAPHIC decreased size by 75.73 percent.
NOTE: DATA statement used (Total process time):
      real time           5:05.05
      user cpu time       6:15.26
      system cpu time     57.09 seconds
      memory              1401.65k
      OS Memory           22828.00k
      Timestamp           25/03/2020 07:17:50 AM

Write data from SAS to SQL Server:

27         proc append
 
ODBC_24: Prepared: on connection 6
SELECT * FROM "prod"."RLDX_DEMOGRAPHIC" WHERE 0=1 
 
ODBC: AUTOCOMMIT is NO for connection 7
ODBC: COMMIT performed on connection 7.
 
ODBC_25: Prepared: on connection 7
SELECT * FROM "prod"."RLDX_DEMOGRAPHIC"
 
28            base=prod.RLDX_DEMOGRAPHIC (
29               bulkload=yes bl_options=tablock
30            )
31            data=workspde.RLDX_DEMOGRAPHIC6 (
32               keep=
33         rldx_record_key
34         rldx_audit_key
35         recnum
36         stay_number
37         episode_sequence_number
38         hospital_type
39         source
40         hoscode
41         tfrhosp
42         transfrom
43         mrn
44         surname
45         givnames
46         sex
47         dob
48         age
49         address
50         locality
51         pcode
52         cob
53         admdate
54         sepdate
55         dthdate
56         notifdate
57         patientnum
58         supi
59         moh_auid
60         AHS_hosp
61         dc_extract_date
62         cob_sacc
63         state_of_usual_residence
64         rldx_valid_from
65         rldx_valid_to
66         rldx_current_ind
67            );
68         run;

NOTE: Appending WORKSPDE.RLDX_DEMOGRAPHIC6 to PROD.RLDX_DEMOGRAPHIC.
NOTE: There were 67,766,170 observations read from the data set WORKSPDE.RLDX_DEMOGRAPHIC6.
NOTE: 67766170 observations added.
NOTE: The data set PROD.RLDX_DEMOGRAPHIC has . observations and 34 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           5:34:15.35
      user cpu time       18:04.33
      system cpu time     37.90 seconds
      memory              634.87k
      OS Memory           21804.00k
      Timestamp           25/03/2020 03:34:13 AM

The write operation uses a TRUNCATE TABLE (not shown) followed by PROC APPEND, roughly as sketched below.
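
For reference, the write step looks roughly like this (a sketch only; the actual TRUNCATE code isn't shown in my log, and the connection string is the one from the libname above):

proc sql;
   connect to odbc (noprompt="Driver={SQL Server Native Client 10.0};Server=MYSERVER;Database=MYDB;Trusted_Connection=yes;");
   /* empty the heap before reloading it */
   execute ( truncate table prod.RLDX_DEMOGRAPHIC ) by odbc;
   disconnect from odbc;
quit;

proc append
   base=prod.RLDX_DEMOGRAPHIC (bulkload=yes bl_options=tablock)
   data=workspde.RLDX_DEMOGRAPHIC6;
run;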

 

I note this previous thread:  https://communities.sas.com/t5/General-SAS-Programming/Get-data-into-sql-server-more-efficiently/td-...

 

I will continue playing with the libname and/or dataset options, but at 5 hours a pop for performance metrics, I thought I'd post to see if someone has further ideas.  For example, is it possible that bulk loading needs to be tweaked on SQL Server itself (not that I'd know how to diagnose that, and I'm not the DBA)?


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ballardw
Super User

It may help to be a bit more specific about the types of data changes you are making.

For instance, does every single row require a change? If not, can you extract just those rows and then update the source table?

 

The type of transform may be amenable to in-database code, which would cut out the download/upload steps that I suspect account for a moderate amount of your time.

ScottBass
Rhodochrosite | Level 12

The code is doing about 7 distinct transformations to fix data errors from historic job runs.

The more complex transformations require knowing data from the previous row (e.g. LAG) and FIRST./LAST. processing (see the sketch below).  While doable in SQL Server, they are just easier in SAS.
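
To give a flavour, the cross-row logic is along these lines (a minimal sketch only; the by-variables and the fix-up rules here are made up for illustration):

proc sort data=workspde.RLDX_DEMOGRAPHIC;
   by rldx_record_key rldx_valid_from;
run;

data workspde.RLDX_DEMOGRAPHIC_fixed;
   set workspde.RLDX_DEMOGRAPHIC;
   by rldx_record_key rldx_valid_from;

   /* value carried down from the previous row */
   prev_valid_to = lag(rldx_valid_to);
   if first.rldx_record_key then prev_valid_to = .;

   /* flag the latest row per key (illustration only) */
   is_latest = (last.rldx_record_key);
run;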

 

For instance, does every single row require a change? If not, can you extract just those rows and then update the source table?

No, not every row requires a change, but every row might.  But I don't know which rows require a change until I do the SAS processing, especially for transformations that require cross-row logic as described above.

 

I could review my code, determine which columns are actually updated, and refactor it to download only those columns (plus supporting columns), then upload that narrower table and run a series of UPDATE statements on SQL Server (see the sketch below).
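
Something like this, for example (a hypothetical sketch only; the staging table, key, and corrected column names are made up, and the connection string is copied from the libname in my log):

* Upload only the key plus the corrected columns to a narrow staging table;
data prod.RLDX_DEMOGRAPHIC_FIX (bulkload=yes bl_options=tablock);
   set workspde.fixes (keep=rldx_record_key sex dob);
run;

* Then update the full table in place on SQL Server;
proc sql;
   connect to odbc (noprompt="Driver={SQL Server Native Client 10.0};Server=MYSERVER;Database=MYDB;Trusted_Connection=yes;");
   execute (
      update t
         set t.sex = f.sex,
             t.dob = f.dob
        from prod.RLDX_DEMOGRAPHIC     t
        join prod.RLDX_DEMOGRAPHIC_FIX f
          on t.rldx_record_key = f.rldx_record_key
   ) by odbc;
   disconnect from odbc;
quit;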

 

But programmer time is often more critical than machine time.

 

For the purpose of this post, I'm just interested in improving the write performance from SAS to SQL Server.  I've built a test suite testing the dataset options bulkload=yes, bl_options=tablock, dbcommit=, and insertbuff=, and comparing PROC APPEND to an existing table vs. a data step writing a new table (sketched below).  I won't have those metrics until the end of the day, if not overnight.  But early results on a small subset of the data (10M rows) seem to indicate that increasing dbcommit= is promising.
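
For what it's worth, the two load approaches being compared look roughly like this (a sketch; the test table names are made up and the option values vary per test run):

* Approach 1: PROC APPEND to an existing (truncated) table;
proc append
   base=prod.TEST_TARGET (bulkload=yes bl_options=tablock dbcommit=1000000000)
   data=workspde.test_subset;
run;

* Approach 2: data step writing a new table;
data prod.TEST_TARGET_NEW (bulkload=yes bl_options=tablock dbcommit=1000000000);
   set workspde.test_subset;
run;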

 

I'll post a summary of those results once I have them.


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ScottBass
Rhodochrosite | Level 12

These aren't the final results of my test suite, but preliminary runs showed promise for increasing dbcommit=.

 

I re-ran my transformation processing, increasing dbcommit to 1B (1,000,000,000) from the previous value of 100K (100,000).  1B is much larger than the number of rows in the source dataset.  AFAIK this change reduces transaction log overhead on SQL Server.

 

The load time dropped from 5 hours to 30 minutes.

 

Excerpt from the log:

 

671        proc append
ODBC: AUTOCOMMIT is NO for connection 5
 
ODBC_6: Prepared: on connection 5
SELECT * FROM "prod"."RLDX_DEMOGRAPHIC" WHERE 0=1 
 
ODBC: AUTOCOMMIT is NO for connection 6
ODBC: COMMIT performed on connection 6.
 
ODBC_7: Prepared: on connection 6
SELECT * FROM "prod"."RLDX_DEMOGRAPHIC"
 
672           base=prod.RLDX_DEMOGRAPHIC (
673              bulkload=yes bl_options=tablock dbcommit=1000000000
674           )
675           data=workspde.RLDX_DEMOGRAPHIC6 (
676              keep=
677        rldx_record_key
678        rldx_audit_key
679        recnum
680        stay_number
681        episode_sequence_number
682        hospital_type
683        source
684        hoscode
685        tfrhosp
686        transfrom
687        mrn
688        surname
689        givnames
690        sex
691        dob
692        age
693        address
694        locality
695        pcode
696        cob
697        admdate
698        sepdate
699        dthdate
700        notifdate
701        patientnum
702        supi
703        moh_auid
704        AHS_hosp
705        dc_extract_date
706        cob_sacc
707        state_of_usual_residence
708        rldx_valid_from
709        rldx_valid_to
710        rldx_current_ind
711           );
712        run;

NOTE: Appending WORKSPDE.RLDX_DEMOGRAPHIC6 to PROD.RLDX_DEMOGRAPHIC.
NOTE: There were 67,766,170 observations read from the data set WORKSPDE.RLDX_DEMOGRAPHIC6.
NOTE: 67766170 observations added.
NOTE: The data set PROD.RLDX_DEMOGRAPHIC has . observations and 34 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           30:38.34
      user cpu time       18:12.94
      system cpu time     41.20 seconds
      memory              685.25k
      OS Memory           22836.00k
      Timestamp           25/03/2020 01:11:54 PM

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Patrick
Opal | Level 21

I very often set dbcommit to zero, meaning there is only a single commit after all data is loaded. I do this not only for performance but also because I very often prefer an all-or-nothing approach and thus never have to deal with partially loaded tables.

 

If going for more than one commit, then disabling indexes during the load process should also speed up processing.
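
In the context of the original post, that would look something like this (a sketch only; table names as per Scott's log):

proc append
   base=prod.RLDX_DEMOGRAPHIC (bulkload=yes bl_options=tablock dbcommit=0)
   data=workspde.RLDX_DEMOGRAPHIC6;
run;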

LinusH
Tourmaline | Level 20

I don't know if this is applicable for your case, but in our project there was one libname option that really made a difference:

 

CONNECTION=UNIQUE.

 

Not sure if this has an effect on bulk load scenarios though.
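
For example, added to the existing libname it would look like this (a sketch; the other options are as shown in Scott's log):

libname PROD odbc
   noprompt="Driver={SQL Server Native Client 10.0};Server=MYSERVER;Database=MYDB;Trusted_Connection=yes;"
   schema=prod bulkload=yes dbcommit=100000 direct_exe=delete preserve_names=yes
   connection=unique;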

Data never sleeps
ChrisNZ
Tourmaline | Level 20

You could ask the SQL Server DBA if they see anything interesting in their logs.

Ksharp
Super User
Hi, did you try the INSERTBUFF=10000 option?
libname PROD odbc ........... insertbuff=10000 readbuff=10000 ;

Or try explicit pass-through SQL like:

proc sql;
connect to odbc (.......);
execute( create table want (id int, sex char(20) .......) ) by odbc;
/* note: both want and have must be tables on the SQL Server side */
execute( insert into want select * from have ) by odbc;
disconnect from odbc;
quit;
dhavalyparikh
Obsidian | Level 7

Hi Scott - Did you figure out the performance issue pushing updates from SAS to a SQL Server table with identical values, almost 30K rows and 100+ columns? I am having exactly the same issue. The update does work if I run it for a specific list of rows (it takes time, but it updates), but when I run it for the complete data set it has already been running for over an hour and is still going. Any hints, clues, or better alternatives?

 

SASKiwi
PROC Star

In my experience you need to change the INSERTBUFF and DBCOMMIT settings to improve table loading performance with SQL Server. Start with INSERTBUFF = 10000 and DBCOMMIT = 0, then try higher values of INSERTBUFF to see if it loads even faster.
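
For example (a sketch only, with the connection string from the log above; note that INSERTBUFF applies to conventional inserts, i.e. when not bulk loading):

libname PROD odbc
   noprompt="Driver={SQL Server Native Client 10.0};Server=MYSERVER;Database=MYDB;Trusted_Connection=yes;"
   schema=prod insertbuff=10000 dbcommit=0;

proc append base=prod.RLDX_DEMOGRAPHIC data=workspde.RLDX_DEMOGRAPHIC6;
run;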

Patrick
Opal | Level 21

@dhavalyparikh 

As @SASKiwi suggested, try dbcommit=0 and set insertbuff= to something significantly higher than the default of 1.

Which value of insertbuff performs best depends on your environment, so test with a few different ones to find the sweet spot for your environment and table (a sketch of such a test follows).
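
A simple way to test is to time a few values on a small subset, something like this (a hypothetical sketch; the table names are made up and each run's real time appears in the log):

%macro test_insertbuff(values=1000 10000 32767);
   %local i val;
   %do i = 1 %to %sysfunc(countw(&values));
      %let val = %scan(&values, &i);
      /* load the same subset with a different INSERTBUFF each time */
      proc append
         base=prod.TEST_INSERTBUFF_&val (insertbuff=&val dbcommit=0)
         data=workspde.test_subset;
      run;
   %end;
%mend test_insertbuff;
%test_insertbuff()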

 


 
