Hello,
we have been experimenting with uploading results generated with SAS to a Databricks SQL warehouse. The code looks as follows:
libname dbx odbc prompt="Driver={Simba Spark ODBC Driver};
Host=foo.cloud.databricks.com;
Port=443;
HTTPPath=/sql/1.0/warehouses/bar;
SSL=1;
ThriftTransport=2;
AuthMech=3;
UID=token;
PWD=baz;
Catalog=foofoo;
Schema=barbar;
DefaultStringColumnLength=32767"
dbcommit=10000
insertbuff=10000
readbuff=1000
dbcreate_table_opts="TBLPROPERTIES('delta.columnMapping.mode' = 'name', 'delta.checkpoint.writeStatsAsStruct' = 'false', 'delta.autoOptimize.optimizeWrite' = 'true')"
preserve_col_names=yes;
proc sql;
create table dbx.test as
select * from test;
quit;
The performance unfortunately leaves a lot to be desired. As an example, a dataset with 122 variables and 127,000 observations takes five to six minutes to upload, whereas the same data uploads to a Microsoft SQL Server database (also via ODBC) in less than 20 seconds.
Is this expected? According to the documentation, Databricks can also be accessed via the JDBC and Spark SAS/ACCESS modules, but unfortunately we do not have those licensed. Are there any options that could improve ODBC upload performance? I tried increasing insertbuff and dbcommit further, but I get the following error when going above 12,000:
ERROR: CLI execute error: [Simba][Hardy] (130) An error occurred while an INSERT statement which causes the driver to reconnect to the server.
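One thing that sometimes helps with slow ODBC inserts is avoiding intermediate commits altogether: DBCOMMIT=0 tells SAS to issue a single commit after the last row instead of committing every N rows. A sketch of how that would look on the libname (the connection string is elided; the INSERTBUFF value is just an illustration and your driver may cap it lower):

```sas
/* Sketch only: DBCOMMIT=0 commits once, after the last inserted row.
   Whether this helps against the Simba Spark driver needs to be tested. */
libname dbx odbc prompt="..."   /* same connection string as above */
    dbcommit=0                  /* single commit at the end of the load */
    insertbuff=10000            /* rows per batch; tune to what the driver accepts */
    preserve_col_names=yes;
```

This removes the per-batch commit round trips, at the cost of a larger single transaction, so it is worth testing on a copy of the table first.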
Thanks for your help in advance!
There are a lot of parameters that can affect performance in such a use case.
A few questions:
- Where does your comparison SQL Server reside (vs Databricks)?
- Have you tried loading this data into Databricks by other means? Locally (to rule out an actual load problem) and remotely (to compare with SAS).
Some hints might be given if you activate tracing:
options sastrace=',,,d' sastraceloc=saslog nostsuffix msglevel=i;
The architecture is somewhat complex for historical reasons. Creating the table directly in Databricks (CTAS from the SQL Server table connected as a catalog) takes two minutes. That is still much longer than creating a SQL Server table, but nevertheless two to three times faster than creating it via ODBC.
Did you try using PROC COPY instead of PROC SQL? Instead of:

proc sql;
create table dbx.test as
select * from test;
quit;

try:

proc copy in=work out=dbx;
select test;
run;
What is the download performance for the same table? If both are taking a similar amount of time then I suggest that it is your network connection to Databricks that is the problem.
Download performance is not an issue. If it does not fail due to insufficient memory (I need to investigate why this happens), the entire table is downloaded in around 10 seconds.
In my experience tuning INSERTBUFF works for SQL Server, but that does not appear to be the case with Databricks. Have you talked to the Databricks DBAs to see if they have any ideas about the slowness? It might be worth trying bulk loading to see if that helps.
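For reference, SAS/ACCESS engines that support a DBMS bulk-load facility expose it via the BULKLOAD= option. I don't know whether the generic ODBC engine supports bulk loading against the Simba Spark driver, so treat this as a sketch to try rather than a confirmed option:

```sas
/* Hypothetical: BULKLOAD= may not be supported by the ODBC engine
   for the Simba Spark driver - check the SAS/ACCESS documentation
   for your release before relying on it. */
data dbx.test (bulkload=yes);
    set work.test;
run;
```

If the option is not supported for this driver, SAS will report an error rather than silently falling back, so the test is cheap.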