Hello,
we have been experimenting with uploading results generated with SAS to a Databricks SQL warehouse. The code looks as follows:
libname dbx odbc prompt="Driver={Simba Spark ODBC Driver};
Host=foo.cloud.databricks.com;
Port=443;
HTTPPath=/sql/1.0/warehouses/bar;
SSL=1;
ThriftTransport=2;
AuthMech=3;
UID=token;
PWD=baz;
Catalog=foofoo;
Schema=barbar;
DefaultStringColumnLength=32767"
dbcommit=10000
insertbuff=10000
readbuff=1000
dbcreate_table_opts="TBLPROPERTIES('delta.columnMapping.mode' = 'name', 'delta.checkpoint.writeStatsAsStruct' = 'false', 'delta.autoOptimize.optimizeWrite' = 'true')"
preserve_col_names=yes;
proc sql;
create table dbx.test as
select * from test;
quit;
The performance unfortunately leaves a lot to be desired. As an example, a dataset with 122 variables and 127,000 observations takes five to six minutes to upload, whereas the same data uploads to a Microsoft SQL Server database (also via ODBC) in less than 20 seconds.
Is this expected? According to the documentation, Databricks can also be accessed via the JDBC and Spark SAS/ACCESS modules, but unfortunately we do not have those licensed. Are there any options that could improve ODBC upload performance? I tried increasing insertbuff and dbcommit further, but I get the following error when going above 12,000:
ERROR: CLI execute error: [Simba][Hardy] (130) An error occurred while an INSERT statement which causes the driver to reconnect to the server.
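One thing that sometimes helps with slow ODBC inserts is avoiding intermediate commits altogether: DBCOMMIT=0 tells SAS to issue a single commit after the last row instead of committing every N rows. A sketch of how that would look on the libname (the connection string is elided; the INSERTBUFF value is just an illustration and your driver may cap it lower):

```sas
/* Sketch only: DBCOMMIT=0 commits once, after the last inserted row.
   Whether this helps against the Simba Spark driver needs to be tested. */
libname dbx odbc prompt="..."   /* same connection string as above */
    dbcommit=0                  /* single commit at the end of the load */
    insertbuff=10000            /* rows per batch; tune to what the driver accepts */
    preserve_col_names=yes;
```

This removes the per-batch commit round trips, at the cost of a larger single transaction, so it is worth testing on a copy of the table first.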
Thanks for your help in advance!
There are a lot of parameters that can affect performance in such a use case.
A few questions:
- Where does your comparison SQL Server reside (vs Databricks)?
- Have you tried loading this data into Databricks by other means? Locally (to rule out an actual load problem) and remotely (to compare with SAS).
Some hints might be given if you activate tracing:
options sastrace=',,,d' sastraceloc=saslog nostsuffix msglevel=i;
The architecture is somewhat complex for historical reasons. Creating the table directly in Databricks (CTAS from the SQL Server table connected as a catalog) takes two minutes. That is still much longer than creating a SQL Server table, but nevertheless two to three times faster than creating it via ODBC.
Did you try using PROC COPY instead of PROC SQL? Instead of:

proc sql;
create table dbx.test as
select * from test;
quit;

try:

proc copy in=work out=dbx;
select test;
run;
What is the download performance for the same table? If both are taking a similar amount of time then I suggest that it is your network connection to Databricks that is the problem.
Download performance is not an issue. If it does not fail due to insufficient memory (I need to investigate why this happens), the entire table is downloaded in around 10 seconds.
In my experience tuning INSERTBUFF works for SQL Server, but that does not appear to be the case with Databricks. Have you talked to the Databricks DBAs to see if they have any ideas about the slowness? It might be worth trying bulk loading to see if that helps.
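For reference, SAS/ACCESS engines that support a DBMS bulk-load facility expose it via the BULKLOAD= option. I don't know whether the generic ODBC engine supports bulk loading against the Simba Spark driver, so treat this as a sketch to try rather than a confirmed option:

```sas
/* Hypothetical: BULKLOAD= may not be supported by the ODBC engine
   for the Simba Spark driver - check the SAS/ACCESS documentation
   for your release before relying on it. */
data dbx.test (bulkload=yes);
    set work.test;
run;
```

If the option is not supported for this driver, SAS will report an error rather than silently falling back, so the test is cheap.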