
How to use LIBNAME option MULTI_DATASRC_OPT= to reduce data movement between SAS and a database


The Challenge

 

Joining data from different data sources (aka a heterogeneous join) requires copying the data to SAS for processing, which can take a long time depending on the size of the data.

 

The most effective way to process such a join is to use database temp tables to perform the join in-database (see my post "How to use database temp tables to improve performance of heterogeneous joins"). This strategy, however, may not be available or permitted. When that is the case, what other options exist for reducing data movement between a database and SAS?

 

The Solution

 

One of the best practices when querying a database is to reduce the amount of data copied to SAS by limiting columns and rows. To reduce columns, do not "select *" in PROC SQL or copy a whole database table in a DATA step; instead, select only the columns you need. Similarly, to reduce rows, use a WHERE clause to subset the data to only what is required. Limiting rows and columns can improve query performance, but it may not have much impact if your query still copies very large amounts of database data (hundreds of gigabytes or terabytes) to SAS.
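For example, here is a minimal sketch of the practice (the libref mydb and the table sales are hypothetical placeholders for your own SAS/ACCESS® libref and database table):

/* Avoid: copies every row and column of the database table to SAS */
proc sql;
create table work.all_sales as
select * from mydb.sales;
quit;

/* Prefer: select only the needed columns and subset rows with WHERE,
   so the database filters the data before it moves to SAS */
proc sql;
create table work.recent_sales as
select id, sale_date, amount
from mydb.sales
where sale_date >= '01JAN2024'd;
quit;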

 

Another option is available if you are joining a table to a database table on a column with no more than 4,500 unique values. Adding the option MULTI_DATASRC_OPT=IN_CLAUSE to a SAS/ACCESS® LIBNAME statement instructs the PROC SQL optimizer to generate an IN clause for joins. This prevents SAS from retrieving all the rows from the database table. Instead, SAS performs a row count operation on each table to determine the larger table*; identifies the unique join values in the smaller table; and retrieves only the rows in the larger table that match those unique values.

 

*To further improve performance by eliminating the row count, use the data set option DBLARGETABLE=YES. This option identifies which table is larger when a join is processed. Note that this option is ignored when outer joins are processed.
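DBLARGETABLE=YES is specified as a data set option on the database table reference in the query. A minimal sketch of the placement follows (mydb, small_tbl, and big_tbl are hypothetical names; the full working example appears in Step 6 below):

proc sql;
create table work.result as
select a.id, b.amount
from work.small_tbl a left join mydb.big_tbl (dblargetable=yes) b
on a.id = b.id;
quit;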

 

An Example

 

I have two data sets to join in PROC SQL:

  • A small SAS data set in WORK, customers_1K: 1,000 rows, 1 column, with up to 1,000 unique IDs.
  • A large database table in Snowflake, trans_100M: 100 million rows, 5 columns, with up to 100,000 unique IDs.

When I perform a left join on ID values (scenario 1), it runs in 156.45 seconds, with all but 0.15 seconds being the time needed to unload the larger table from Snowflake into SAS (even with the benefit of BULKUNLOAD, available in the SAS/ACCESS® interface to Snowflake).

 

When I resubmit the Snowflake LIBNAME statement with the option MULTI_DATASRC_OPT=IN_CLAUSE and rerun the same join (scenario 2), it runs in 10.03 seconds (a 93.6% reduction!). The difference is that instead of copying the entire Snowflake table to SAS, only the rows matching the unique values in the smaller table are extracted from the database. The less data the query requires, the less time it takes to move it between Snowflake and SAS.

 

Finally (scenario 3), after identifying the larger table in my PROC SQL query with the data set option DBLARGETABLE=YES and repeating the join, the time drops by a further 0.6 seconds. This option eliminates the row count step: SAS no longer needs to count the 1,000 or 100 million rows, and that saves time.

 

Scenario | Options used | Time to move data from Snowflake to SAS | Real time
1 | None | 156.30 sec. | 156.45 sec.
2 | MULTI_DATASRC_OPT=IN_CLAUSE | 10.02 sec. | 10.03 sec.
3 | MULTI_DATASRC_OPT=IN_CLAUSE, DBLARGETABLE=YES (on Snowflake table) | 9.28 sec. | 9.43 sec.

 

Code to test in your environment

 

Use the following code to create the data sets and replicate the scenarios described above. I used the SAS/ACCESS® interface to Snowflake. If you use another SAS/ACCESS® interface, substitute your database information in the LIBNAME statements, as in the sketch below.
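For example, if your data lives in Oracle rather than Snowflake, the baseline LIBNAME statement in Step 2 might look like this instead (a sketch only; the macro variables are placeholders, and the Snowflake-specific bulk options would be replaced by your interface's equivalents):

libname ORA oracle
user=&ORAUser
password=&ORAPW
path=&ORAPath
schema=&ORASchema;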

 

Step 1:  Add system options to enhance performance statistics in the log.

OPTIONS
SASTRACE=',,,ds'             /* trace DBMS calls plus engine summary statistics */
SASTRACELOC=SASLOG           /* send trace output to the SAS log */
NOSTSUFFIX                   /* omit the trace timestamp suffix for readability */
SQL_IP_TRACE=(note, source)  /* note whether implicit pass-through occurred and show generated SQL */
MSGLEVEL=i                   /* print additional informational notes */
FULLSTIMER;                  /* full host performance statistics */

Step 2: Submit a LIBNAME statement for your database without the option MULTI_DATASRC_OPT=IN_CLAUSE.

/* LIBNAME macro variable values not shown */

libname SNOW snow
server=&SFServer
db=&SFDB
user=&SFUser
pw=&SFPW
schema=&SFSchema
bulkload=yes
bulkunload=yes
bl_internal_stage="user/test1";

Step 3: Create a 100 million row data set in Snowflake and a 1000 row data set in SAS WORK.

data snow.trans_100M;
	/* 100 million rows; IDs drawn from 1-100,000 (up to 100,000 unique values) */
	format Date date9.;
	do i=1 to 100000000;
		ID=rand('integer', 1, 100000);
		Date=rand('integer', 22995, 23725);	/* SAS date values (days since 01JAN1960) */
		Var2=rand('integer', 1, 10000);
		Var3=rand('integer', 1, 5000);
		Var4=rand('integer', 1, 40000);
		output;
	end;
	drop i;
run;

data work.customers_1K;
	/* 1,000 rows; IDs drawn from 1-10,000, so at most 1,000 unique values */
	do i=1 to 1000;
		ID=rand('integer', 1, 10000);
		output;
	end;
	drop i;
run;
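As a quick sanity check before running the scenarios, you can count the rows in each table; the count against the Snowflake table should be passed down to the database by implicit pass-through rather than fetching rows to SAS:

proc sql;
/* Executed in-database via implicit pass-through */
select count(*) as n_rows from snow.trans_100M;
/* Local count of the small SAS data set */
select count(*) as n_rows from work.customers_1K;
quit;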

Step 4: Scenario 1 - Join smaller table in SAS WORK and larger table in Snowflake.

proc sql;
create table work.testa as select
a.id,
b.Date,
b.Var2,
b.Var3,
b.Var4
from work.customers_1K a left join snow.trans_100M b
on a.id = b.id;
quit;

Review the log to get the baseline real time, and note the SAS/ACCESS engine time spent unloading data to SAS.

 

SNOWFLAKE:  Bulkload seconds used for setup:         105.898558

SNOWFLAKE:  Bulkunload conversion (seconds):       145.717032

 

Summary Statistics for SNOWFLAKE are:

Total row fetch seconds were:                       0.000031

Total SQL execution seconds were:                 105.896808

Total SQL prepare seconds were:                     0.846087

Total SQL describe seconds were:                    0.000060

Total seconds used by the SNOWFLAKE ACCESS engine were   156.304091

 

NOTE: Table WORK.TESTA created, with 999175 rows and 5 columns.

....

NOTE: PROCEDURE SQL used (Total process time):

      real time           2:36.45

 

 

Step 5: Scenario 2 – Add LIBNAME option MULTI_DATASRC_OPT=IN_CLAUSE and rerun the join from scenario 1.

libname snow clear;

libname SNOW snow
server=&SFServer
db=&SFDB
user=&SFUser
pw=&SFPW
schema=&SFSchema
bulkload=yes
bulkunload=yes
bl_internal_stage="user/test1"
multi_datasrc_opt=in_clause;

proc sql;
create table work.testb as select
a.id,
b.Date,
b.Var2,
b.Var3,
b.Var4
from work.customers_1K a left join snow.trans_100M b
on a.id = b.id;
quit;

Review the log to get the comparative real time. Note the SELECT "ID" ... FROM ... trans_100M WHERE (("ID" IN (21, 28, ..... statement: the values in the parentheses are the unique ID values from the smaller SAS data set.

 

SNOWFLAKE_2: Prepared: on connection 0

SELECT  "ID", "Date", "Var2", "Var3", "Var4"  FROM "PECHIN"."trans_100M"  WHERE  ( ( "ID" IN  ( 21 , 28 , 48 , 65 , 67 , 80 , 83 ,

84 , 87 , 108 , 120 , 128 , 143 , 146 , 169 , 179 , 204 , 221 , 233 , 236 , 263 , 284 , 288 , 302 , 310 , 314 , 325 , 330 , 337 ,

....

Summary Statistics for SNOWFLAKE are:

Total row fetch seconds were:                       0.000030

Total SQL execution seconds were:                   9.280490

Total SQL prepare seconds were:                     0.394356

Total SQL describe seconds were:                    0.000062

Total seconds used by the SNOWFLAKE ACCESS engine were    10.024368

 

NOTE: Table WORK.TESTB created, with 999175 rows and 5 columns.

89   quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time           10.03 seconds

 

Step 6: Scenario 3 – Add the data set option DBLARGETABLE=YES to the PROC SQL query to identify the larger table and avoid the row count step.

proc sql;
create table work.testb as select
a.id,
b.Date,
b.Var2,
b.Var3,
b.Var4
from work.customers_1K a left join snow.trans_100M (dblargetable=yes) b
on a.id = b.id;
quit;

Review the log to get the comparative real time.

 

Summary Statistics for SNOWFLAKE are:

…..

Total seconds used by the SNOWFLAKE ACCESS engine were    9.28

 

NOTE: Table WORK.TESTB created, with 999175 rows and 5 columns.

89   quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time           9.43 seconds

 
