The Challenge
Joining data from different data sources (a heterogeneous join) requires copying the data to SAS for processing, which can take considerable time depending on the size of the data.
The most effective way to process such a join is to use database temp tables to perform the join in-database (see my post “How to use database temp tables to improve performance of heterogeneous joins”). That strategy, however, may not be available or permitted. When that is the case, what other options exist for reducing data movement between a database and SAS?
The Solution
One of the best practices when querying a database is to reduce the amount of data copied to SAS by limiting columns and rows. To limit columns, do not SELECT * in PROC SQL or copy a whole database table in a DATA step; instead, select only the columns you need. Similarly, to limit rows, use a WHERE clause to subset the data to only what is required. Limiting rows and columns can improve query speed, but it may not have much impact if your query still copies very large amounts of database data (hundreds of gigabytes or terabytes) to SAS.
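As a minimal sketch of this practice (reusing the SNOW library and trans_100M table from the example later in this post), the query below names only the columns it needs and adds a WHERE clause; the SAS/ACCESS engine passes the WHERE clause to the database when it can, so unneeded rows never travel to SAS:
/* Sketch only: SNOW and trans_100M are the library and table from the
   example below; the date cutoff is illustrative. */
proc sql;
   create table work.recent_trans as
   select ID, Date, Var2              /* only the columns needed */
   from snow.trans_100M
   where Date >= '01JAN2023'd;        /* row subset passed to the DBMS
                                         where possible */
quit;
Compare the generated SQL in the SASTRACE log output (enabled in Step 1 below) to confirm the WHERE clause was passed to the database.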
Another option is available if you are joining a SAS table to a database table on a column with no more than 4,500 unique values. Adding the option MULTI_DATASRC_OPT=IN_CLAUSE to a SAS/ACCESS® interface LIBNAME statement instructs the PROC SQL optimizer to generate an IN clause for joins. This prevents SAS from retrieving all the rows from the database table. Instead, SAS performs a row count on each table to determine which is larger*, identifies the unique join values in the smaller table, and retrieves only the rows in the larger table that match those values.
*To further improve performance by eliminating the row count, use the data set option DBLARGETABLE=YES. This option identifies which table is larger when processing a join. Note that this option is ignored when outer joins are processed.
An Example
I have two data sets to join in PROC SQL:
A small SAS data set in WORK, customers_1K: 1,000 rows, 1 column, up to 1,000 unique IDs.
A large database table in Snowflake, trans_100M: 100 million rows, 5 columns, up to 100,000 unique IDs.
When I perform a left join on ID values (scenario 1), it runs in 156.45 seconds, with all but 0.15 seconds being the time needed to unload the larger table from Snowflake into SAS (even with the benefit of BULKUNLOAD, available in the SAS/ACCESS® interface to Snowflake).
When I rerun the Snowflake LIBNAME with the option MULTI_DATASRC_OPT=IN_CLAUSE and rerun the same join (scenario 2), it runs in 10.03 seconds (a 93.6% reduction!). The difference is that instead of copying the entire Snowflake table to SAS, only the rows matching the unique ID values in the smaller table are extracted from the database. The less data a query requires, the less time it takes to move it between Snowflake and SAS.
Finally (scenario 3), after identifying the larger table in my PROC SQL statement with the data set option DBLARGETABLE=YES and repeating the join, the time drops by another 0.6 seconds. This option eliminates the row count step: SAS no longer counts the 1,000 and 100 million rows to determine which table is larger, and that saves time.
Scenario | Options used                                                        | Time to move data from Snowflake to SAS | Real time
1        | None                                                                | 156.30 sec.                             | 156.45 sec.
2        | MULTI_DATASRC_OPT=IN_CLAUSE                                         | 10.02 sec.                              | 10.03 sec.
3        | MULTI_DATASRC_OPT=IN_CLAUSE, DBLARGETABLE=YES (on Snowflake table)  | 9.28 sec.                               | 9.43 sec.
Code to test on your environment
Use the following code to create the data sets and replicate the scenarios described above. I used the SAS/ACCESS® interface to Snowflake; if you use another SAS/ACCESS® interface, substitute your database information in the LIBNAME statements.
Step 1: Add system options to enhance performance statistics in the log.
OPTION
SASTRACE=',,,ds'
SASTRACELOC=SASLOG
NOSTSUFFIX
SQL_IP_TRACE=(note, source)
msglevel=i
FULLSTIMER;
Step 2: Create a LIBNAME to your database without the option MULTI_DATASRC_OPT=IN_CLAUSE.
/* LIBNAME macro variables values not shown*/
libname SNOW snow
server=&SFServer
db=&SFDB
user=&SFUser
pw=&SFPW
schema=&SFSchema
bulkload=yes
bulkunload=yes
bl_internal_stage="user/test1";
Step 3: Create a 100 million row data set in Snowflake and a 1000 row data set in SAS WORK.
data snow.trans_100M;
format Date date9.;
do i=1 to 100000000;
ID=Rand('integer', 1, 100000);
Date=Rand('integer', 22995, 23725);
Var2=Rand('integer', 1, 10000);
Var3=Rand('integer', 1, 5000);
Var4=Rand('integer', 1, 40000);
output;
end;
drop i;
run;
data work.customers_1K;
do i=1 to 1000;
ID=Rand('integer', 1, 10000);
output;
end;
drop i;
run;
Step 4: Scenario 1 - Join smaller table in SAS WORK and larger table in Snowflake.
proc sql;
create table work.testa as select
a.id,
b.Date,
b.Var2,
b.Var3,
b.Var4
from work.customers_1K a left join snow.trans_100M b
on a.id = b.id;
quit;
Review the log to get the baseline real time and note the SAS/ACCESS engine time to unload data to SAS.
SNOWFLAKE: Bulkload seconds used for setup: 105.898558
SNOWFLAKE: Bulkunload conversion (seconds): 145.717032
Summary Statistics for SNOWFLAKE are:
Total row fetch seconds were: 0.000031
Total SQL execution seconds were: 105.896808
Total SQL prepare seconds were: 0.846087
Total SQL describe seconds were: 0.000060
Total seconds used by the SNOWFLAKE ACCESS engine were 156.304091
NOTE: Table WORK.TESTA created, with 999175 rows and 5 columns.
....
NOTE: PROCEDURE SQL used (Total process time):
real time 2:36.45
Step 5: Scenario 2 – Add the LIBNAME option MULTI_DATASRC_OPT=IN_CLAUSE and rerun the join from scenario 1.
libname snow clear;
libname SNOW snow
server=&SFServer
db=&SFDB
user=&SFUser
pw=&SFPW
schema=&SFSchema
bulkload=yes
bulkunload=yes
bl_internal_stage="user/test1"
multi_datasrc_opt=in_clause;
proc sql;
create table work.testb as select
a.id,
b.Date,
b.Var2,
b.Var3,
b.Var4
from work.customers_1K a left join snow.trans_100M b
on a.id = b.id;
quit;
Review the log to get the comparative real time. Note the SELECT “ID” … FROM … trans_100M WHERE ((“ID” IN (21, 28, …). The values in the parentheses are the unique ID values from the smaller SAS data set.
SNOWFLAKE_2: Prepared: on connection 0
SELECT "ID", "Date", "Var2", "Var3", "Var4" FROM "PECHIN"."trans_100M" WHERE ( ( "ID" IN ( 21 , 28 , 48 , 65 , 67 , 80 , 83 ,
84 , 87 , 108 , 120 , 128 , 143 , 146 , 169 , 179 , 204 , 221 , 233 , 236 , 263 , 284 , 288 , 302 , 310 , 314 , 325 , 330 , 337 ,
....
Summary Statistics for SNOWFLAKE are:
Total row fetch seconds were: 0.000030
Total SQL execution seconds were: 9.280490
Total SQL prepare seconds were: 0.394356
Total SQL describe seconds were: 0.000062
Total seconds used by the SNOWFLAKE ACCESS engine were 10.024368
NOTE: Table WORK.TESTB created, with 999175 rows and 5 columns.
89 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 10.03 seconds
Step 6: Scenario 3 – Add the data set option DBLARGETABLE=YES to the PROC SQL query to identify the larger table and avoid the row count step.
proc sql;
create table work.testb as select
a.id,
b.Date,
b.Var2,
b.Var3,
b.Var4
from work.customers_1K a left join snow.trans_100M (dblargetable=yes) b
on a.id = b.id;
quit;
Review the log to get the comparative real time.
Summary Statistics for SNOWFLAKE are:
…..
Total seconds used by the SNOWFLAKE ACCESS engine were 9.28
NOTE: Table WORK.TESTB created, with 999175 rows and 5 columns.
89 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 9.43 seconds
Resources
Limiting Retrieval
Passing the WHERE Clause to the DBMS
MULTI_DATASRC_OPT= LIBNAME Statement Option
DBLARGETABLE= Data Set Option
Temporary Table Support for SAS/ACCESS