
How to use database temp tables to improve performance of heterogeneous joins


The Challenge

 

Analysts often need to query a large database by joining it to a smaller data set in SAS (or in another database).  For example, a data set of customer IDs is joined to a transactional database to generate summary statistics per customer.  A query that joins data from different sources (aka a heterogeneous join) requires copying the database data to SAS for processing, and the larger the data, the longer the copy takes.  If the table you are joining in the database is multiple gigabytes or terabytes, you will experience slow performance. 

 

Besides the impact of this challenge on productivity, there may also be a cost.  Many cloud-based databases charge for extracting data.  If this is the case, minimizing data movement between the database and SAS will save money.

 

The Solution

 

The most efficient alternative is to have the data and the processing in the same location (aka a homogeneous join).  Most SAS/ACCESS® interface engines let you create temporary tables in the database, so you can copy data into the database temporarily and process the join in-database (see the Resources section).  Although it still takes time to copy the smaller data set into the database, processing the join in-database should be much faster than the heterogeneous join.
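
Reduced to its essentials, the pattern looks like the sketch below.  The librefs and table names here are placeholders of my own; the full, working Snowflake code appears in the steps later in this article.

/* Sketch only: assumes DBPERM and DBTEMP are librefs to the same database,
   sharing a global connection, with DBTEMP assigned with DBMSTEMP=YES */

/* 1. Stage the small SAS table as a database temp table */
DATA DBTEMP.CUSTOMERS;
	SET WORK.CUSTOMERS;
RUN;

/* 2. Join in-database; only the small result set returns to SAS */
PROC SQL;
	CREATE TABLE WORK.SUMMARY AS
	SELECT t1.ID, COUNT(*) AS N_TRANS
	FROM DBTEMP.CUSTOMERS t1
	INNER JOIN DBPERM.TRANSACTIONS t2 ON (t1.ID=t2.ID)
	GROUP BY t1.ID;
QUIT;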

 

An Example 

 

I have two data sets to join in PROC SQL:

  • DATA_100K  = 100,000 rows, 4 numeric columns = 3 MB
  • DATA_1B = 1 billion rows, 4 numeric columns = 7.1 GB

When I perform a heterogeneous join (scenario 1) with DATA_100K in SAS WORK and DATA_1B in Snowflake, it completes in 1 hour 22 minutes.  All but 20 seconds of that time was spent copying DATA_1B from Snowflake to SAS.  Fortunately, the SAS/ACCESS® Interface to Snowflake enables bulk loading and unloading; otherwise, the copy would take even longer.

 

In scenario 2, I copy DATA_100K from SAS WORK into a Snowflake temp table, perform the join in Snowflake, and send the output to SAS WORK. This completes in only 16.7 seconds, a 99% reduction!  That 16.7 seconds breaks down as: 4.8 seconds to copy DATA_100K into a Snowflake temp table, 6.2 seconds to process the join in Snowflake, and 5.7 seconds to send the 5,000-row output table to SAS WORK.  

 

In scenario 3, I repeat scenario 2 but redirect the output to Snowflake.  This drops the total time to 9 seconds.


 


 

While SAS/ACCESS® enables the creation of temp tables, your ability to leverage this capability will depend on your database permissions and your organization’s policy.  Check with your database administrator to determine whether temp tables are an option for you.
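
One quick way to check is to try creating a one-row temp table: if you lack the privilege, the step fails immediately with a database error.  This sketch assumes a libref assigned with DBMSTEMP=YES, like the SNOWTEMP libref in Step 2 below:

/* Attempt to create a one-row temp table; an ERROR in the log
   means temp tables are not available to you */
DATA SNOWTEMP.TEMP_TEST;
	X=1;
RUN;

/* Drop the test table if the step succeeded */
PROC DELETE DATA=SNOWTEMP.TEMP_TEST;
RUN;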

 

Code to test in your environment

 

Use the following code to create the data sets and replicate the scenarios described above.  I used the SAS/ACCESS® Interface to Snowflake.  If you use another SAS/ACCESS® interface, substitute your database information in the LIBNAME statements.

 

Step 1:  Add system options to enhance performance statistics in the log.

OPTIONS
	SASTRACE=',,,ds'            /* trace DBMS calls with summary timing */
	SASTRACELOC=SASLOG          /* send trace output to the SAS log */
	NOSTSUFFIX                  /* make SASTRACE output easier to read */
	SQL_IP_TRACE=(note, source) /* report implicit pass-through and show the generated SQL */
	MSGLEVEL=i                  /* write additional informational notes */
	FULLSTIMER;                 /* full performance statistics for each step */

 

Step 2:  Create two librefs to your database.  To join a temporary table and a permanent table, you need a libref for each, and the librefs must share a global connection.  Note in the code below that all LIBNAME options are identical except that SNOWTEMP adds DBMSTEMP=YES.

/* LIBNAME macro variable values not shown */

/* Permanent tables */
LIBNAME SNOWPERM snow server=&SFServer db=&SFDB user=&SFUser pw=&SFPW schema=&SFSchema
	bulkload=yes bulkunload=yes bl_internal_stage="user/test1" connection=GLOBAL dbcommit=0;

/* Temporary tables: same options plus DBMSTEMP=YES, sharing the global connection */
LIBNAME SNOWTEMP snow server=&SFServer db=&SFDB user=&SFUser pw=&SFPW schema=&SFSchema
	bulkload=yes bulkunload=yes bl_internal_stage="user/test1" connection=GLOBAL dbcommit=0 dbmstemp=yes;

 

Step 3: Create a billion-row data set in SNOWPERM and a 100K-row data set in SAS WORK.

data SNOWPERM.DATA_1B;
	do i=1 to 1000000000;
		ID=Rand('integer', 1, 5000);
		Var2=Rand('integer', 1, 10000);
		Var3=Rand('integer', 1, 5000);
		Var4=Rand('integer', 1, 40000);
		output;
	end;
	drop i;
run;

data work.DATA_100K;
	do i=1 to 100000;
		ID=Rand('integer', 1, 5000);
		Var2=Rand('integer', 1, 10000);
		Var3=Rand('integer', 1, 5000);
		Var4=Rand('integer', 1, 40000);
		output;
	end;
	drop i;
run;
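
Optionally, confirm that both tables landed as expected before timing anything.  A COUNT(*) on the Snowflake table should be passed to Snowflake by implicit pass-through, so only the count itself travels back to SAS.  This check is my addition and is not part of the timed scenarios:

PROC SQL;
	SELECT COUNT(*) AS N_ROWS FROM SNOWPERM.DATA_1B;   /* count runs in Snowflake */
	SELECT COUNT(*) AS N_ROWS FROM WORK.DATA_100K;     /* count runs in SAS */
QUIT;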

 

Step 4: Scenario 1 - Join the smaller table in SAS WORK to the larger table in SNOWPERM (aka heterogeneous join)

PROC SQL;
	CREATE TABLE WORK.SCENARIO_1 AS
	SELECT t1.ID,
		(AVG(t1.Var2)) LENGTH=8 AS AVG_Var2,
		(MIN(t2.Var3)) LENGTH=8 AS MIN_Var3,
		(MAX(t2.Var4)) LENGTH=8 AS MAX_Var4
	FROM WORK.DATA_100K t1
	INNER JOIN SNOWPERM.DATA_1B t2 ON (t1.ID=t2.ID)
	GROUP BY t1.ID;
QUIT;

Review the log to get the baseline real time, and note the SAS/ACCESS engine time spent unloading data to SAS.

 

NOTE: Table WORK.SCENARIO_1 created, with 5000 rows and 4 columns.

SNOWFLAKE:  Bulkload seconds used for setup:         536.267720

SNOWFLAKE:  Bulkunload conversion (seconds):       4911.57829

real time           1:22:11.58

 

Step 5:  Scenario 2 - Copy the smaller table from SAS WORK to SNOWTEMP, join the data in Snowflake, and send the output to SAS WORK.

DATA SNOWTEMP.DATA_100K;
	SET WORK.DATA_100K;
RUN;

PROC SQL;
	CREATE TABLE WORK.SCENARIO_2 AS
	SELECT t1.ID,
		(AVG(t1.Var2)) LENGTH=8 AS AVG_Var2,
		(MIN(t2.Var3)) LENGTH=8 AS MIN_Var3,
		(MAX(t2.Var4)) LENGTH=8 AS MAX_Var4
	FROM SNOWTEMP.DATA_100K t1
	INNER JOIN SNOWPERM.DATA_1B t2 ON (t1.ID=t2.ID)
	GROUP BY t1.ID;
QUIT;

Review the log to get the comparative real time.

 

/*Copy file from SAS WORK to SNOWTEMP*/ 

DATA SNOWTEMP.DATA_100K;

SET WORK.DATA_100K;

...

real time 4.8 seconds

 

NOTE: Table WORK.SCENARIO_2 created, with 5000 rows and 4 columns.
...
real time 11.9 seconds

 

Step 6: Scenario 3 - Redirect query output from Scenario 2 to SNOWTEMP and run.

PROC SQL;
	CREATE TABLE SNOWTEMP.SCENARIO_3 AS
	SELECT t1.ID,
		(AVG(t1.Var2)) LENGTH=8 AS AVG_Var2,
		(MIN(t2.Var3)) LENGTH=8 AS MIN_Var3,
		(MAX(t2.Var4)) LENGTH=8 AS MAX_Var4
	FROM SNOWTEMP.DATA_100K t1
	INNER JOIN SNOWPERM.DATA_1B t2 ON (t1.ID=t2.ID)
	GROUP BY t1.ID;
QUIT;

Review the log to get the comparative real time.

...

real time 4.1 seconds
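
Keep in mind that SNOWTEMP.SCENARIO_3 is a database temp table, so it is dropped when the connection ends.  If you need the output afterward, copy it to a permanent location before clearing the librefs.  The code below is an optional follow-up of my own, not part of the timed scenarios:

/* Persist the temp-table output (choose SNOWPERM or WORK as the target) */
DATA SNOWPERM.SCENARIO_3;
	SET SNOWTEMP.SCENARIO_3;
RUN;

/* Optional cleanup: drop the billion-row test table and clear the librefs,
   which also drops any remaining temp tables */
PROC DELETE DATA=SNOWPERM.DATA_1B;
RUN;

LIBNAME SNOWTEMP CLEAR;
LIBNAME SNOWPERM CLEAR;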

 

Resources
