With the global trend of moving applications and data to the cloud, SAS customers may also have to move their on-premises data to cloud storage or cloud databases. If the cloud is AWS, then Redshift is a potential candidate. Amazon Redshift is a massively parallel data warehousing database that is very easy and quick to spin up.
SAS can work with Redshift data very efficiently, whether it is for loading data into Redshift, extracting data from Redshift, or processing data inside Redshift. Check Stephen Foerster's article for an overview of SAS integration with Redshift. In this article, I'll focus on loading data from SAS (9.4 or SPRE) into Redshift, which is a typical scenario when a customer wants to progressively move or migrate their data to the cloud.
So, don't go too fast! Don't jump on your keyboard just yet.
There are various options for loading SAS data into Redshift, so you might want to evaluate them all before clicking the running man. And I can guarantee some of them are worth it.
Assume you have a 15-million-row LINEORDER table to load into Redshift. You first set up two libraries:
libname myrs redshift server="<Your-Redshift-server>"
database="<Your-Redshift-database>"
schema="<Your-Redshift-schema>"
user="<Your-Redshift-user>"
password="<Your-Redshift-password>" ;
libname local "~/data" ;
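If you want to quickly check that the Redshift connection works before going further, a pass-through query against a Redshift system view does the job. This is just a sketch: svv_tables is a standard Redshift system view, and you would filter on your own schema name.
proc sql ;
connect using myrs as myrs_pt ;
select * from connection to myrs_pt
   (select table_name from svv_tables
    where table_schema='<Your-Redshift-schema>') ;
disconnect from myrs_pt ;
quit ;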
Then, you create the target table structure:
proc sql ;
connect using myrs as myrs_pt ;
execute(
CREATE TABLE lineorder
(
lo_orderkey INTEGER NOT NULL,
lo_linenumber INTEGER NOT NULL,
lo_custkey INTEGER NOT NULL,
lo_partkey INTEGER NOT NULL,
lo_suppkey INTEGER NOT NULL,
lo_orderdate INTEGER NOT NULL,
lo_orderpriority VARCHAR(15) NOT NULL,
lo_shippriority VARCHAR(1) NOT NULL,
lo_quantity INTEGER NOT NULL,
lo_extendedprice INTEGER NOT NULL,
lo_ordertotalprice INTEGER NOT NULL,
lo_discount INTEGER NOT NULL,
lo_revenue INTEGER NOT NULL,
lo_supplycost INTEGER NOT NULL,
lo_tax INTEGER NOT NULL,
lo_commitdate INTEGER NOT NULL,
lo_shipmode VARCHAR(10) NOT NULL
)
) by myrs_pt ;
disconnect from myrs_pt ;
quit ;
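As a quick optional check, you can describe the new (still empty) table from SAS through the library:
proc contents data=myrs.lineorder ;
run ;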
Now you are ready to load data into the empty Redshift table. Let's run a basic PROC APPEND, but let's NOT do it on the entire table first.
86 proc append base=myrs.lineorder
87 data=local.lineorder(obs=50000) ;
88 run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 50000 observations read from the data set LOCAL.LINEORDER.
NOTE: 50000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 54.37 seconds
cpu time 1.82 seconds
That works well. But at 54 seconds for 50,000 records, loading the entire 15-million-row table would take me about 4.5 hours. There must be a better solution.
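For the record, here is the back-of-the-envelope extrapolation behind the 4.5-hour estimate, a rough sketch that assumes the load rate stays constant:
data _null_ ;
   batches = 15000000 / 50000 ;            /* 300 chunks of 50,000 rows */
   est_hours = batches * 54.37 / 3600 ;    /* about 4.5 hours */
   put "Estimated full load: " est_hours 5.1 " hours" ;
run ;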
SAS/ACCESS has provided options to control various buffers for years: READBUFF, INSERTBUFF, and UPDATEBUFF.
What is the default INSERTBUFF value for Amazon Redshift? And can I expect gains if I increase this buffer? We'll know soon. Let's find out.
How do you find the default INSERTBUFF value?
Run this piece of code and look for INSERTBUFF in the SAS log.
options sastrace=",,d," sastraceloc=saslog ;
proc append base=myrs.lineorder
data=local.lineorder(obs=1) ;
run ;
options sastrace=off ;
SAS log (partial):
REDSHIFT: Autoguess INSERTBUFF = 250 20359 1571671882 no_name 0 APPEND
REDSHIFT: Enter setinsertbuff, table is LINEORDER, numrows = 250, statement 0, connection 2 20362 1571671882 no_name 0 APPEND
250 is the default value for Redshift. Let’s try different values like this:
135 proc append base=myrs.lineorder(insertbuff=4096)
136 data=local.lineorder(obs=50000) ;
137 run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 50000 observations read from the data set LOCAL.LINEORDER.
NOTE: 50000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 3.80 seconds
cpu time 0.43 seconds
Results:
Huh? Pretty awesome. The run time is divided by 16, that is, 94% faster!
Setting INSERTBUFF above 4096 gives similar run times. I can try with more data to see how it behaves:
84 proc append base=myrs.lineorder(insertbuff=32767)
85 data=local.lineorder(obs=5000000) ;
86 run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 5000000 observations read from the data set LOCAL.LINEORDER.
NOTE: 5000000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 5:26.41
cpu time 39.55 seconds
5:26.41 to load 5,000,000 records into Redshift, so I can expect to load my entire 15-million-row table in about 16 minutes. That's a great improvement compared to the 4.5 hours estimated with the default settings.
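By the way, if you want every load through the library to benefit from the bigger buffer, INSERTBUFF is also available as a LIBNAME option (check the SAS/ACCESS to Redshift documentation for your release). A sketch reusing the connection options from above:
libname myrs redshift server="<Your-Redshift-server>"
   database="<Your-Redshift-database>"
   schema="<Your-Redshift-schema>"
   user="<Your-Redshift-user>"
   password="<Your-Redshift-password>"
   insertbuff=4096 ;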
Wait. There’s also this bulk loading mechanism. Is it available with Redshift? Yes.
Unlike the bulk loading capabilities of some other databases, which sometimes require additional software components such as SQL*Loader for Oracle, Redshift bulk loading relies only on AWS S3 as a staging area for moving data.
When bulk loading is active, SAS exports the SAS data set as a set of text files (with a .dat extension) using a default delimiter (the bell character), uploads them to AWS S3 using the AWS S3 API, and finally runs a Redshift COPY command to load the text files into the existing Redshift table. This is particularly efficient.
From a code perspective, it looks like this:
84 proc append base=myrs.lineorder(bulkload=yes
84 ! bl_bucket="sas-viyadeploymentworkshop/gel/aws_data_management/redshift/temp_bulk_loading")
85 data=local.lineorder(obs=5000000) ;
86 run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 5000000 observations read from the data set LOCAL.LINEORDER.
NOTE: 5000000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
62 1571739581 no_name 0 APPEND
REDSHIFT_15: Executed: on connection 2 63 1571739581 no_name 0 APPEND
copy "public".LINEORDER ("lo_orderkey","lo_linenumber","lo_custkey","lo_partkey","lo_suppkey","lo_orderdate","lo_orderpriority","lo_shippriority","lo_quantity","lo_extendedprice","lo_ordertotalprice","lo_discount","lo_revenue","lo_supplycost","lo_tax","lo_commitdate","lo_shipmode") FROM 's3://sas-viyadeploymentworkshop/gel/aws_data_management/redshift/temp_bulk_loading/SASRSBL_1AC33569-9638-3B47-9714-E1F5D307B619.manifest' ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN '' DELIMITER '\007' MANIFEST REGION 'us-east-1' 64 1571739581 no_name 0 APPEND
65 1571739581 no_name 0 APPEND
NOTE: PROCEDURE APPEND used (Total process time):
real time 31.91 seconds
cpu time 20.52 seconds
Wow! The run time is divided by 10 again: 90% faster than using INSERTBUFF! I can expect to load my entire table in about 1:30 (90 seconds). Amazing.
In order to use the Redshift bulk loading feature, you need to have your AWS keys properly set up. In my case, the AWS keys were defined in my user profile, under a .aws sub-directory. SAS picks them up automatically, and I only had to specify the target AWS S3 bucket for the temporary files, using the BL_BUCKET option. You might have to use additional bulk loading options to point to your AWS profile or config file, or to set the AWS keys directly in the program (though that is not recommended). For more information, check the documentation.
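For reference, the file SAS picked up in my case is the standard AWS shared credentials file; a minimal version looks like this (the values shown are placeholders, of course):
# ~/.aws/credentials
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>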
So, from the default behavior to the bulk load test, I was able to reduce the run time by 99.4%.
These numbers were measured in a particularly favorable environment for this scenario, since SAS Viya (and SPRE) was deployed on AWS EC2 machines residing in the same AWS region as the Redshift single-node cluster. Thus, the bandwidth between SAS, S3, and Redshift was very high.
This is probably not the use case I mentioned earlier, where customers move or migrate their data from on-premises to the cloud. However, it shows how efficient SAS can be with Redshift data when SAS is deployed in AWS, a scenario that will become very common, if it isn't already.
Notice that the Redshift bulk loading principle (using S3 as a staging area between SAS and Redshift) applies to many data loading and unloading situations with SAS, even though in this blog I focused only on loading data from SAS to Redshift using a SAS engine.
To simulate moving data from on-premises to the cloud, I took an extreme case: loading SAS data from the SAS installation on my laptop, located in France (not a very good idea when the Redshift instance is in Northern Virginia), over a poor upload bandwidth. The resulting timings are not as impressive as before (with SAS deployed in AWS), but they give you an idea of the relative performance of the different options.
In this context, an additional option can be very useful when bandwidth is limited: BL_COMPRESS compresses the data files in gzip format on the SAS machine before moving them to AWS S3. This is more CPU intensive on the SAS machine, but here it made bulk loading around 2.5 times faster, and 5 times faster than the default loading option.
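For illustration, here is a sketch of the same bulk load with compression turned on, assuming the same kind of S3 staging bucket as before:
proc append base=myrs.lineorder(bulkload=yes
                                bl_bucket="<your-S3-bucket>/temp_bulk_loading"
                                bl_compress=yes)
            data=local.lineorder ;
run ;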
Thanks for reading.
Nice benchmark, Nicolas! I'll point to it in the next edition of my book. Benchmarked comparisons are the best; without figures to back them, performance discussions are just fluff. 🙂
Are all of the examples done with SAS/ACCESS for Redshift installed, or can they be done with SAS/ACCESS for ODBC? I used the first two methods successfully with smaller amounts of data with SAS/ACCESS for ODBC, but not the bulk load.
Thanks for your message, @BarryParish2.
No, the capabilities won't be the same between SAS/ACCESS to Redshift and SAS/ACCESS to ODBC. Although there seems to be a BULKLOAD option in SAS/ACCESS to ODBC, it is not equivalent to the SAS/ACCESS to Redshift bulk load option, which relies on S3.