NOTE: I am using what look like macro variables as placeholders for information that may change or that relates to data security throughout this post. We do not use macro variables in all of these cases in the live code, so some macro variables may not use the appropriate masking functions, quotes, etc. Read these as [Add this information]-type placeholders.
We are using a libname statement to connect to an AWS Redshift Environment, where we pull and push data.
/*Connect to S3*/
%sysexec(aws sso login --profile &sso_profile.);
/*Connect to AWS Redshift using Libname*/
libname &library_name.
redshift
server="&secretinformation..us-east-2.redshift.amazonaws.com"
database=&database.
port=5439
conopts="encryptionmethod=6;validateservercertificate=0;"
schema="&schema_name."
user=&awsdwid.
password=&awsdwpwd.;
We commonly use PROC SQL for a variety of functions, either via the libname or via pass-through, like so:
/*Run a Query on AWS using Pass Thru*/
PROC SQL;
CONNECT USING &LIBNAME. AS QSELECT;
EXECUTE
(drop table if exists &table_name.)
BY QSELECT;
QUIT;
/*Run a Query in SAS using Libname*/
proc sql;
drop table &libname..&tablename.;
quit;
But these run very slowly (e.g., 3 hours for 3.1M records) using the libname method when doing SELECT-FROM and INSERT-INTO.
Right now, we are using the following method (with prep and metadata-parsing code on either side) to "speed up" this conversion using S3 buckets (from 3 hours to 5 minutes on those 3.1M records). It DOES run faster, but it requires a CSV middleman that has two data integrity issues in my book: 1) it strips metadata, so more processing is needed to recreate it in AWS, and 2) the CSV files need pre-processing to remove or translate characters that cause reading errors - this I do not like.
/* export sas dataset as a CSV */
proc export
data=sasdata_for_export
outfile="&csvdir.&csvname..csv"
dbms=dlm
replace;
delimiter='`'; *Use the back tick to avoid issues with using a comma delimiter (free text fields cause line feed issues with that delimiter);
putnames=NO; *do not add names here - use a create table statement in AWS to create the table ahead of time;
run;
/* load the CSV to s3 using the cp (copy) command*/
%sysexec(aws s3 cp "&csvdir.&csvname..csv" "s3://&bucket./upload/&csvname..csv" --profile &sso_profile.);
/* copy the CSV from s3 to redshift */
proc sql;
CONNECT USING &LIBNAME. AS QSELECT;
execute (
copy res_work.%unquote("&rstblname.")
from %unquote(%bquote('s3://&bucket./upload/&csvname..csv'))
iam_role &iam_role.
delimiter '`'
dateformat 'auto'
acceptinvchars as ' '
maxerror 10
ROUNDEC;
) BY QSELECT;
QUIT;
I have tried to use PROC S3, but couldn't come up with a syntax that worked.
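The closest I got was something like this sketch - the &aws_key_id. and &aws_secret. macro variables are hypothetical placeholders for a long-term access key pair, and I'm not certain I have the REGION value or credential options right:
/* Rough PROC S3 sketch of what I attempted (it did not authenticate for us):
   push the exported file straight to the bucket without the AWS CLI.
   aws_key_id/aws_secret are placeholder macro variables for a long-term
   access key pair; the REGION value naming may differ (e.g. 'useast2' vs 'us-east-2'). */
proc s3 keyid="&aws_key_id."
        secret="&aws_secret."
        region="useast2";
   /* PUT copies a local file to an S3 object: "local-path" "/bucket/object-key" */
   put "&csvdir.&csvname..csv" "/&bucket./upload/&csvname..csv";
run;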
Can anyone provide code that would use either PROC S3, or some other mechanism to avoid the CSV middle man but retain the speeds of the S3 buckets?
Anyone who answers with something that works would also be invited to co-author a Users Group paper/presentation on this.
I am programming using the following:
--- SAS Software ---
SAS Enterprise Guide version: 8.5 (8.5.0.203) (64-bit)
SAS Base Version: 9.4 (TS1M7)
SAS System version: 9412.0.24363.19622
Analytical products:
SAS/STAT 15.2
SAS/ETS 15.2
SAS/IML 15.2
--- Operating System ---
Name: Windows 10 Enterprise
Version: 10.0
Default web browser: Google Chrome
I guess you probably already tried playing with bulk-upload?
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/acreldb/n0u88501m7h5vjn1q2x29s6urjmx.htm
[EDIT:]
This paper from SGF 2020 also looks promising: https://support.sas.com/resources/papers/proceedings20/4468-2020.pdf
Bart
I have looked through the syntax and tried to come up with something that worked with these options before, but have not had success. If there is guidance on where to find the values for the different settings, that may be where things went wrong. I'm not familiar enough with how all the configuration works with these either, so there may be steps I missed.
Just in case, I did try this again - but I still got this secret key error:
data &libname..delgoffb_bulktest(
bulkload=yes
bl_bucket=&bucket.
bl_use_ssl=yes
bl_region='us-east-2');
set sashelp.class;
run;
ERROR: Message from TKS3: No secret key specified.
But I'm not aware of any "secret key" process we use.
It looks like your AWS connection setup needs credentials with a key & secret.
According to this doc. note: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/acreldb/p04eb8ihn5xe8in1lzdxep6opttr.htm
"Amazon Redshift, Snowflake: The secret-access-key value specifies the Amazon Web Services (AWS) secret access key or a temporary secret access key. An AWS secret access key is associated with the key ID that you specified with the BL_KEY= data set option. If you are using temporary token credentials, this value is the temporary secret access key."
Maybe ask your admins team about details for credentials?
Bart
@yabwon I received feedback that we are not allowed to create the long-term access keys, but I found some documentation from AWS on alternatives.
I'm aware that we have IAM roles set up, and I see some other options in the BL_KEY= documentation (https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/acreldb/n0i0qwjjbfu090n1mi1xxwsc2mbf.htm) that seem to reference these temporary tokens, but I'm still not sure how to programmatically obtain those tokens for use here. It's unclear which options, with which values, are required or go together, but it seems like using tokens should only be one option. I suspect a different option or set of options that use the IAM role instead of tokens would be needed, but I don't see this.
@yabwon It seems I'm getting closer to a solution, but I've encountered an error.
I was able to determine that running the following:
%sysexec(aws sso login --profile &sso_profile.);
will give you an authentication code and have you log in through a UI with that one-time code, then dump a file into the .aws cache directory (C:\Users\[username]\.aws\sso\cache\) containing temporary token and secret information.
I was NOT able to find a way to "point" SAS to that information for its use, but it is a JSON file, so you can read it in like this:
filename cred_in "C:\Users\[username]\.aws\sso\cache\[filename].json";
libname cred_lib JSON fileref=cred_in;
and get data containing fields such as clientSecret, accessToken, and region, which can be read into macro variables and used like this:
data _NULL_;
set cred_lib.root;
call symput ("SECRET",strip(clientSecret));
call symput ("TOKEN",strip(accessToken));
call symput ("REGION",strip(region));
run;
libname AWSWORKB
redshift
server="&server."
database=&database.
port=5439
conopts="encryptionmethod=6;validateservercertificate=0;"
schema="&schema."
user=&awsdwid.
password=&awsdwpwd.
dbcommit = 0 insertbuff = 10000 readbuff=10000
bulkunload=yes
bulkload=yes
bl_bucket="&bucket."
bl_token="&TOKEN."
bl_secret="&SECRET."
;
When I go to use that information/libname I get this in the log:
REDSHIFT: AUTOCOMMIT is YES for connection 7
REDSHIFT_34: Prepared: on connection 7
SELECT TOP 0 * FROM "AWS_SCHEMA".DELGOFFB_BULKTEST2
Summary Statistics for REDSHIFT are:
Total SQL prepare seconds were: 0.026091
Total seconds used by the REDSHIFT ACCESS engine were 0.026230
REDSHIFT: AUTOCOMMIT is YES for connection 9
NOTE: SAS variable labels, formats, and lengths are not written to DBMS tables.
REDSHIFT_35: Executed: on connection 9
CREATE TABLE "AWS_SCHEMA".DELGOFFB_BULKTEST2 (dept_num NUMERIC(11),dept_name VARCHAR(30),dept_specialty_group_desc
VARCHAR(60),appt_desk_id INTEGER,mapped_to_dept_num NUMERIC(11),active_status_code VARCHAR(1),dept_type_code
VARCHAR(1),company_name VARCHAR(180),mc_facility_code VARCHAR(3),facility_num NUMERIC(11),aou_dept_group VARCHAR(17))
REDSHIFT: Bulkload seconds used for setup: 0.029634
REDSHIFT: Bulkload seconds used for file open: 0.003500
NOTE: There were 722 observations read from the data set WORK.TEST_DATA.
NOTE: The data set AWSWORKB.DELGOFFB_BULKTEST2 has 722 observations and 11 variables.
REDSHIFT: Bulkload seconds used for writes: 0.007592
REDSHIFT: Bulkload seconds used for file close: 0.005716
REDSHIFT: Bulkload seconds used for file delete: 0.472694
REDSHIFT: Bulkload total seconds used: 0.519328
Summary Statistics for REDSHIFT are:
Total SQL execution seconds were: 0.773501
Total SQL prepare seconds were: 0.026091
Total seconds used by the REDSHIFT ACCESS engine were 1.321699
ERROR: Message from TKS3: The file is already locked by another user
NOTE: DATA statement used (Total process time):
real time 1.85 seconds
cpu time 0.26 seconds
and when I open the table, I find all the variables but no rows. This is the code I'm running:
options sastrace=',,,ds' sastraceloc=saslog nostsuffix;
data AWSWORKB.DELGOFFB_BULKTEST2
(bulkload=yes
bl_bucket="&bucket.\upload"
bl_token="&TOKEN."
bl_secret="&SECRET."
bl_region='us-east-2'
bl_use_ssl=yes
);
set WORK.TEST_DATA;
run;
Have you or anyone else seen this error? It looks like it's doing some work with the bulk load, so I don't think it's an access issue - but there's no other "user" that would have the table open. I even tried changing the names, but I think it's seeing its own shadow (like a groundhog).
Maybe it's like the last message in this thread:
https://communities.sas.com/t5/All-Things-Community/Bulkload-with-Redshift/m-p/470173#M3253
?
Bart
You haven't posted any SAS logs of your slow SELECT-FROM and INSERT-INTO query steps. It would be helpful to see examples. Often tweaking connection settings can improve performance, but without seeing examples it's hard to know what to suggest.
You can use simple code generation to make your current process easier to use. For example, you could make an %UPLOAD() macro that takes as inputs the data set name to upload and the target Redshift table name to create (or insert into).
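A rough sketch of such a macro, stitched together from your own export / s3 cp / COPY steps - it assumes the &csvdir., &bucket., &sso_profile., &iam_role., &schema_name., and &libname. macro variables you already set up, and that the target table already exists, as in your current process:
%macro upload(sasdata, rstable);
   /* 1) write the SAS data set out as a delimited file with no header row */
   proc export data=&sasdata.
        outfile="&csvdir.&rstable..csv"
        dbms=dlm replace;
      delimiter='`';
      putnames=no;
   run;

   /* 2) copy the file up to the S3 bucket */
   %sysexec(aws s3 cp "&csvdir.&rstable..csv" "s3://&bucket./upload/&rstable..csv" --profile &sso_profile.);

   /* 3) load it into the target Redshift table with pass-through COPY */
   proc sql;
      connect using &libname. as qselect;
      execute (
         copy &schema_name..&rstable.
         from %unquote(%bquote('s3://&bucket./upload/&rstable..csv'))
         iam_role &iam_role.
         delimiter '`' dateformat 'auto' acceptinvchars as ' ' maxerror 10
      ) by qselect;
   quit;
%mend upload;

/* example call */
%upload(work.sasdata_for_export,delgoffb_bulktest)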
You can improve your speed by taking the following actions:
1) Don't bother to include the header row, since you already know what you are sending. That way you can replace the PROC EXPORT with a simple data step:
data _null_;
set datasetname;
file csv dsd lrecl=200000;
put (_all_) (+0);
run;
2) Generate a compressed file using the ZIP filename engine with the GZIP option:
filename csv zip "&csvdir.&csvname..csv.gz" gzip ;
And add the appropriate options on the Redshift side to uncompress as it reads the file (see the COPY sketch after the next paragraph). That will reduce a lot of I/O. It will make creating the file faster, moving it to S3 faster, and reading it into Redshift faster.
The problem you mention with the comma as delimiter should not actually be a concern. Redshift CAN read (and create) normal CSV files like SAS generates, where values with embedded delimiters (or quotes) are quoted. You just have to make sure to use the options that remove the quotes from around values and treat adjacent quotes as an embedded quote character, instead of Redshift's default of expecting files that use a backslash character to escape embedded delimiters.
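In Redshift COPY terms, that is the CSV format keyword (doubled quotes are treated as an embedded quote) plus GZIP for the compressed file - roughly like this sketch, reusing the macro variables from your pass-through step; check the COPY documentation for the exact option interactions:
/* CSV tells COPY to parse standard quoted CSV (adjacent quotes = embedded quote);
   GZIP tells it to uncompress the object as it loads */
proc sql;
   connect using &libname. as qselect;
   execute (
      copy res_work.%unquote("&rstblname.")
      from %unquote(%bquote('s3://&bucket./upload/&csvname..csv.gz'))
      iam_role &iam_role.
      csv
      gzip
      dateformat 'auto' acceptinvchars as ' ' maxerror 10
   ) by qselect;
quit;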
You will still need to worry about embedded end-of-line characters. SAS will not automatically add the quotes around such values that are required to make the generated file readable. You could waste a little bit of time by adding quotes around everything:
data _null_;
set datasetname;
file csv dsd lrecl=200000;
put (_all_) (~);
run;
Or work a little harder when generating the code to make the CSV file by adding the quotes around every character variable.
data _null_;
set datasetname;
file csv dsd lrecl=200000;
put name ~ sex ~ age height weight ;
run;
And when moving data from Redshift to SAS, you might need to worry about the accidental field merging that can happen in the rare situation where one field starts with a single quote and a later field ends with a single quote. Because SAS treats strings quoted with either single quotes or double quotes the same, that can make the line look like it has too few values.
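Here is a tiny made-up illustration of that merging when SAS reads a comma-delimited line with DSD:
/* hypothetical example: the leading single quote in field 1 and the trailing
   single quote in field 3 make DSD read most of the line as one quoted value */
data demo;
   infile datalines dsd dlm=',' truncover;
   input model :$40. color :$10. height :$10.;
   put model= color= height=;
datalines;
'98 Mustang,blue,6'
2005 Civic,red,5
;
run;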
If you are using CAS, you might look into generating PARQUET files. Those are much smaller and faster to load, although they might take longer to generate since that does take a lot of processing. But again, you will get some speed improvement from writing less data to disk.
These tips worked well for increasing the speed and success of the CSV middleman - but it was already sufficiently fast. I've made these adjustments to our CSV process, but I'd like to not use a CSV at all. We're just trying to "move"/"load" data currently in SAS onto Redshift, or "query"/"pull" data from Redshift into SAS, ideally without any hard-coding or unique consideration for the contents of the data set.
I'd like the coding required to do this to be as simple as one of these, but to run as fast as the CSV-via-S3 process:
data AWSLIBNAME.AWS_TABLE_NAME;
set SASLIBNAME.SAS_TABLE_NAME;
run;
PROC SQL;
create table AWSLIBNAME.AWS_TABLE_NAME as
SELECT * FROM SASLIBNAME.SAS_TABLE_NAME;
QUIT;
Related to "1) Don't bother to include the header row, since you already know what you are sending.", in the case of the CSV process this is true and a helpful tip. In practice what we are sending changes each time and may not be "known" each time - so I suspect you mean know based on creating the table first after looking?
It is a downside that we have to create the table in AWS ahead of time to use the CSV and right now we're using metadata mining to help do that part programmatically, but I was hoping that if we removed the CSV middle man, we could also remove the need for "knowing what we're sending" step too. I'm hoping SAS would make those data type decisions, but maybe I'm misunderstanding.
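For context, the metadata mining is conceptually like this simplified sketch (not our production code - it maps every numeric to FLOAT, whereas we do more careful type and length mapping):
/* build a Redshift column list from DICTIONARY.COLUMNS, then create the table
   via pass-through; table and library names follow the earlier examples */
proc sql noprint;
   select catx(' ',
               name,
               case when type = 'char' then cats('VARCHAR(', length, ')')
                    else 'FLOAT'
               end)
      into :collist separated by ', '
      from dictionary.columns
      where libname = 'WORK' and memname = 'SASDATA_FOR_EXPORT'
      order by varnum;
quit;

proc sql;
   connect using &libname. as qselect;
   execute (drop table if exists &schema_name..&rstblname.) by qselect;
   execute (create table &schema_name..&rstblname. (&collist.)) by qselect;
quit;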
Related to "embedded end of line characters", I'm still having spill over issues that look to be caused by line feeds and data type mismatches:
ERROR: CLI execute error: [SAS][ODBC Redshift Wire Protocol driver][Redshift]ERROR: Load into table 'delgoffb_bulktest' failed.
Check 'stl_load_errors' system table for details.(File /home/ec2-user/padb/src/pg/src/backend/commands/commands_copy.c; Line
769; Routine CheckMaxRowError; )
When running:
PROC SQL;
&RDWWORKCONNECT;
CREATE TABLE LAST_RUN_LOADERRORS AS
SELECT DISTINCT * &RDWFROMCONNECT (
select * from stl_load_errors
)
&RDWCLOSE
having max(starttime) = starttime /*Select Only Those from Most Recent Starttime*/
order by starttime desc
;QUIT;
The error is:
" CR ham and beef" is not an acceptable value to the field 'fake_identifier' (which is defined as integer)
You see this in the interim CSV:
fake_identifier,rdw_date,wonky_field
800231,12/30/2024,tab
800230,12/30/2024,",
CR ham and beef",,
To avoid this, I usually parse the data in a data step using code like the following (though that all increases run time) to remove anything that could potentially cause an issue or doesn't need to be in the final value (like HTML or plain-text tags):
[more_code]
%if &CHAR_CLEAN. = Y %then %do;
*The default cleaning, converting the delimiter to
something else in the data, should be performed unless
you are confident it does not appear in any text field;
array clean_fancy _CHAR_;
do over Clean_fancy;
%if &FREE_TEXT.=Y %then %do;
*if no fields could contain these issues, do not run
this extra code, as it will increase run time
unnecessarily;
*convert HTML entity escapes back to plain text;
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'&lt;','<');
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'&gt;','>');
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'&amp;','&');
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'&nbsp;',' ');
*remove any html tags;
retain rx1;
if _n_=1 then rx1=prxparse("s/<.*?>//");
call prxchange(rx1,-1,Clean_Fancy);
drop rx1;
*strip blanks;
CLEAN_FANCY=compress(CLEAN_FANCY,'0A090B0D'x);
CLEAN_FANCY=strip(CLEAN_FANCY);
%end;
*Translate the delimiter in text fields to something else to avoid upload issues;
CLEAN_FANCY=translate(CLEAN_FANCY,"'","`");
end; *end array;
%end;
[more code]
It works, but it takes many more lines of code, changes the data, and adds processing time.
Have you ever tried PROC DATASETS APPEND? We use this with SQL Server and get seriously good performance as long as we also set the INSERTBUFF and DBCOMMIT options:
libname AWSLIBNAME
redshift
server="&secretinformation..us-east-2.redshift.amazonaws.com"
database=&database.
port=5439
conopts="encryptionmethod=6;validateservercertificate=0;"
schema="&schema_name."
user=&awsdwid.
password=&awsdwpwd.
dbcommit = 0
insertbuff = 10000;
proc datasets library = AWSLIBNAME nolist;
append base = AWS_TABLE_NAME data = SASLIBNAME.SAS_TABLE_NAME;
run;
quit;
Playing with the INSERTBUFF value may improve performance further.
This was a pretty good increase in performance!
By adding these options to the existing libname:
dbcommit = 0
insertbuff = 10000
I'm seeing 29 seconds for 315K records instead of 14 minutes with the same syntax against the previous libname (without those options):
PROC SQL FEEDBACK;
INSERT INTO OLDAWSCONNECT.BED281_&ReqNum._MHNs
SELECT DISTINCT
VAR1
,VAR2
FROM SASLIBNAME.DATASET
;
QUIT;
Then I see 27 seconds when using this syntax to create the blank table, create the SAS table, and then load records in:
PROC SQL FEEDBACK;
&CONNECT_INFO.;
EXECUTE (DROP TABLE IF EXISTS NEWAWS.BED281_&REQNUM._MHNs)&CONNECT_BY. ;
EXECUTE (
CREATE TABLE NEWAWS.BED281_&ReqNum._MHNs
("VAR1" INTEGER NOT NULL
,"VAR2" INTEGER NOT NULL
)
DISTKEY (VAR1)
) &CONNECT_BY. ;
CREATE TABLE SASLIBNAME.DATASET AS
SELECT DISTINCT
VAR1
,VAR2
FROM SASLIBNAME.DATA_WITH_MANYMORE_COLUMNS
;
QUIT;
proc datasets library = NEWAWS nolist;
append base = BED281_&ReqNum._MHNs data = SASLIBNAME.DATASET;
run; quit;
So it seems that the DBCOMMIT and INSERTBUFF options on the libname have helped a lot. Even if I don't get the S3 bucket options figured out, this could be a good option.
@Delgoffe_MCRI - That's a big improvement! You could try INSERTBUFF values like 15000 and 20000 to see what the optimal value is.
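If it helps, you can also override those values for a single load as data set options, without touching the libname - a quick sketch using the table names from your example:
/* try a larger insert buffer on just this load -
   data set options override the libname settings */
proc datasets library = NEWAWS nolist;
   append base = BED281_&ReqNum._MHNs (insertbuff=20000 dbcommit=0)
          data = SASLIBNAME.DATASET;
run;
quit;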