These tips worked well for increasing the speed and reliability of the CSV middleman, but it was already sufficiently fast. I've made these adjustments to our CSV process, but I'd like to not use a CSV at all. We're simply trying to "move"/"load" data currently in SAS onto Redshift, or "query"/"pull" from Redshift into SAS, ideally without any hard coding or special handling based on the contents of the dataset.
I'd like the code required to do this to be as simple as one of the examples below, but run as fast as the CSV-via-S3 process.
data AWSLIBNAME.AWS_TABLE_NAME;
   set SASLIBNAME.SAS_TABLE_NAME;
run;

or

PROC SQL;
   create table AWSLIBNAME.AWS_TABLE_NAME as
   SELECT * FROM SASLIBNAME.SAS_TABLE_NAME;
QUIT;
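For clarity, AWSLIBNAME here would be a SAS/ACCESS ODBC libname pointed at Redshift and SASLIBNAME an ordinary SAS library, roughly like the sketch below (shortened librefs since real librefs max out at 8 characters; the DSN, credentials, schema, and path are placeholders):

* hypothetical libname assignments - DSN, credentials, schema, and path are placeholders;
libname SASLIB "/some/sas/library/path";
libname AWSLIB odbc datasrc="redshift_dsn" user="rdw_user" password="XXXXXXXX" schema="public";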
Related to "1) Don't bother to include the header row, since you already know what you are sending.", in the case of the CSV process this is true and a helpful tip. In practice what we are sending changes each time and may not be "known" each time - so I suspect you mean know based on creating the table first after looking?
It is a downside that we have to create the table in AWS ahead of time to use the CSV. Right now we're using metadata mining to do that part programmatically, but I was hoping that if we removed the CSV middleman, we could also remove the "knowing what we're sending" step. I'm hoping SAS would make those data type decisions for us, but maybe I'm misunderstanding.
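For context, the metadata-mining step is roughly the sketch below (the libref, table name, and macro variable are placeholders, and mapping every numeric column to float is a simplification): it reads dictionary.columns and builds the column list for a CREATE TABLE that we then run on Redshift through pass-through before the COPY.

proc sql noprint;
   /* build "column_name data_type" pairs from the SAS column metadata    */
   /* (character lengths carried over; numerics simplified to float here) */
   select catx(' ',
               lowcase(name),
               case when type = 'char' then cats('varchar(', length, ')')
                    else 'float'
               end)
      into :col_defs separated by ', '
      from dictionary.columns
      where libname = 'SASLIB' and memname = 'SAS_TABLE_NAME';
quit;

/* &col_defs. then goes into a create table statement executed on Redshift via pass-through */
%put create table aws_table_name (&col_defs.);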
Related to "embedded end of line characters", I'm still having spill over issues that look to be caused by line feeds and data type mismatches:
ERROR: CLI execute error: [SAS][ODBC Redshift Wire Protocol driver][Redshift]ERROR: Load into table 'delgoffb_bulktest' failed.
Check 'stl_load_errors' system table for details.(File /home/ec2-user/padb/src/pg/src/backend/commands/commands_copy.c; Line
769; Routine CheckMaxRowError; )
When running:
PROC SQL;
   &RDWWORKCONNECT;
   CREATE TABLE LAST_RUN_LOADERRORS AS
   SELECT DISTINCT * &RDWFROMCONNECT (
      select * from stl_load_errors
   )
   &RDWCLOSE
   having max(starttime) = starttime /*Select Only Those from Most Recent Starttime*/
   order by starttime desc
;QUIT;
The error is:
" CR ham and beef" is not an acceptable value to the field 'fake_identifier' (which is defined as integer)
You see this in the interim CSV:
fake_identifier,rdw_date,wonky_field
800231,12/30/2024,tab
800230,12/30/2024,", CR ham and beef",,
To avoid this I usually clean the data in a DATA step with code like the following (which adds run time), removing anything that could potentially cause an issue or that doesn't need to be in the final value (like HTML tags or their plain-text equivalents):
[more code]
%if &CHAR_CLEAN. = Y %then %do;
*The default cleaning, which converts the delimiter in the data to
 something else, should be performed unless you are confident the
 delimiter does not appear in any text field;
array clean_fancy _CHARACTER_;
do over Clean_fancy;
%if &FREE_TEXT.=Y %then %do;
*if there are no fields that could contain these issues,
 don't run this extra code, since it will increase run time
 unnecessarily;
*convert plain-text (escaped) tags and entities into real tags so the HTML-stripping step below removes them;
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'&lt;','<');
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'&gt;','>');
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'&','<p>');
CLEAN_FANCY = tranwrd(CLEAN_FANCY,'nbsp;','<p>');
*remove any html tags;
retain rx1;
if _n_=1 then rx1=prxparse("s/<.*?>//");
call prxchange(rx1,-1,Clean_Fancy);
drop rx1;
*remove line feeds, tabs, vertical tabs, and carriage returns, then strip leading and trailing blanks;
CLEAN_FANCY=compress(CLEAN_FANCY,'0A090B0D'x);
CLEAN_FANCY=strip(CLEAN_FANCY);
%end;
*Translate the delimiter in text fields to something else to avoid upload issues;
CLEAN_FANCY=translate(CLEAN_FANCY,"'","`");
end; *end array;
%end;
[more code]
It works, but at the cost of many more lines of code, changes to the data, and extra processing time.