Hi,
I've been put in a situation where I need to read some records hourly from a SQL Server database (Latin-9 encoding) and insert them into another dataset, in Postgres, set up as UTF-8.
My SAS process reads the selected records via a pass-through into a WORK dataset, and then attempts to INSERT them into the Postgres table with PROC SQL.
As expected, SAS is throwing 'invalid byte sequence for encoding "UTF8"' errors for records with extended ASCII characters, like the £ sign.
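For anyone unfamiliar with the byte-level cause of this error: Python is used here purely as an illustration (the actual pipeline is SAS), but it shows why a Latin-9 byte is rejected by a UTF-8 database.

```python
# Byte-level view of the mismatch behind the error (illustration only).
# '£' is the single byte 0xA3 in Latin-9 (ISO-8859-15), but the two
# bytes 0xC2 0xA3 in UTF-8.
pound = "£"

print(pound.encode("iso-8859-15").hex())  # -> a3
print(pound.encode("utf-8").hex())        # -> c2a3

# A raw Latin-9 byte handed to a UTF-8 consumer is rejected, because
# 0xA3 on its own is not a valid UTF-8 sequence:
try:
    b"\xa3".decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # -> invalid start byte
```

Note that 0xa3 is exactly the byte named in the Postgres error, which suggests the Latin-9 byte is being sent without transcoding.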
Unfortunately both databases are mature and widely used and I suspect their encoding cannot be changed.
I also can't afford to throw away the non-ASCII characters using the PRXCHANGE solution.
Does anyone have any options?
While playing around, I was particularly surprised that this simple line, executed as a pass-through to Postgres, also failed with the same error:
execute ( INSERT INTO myschema.utf_test SELECT '£' ) by PG;
...even though the line of SQL works fine when run directly on Postgres, using e.g. pgAdmin4.
I have no experience with connecting SAS to Postgres DB.
Try making a small SAS dataset that contains UTF-8 characters and see if you can get that to copy into the database.
You can use the ENCODING dataset option to make a copy of the data using UTF-8 encoding.
Example:
data test_latin ;
length string $4 ;
string='B1'x;
put string= $quote. +1 string $hex8.;
run;
data test_utf8(encoding='utf8');
set test_latin;
run;
data _null_;
set test_utf8(encoding='any');
put string= $quote. +1 string $hex8.;
run;
LATIN1: string="±" B1202020
UTF-8:  string="±" C2B12020
And then try to transfer it to Postgres. Perhaps if you read it using the ANY encoding like in the little example above it will work?
proc append base=pgdb.test_table data=test_utf8(encoding='any') ;
run;
Are you trying to have FEDSQL make the move for you or are you reading the data into a SAS dataset first?
Going from a single-byte encoding to a multi-byte encoding should work in most cases.
The tricky thing is to make sure you store the values in variables that are long enough to hold the expanded length. For example, the British pound symbol will go from using one byte to using two.
Depending on which encoding your SAS session is using, you can try different things.
If your session uses a single-byte encoding, then reading from the source into a SAS dataset should work without issues, since the values read will fit into the default lengths.
If your session uses UTF-8, then the character variables in any temporary SAS table will need to be longer.
In any case, the target table in Postgres needs its columns defined long enough to store the expanded values.
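To put numbers on that expansion (Python here only to show the arithmetic; the same logic drives SAS variable lengths):

```python
# Byte cost of characters under UTF-8 vs a single-byte encoding.
for ch in ["A", "£", "€"]:
    print(f"{ch!r}: {len(ch.encode('utf-8'))} byte(s) in UTF-8")

# 'A' stays at 1 byte, '£' grows to 2, '€' to 3 -- so a variable
# sized for 10 characters in a Latin-9 session may need up to 4x
# that many bytes in the worst case under UTF-8.
```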
Does code like this work?
libname FROM ... sql server database ...;
libname TO ... Postgres database ... ;
proc append
base=to.target_table (encoding='WLATIN1')
data=from.source_table(encoding='UTF8')
;
run;
Hi Tom. Thanks for your reply.
I'm reading from SQLSVR into a temporary SAS dataset first, and then pushing to Postgres. This is required because I need to do some work on the dataset as it sits in SAS before going over to PG.
I modified and ran your code but unfortunately I get these warnings:
The option ENCODING is not implemented in the POSTGRES engine.
The option ENCODING is not implemented in the SQLSVR engine.
...followed by the same UTF8 encoding error.
The PG database has been created as UTF-8, so I'm assuming that when I declare a table in it with a field of x characters wide, PG knows that it could actually be up to 4x that in bytes. Technically I shouldn't need to worry about this?
I've done a PROC OPTIONS, and the SAS session is Latin9.
Thanks Tom, that's interesting. It got me a little further.
The SAS log confirmed that the transcoding had taken place by showing C2A32020, however when I ran the APPEND, I got this warning and error:
WARNING: Variable string has different lengths on BASE and DATA files (BASE 1 DATA 4).
ERROR: No appending done because of anomalies listed above. Use FORCE option to append these files.
What puzzles me is that it's saying the SAS table is 4 bytes wide (which it is, per your initial table declaration), but that the Postgres table is only 1 wide. While it's true that the destination PG table was declared as 1 character wide, it can definitely hold a £ sign (I used pgAdmin4 to insert this test data), so behind the scenes it must be a kind of variable-byte-width field able to contain a single UTF-8 character of between 1 and 4 bytes.
If I follow the SAS log's "advice" and rerun the append with the FORCE option, I get the original error:
ERROR: During insert: ERROR: invalid byte sequence for encoding "UTF8": 0xa3; Error while executing the query
If I change the width of the original SAS table from 4 to 1, I get this error, as expected:
ERROR: Some character data was lost during transcoding in the dataset WORK.TEST_UTF8. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
So even though the 2nd SAS table is definitely UTF-8, and the Postgres table is definitely UTF-8, somewhere between the two a process is losing the Unicode specification and attempting to insert it as Latin again.
You should open a ticket with SAS support to see if there is a way to make SAS transfer the full string and let Postgres deal with strings that are too long.
Truncation would cause the invalid-encoding message: if SAS only sent the first byte of a character that requires two bytes to represent, that would be an invalid UTF-8 code.
As a workaround, can you try making another table in Postgres where the length of the variables matches the number of BYTES needed to store the UTF-8 codes (not the number of CHARACTERS they represent)? You could then load into that table from SAS and push code into the database to move the records from there to the real table.
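The truncation scenario described above can be shown at the byte level (again Python, purely as an illustration of the mechanism):

```python
# Cutting a 2-byte UTF-8 character in half leaves an invalid sequence,
# which is one way a field sized in characters but filled in bytes can
# produce an "invalid byte sequence for encoding" error.
utf8_pound = "£".encode("utf-8")   # b'\xc2\xa3'
truncated = utf8_pound[:1]         # b'\xc2' -- only the lead byte survives

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # -> unexpected end of data
```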
Actually Tom, your code works! I was using a PROC SQL at the end to insert the test_utf8 record into PG. If I run your PROC APPEND, it works!
I still can't get the PROC SQL to transfer it as UTF-8 though, but now I know I can use APPEND.
There does appear to be a bug somewhere between SAS and PG, in that if I try to insert a value with more bytes than the field has characters, it errors:
ERROR: Some character data was lost during transcoding in the dataset WORK.TEST_UTF8. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
But if I try to insert the same data using pgAdmin4 it succeeds, as expected. I'll raise this with SAS support.
Thanks for your help with this Tom, I think I have a fairly useable solution now 🙂