I have created a Teradata table as below
create multiset table d_test.test_char_length (
contract_id int
,var1 varchar(5) character set latin
,var2 varchar(5) character set unicode
)unique primary index(contract_id)
;
Inserted a dummy row as below
insert into d_test.test_char_length values (123,'home','home');
Below is the SAS code to read the Teradata table and PROC CONTENTS on the table in Teradata and in SAS.
data test_char_length;
set d_test.test_char_length;
run;
proc contents data=d_test.test_char_length;
run;
proc contents data=test_char_length;
run;
Below is the output
10:14 Tuesday, October 16, 2018 1
The CONTENTS Procedure
Data Set Name D_TEST.test_char_length Observations .
Member Type DATA Variables 3
Engine TERADATA Indexes 0
Created . Observation Length 0
Last Modified . Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted NO
Label
Data Representation Default
Encoding Default
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
1 contract_id Num 8 11. 11. contract_id
2 var1 Char 10 $10. $10. var1
3 var2 Char 15 $15. $15. var2
10:14 Tuesday, October 16, 2018 2
The CONTENTS Procedure
Data Set Name WORK.TEST_CHAR_LENGTH Observations 1
Member Type DATA Variables 3
Engine V9 Indexes 0
Created 10/16/2018 10:22:00 Observation Length 40
Last Modified 10/16/2018 10:22:00 Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted NO
Label
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding utf-8 Unicode (UTF-8)
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 1632
Obs in First Data Page 1
Number of Data Set Repairs 0
Filename /saswork/SAS_work0E0F00005108_sas-compute/SAS_workD74B00005108_sas-compute/test_char_length.sas7bdat
Release Created 9.0401M4
Host Created Linux
Inode Number 4063249
Access Permission rw-r--r--
Owner Name unxsrv
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
1 contract_id Num 8 11. 11. contract_id
2 var1 Char 10 $10. $10. var1
3 var2 Char 15 $15. $15. var2
My question is why the length of var1 (LATIN encoding in Teradata) in SAS is 10 instead of 5 and var2 (UNICODE encoding in Teradata) is 15 instead of 5. I was expecting var1 to be 5 and var2 to be 10 in SAS.
Regards,
Shakir
So you have UTF-8 encoding all the way through.
Mind that SAS keeps encoding on a file/dataset and not a column level. You can see that in your proc contents output.
My final guess: SAS tries to "play it safe" with regards to defined lengths; if that does not work for you, you will have to override that, either by setting lengths explicitly, or by trying to work with dbsastype options.
Look at the contents of var1 and var2 with $hex20. and $hex30., respectively. You might find surprising things in there.
My guess:
SAS recodes the latin to UTF-16 for var1, and keeps UTF-8 (which may need more than 2 bytes per character) for var2.
This is what I did
data test_char_length;
set d_dmin.test_char_length;
put var1 $10.;
put var1 $hex20.;
put var2 $15.;
put var2 $hex30.;
run;
This is the log
15 data test_char_length;
16 set d_dmin.test_char_length;
17 put var1 $10.;
18 put var1 $hex20.;
19 put var2 $15.;
20 put var2 $hex30.;
21 run;
home
686F6D65202020202020
home
686F6D652020202020202020202020
NOTE: There were 1 observations read from the data set D_DMIN.test_char_length.
NOTE: The data set WORK.TEST_CHAR_LENGTH has 1 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.16 seconds
cpu time 0.03 seconds
Also this post https://community.teradata.com/t5/Database/UTF-8-or-UTF-16/m-p/61680 says unicode in Teradata is always UTF-16.
Is it possible in SAS for one column to be LATIN and another one to be UTF8 or UTF16 in the same Table.
Also can we somehow configure SAS and/or Teradata Client on the SAS server, so that for var1 the length in SAS is 5, since its LATIN and 1 byte per character is enough for LATIN.
So you have UTF-8 encoding all the way through.
Mind that SAS keeps encoding on a file/dataset and not a column level. You can see that in your proc contents output.
My final guess: SAS tries to "play it safe" with regards to defined lengths; if that does not work for you, you will have to override that, either by setting lengths explicitly, or by trying to work with dbsastype options.
Dear Kurt,
I see what you mean by SAS tries to "play it safe" with regards to defined lengths. There was a gap in my understanding of UTF-8 and how many bytes are used by it for storage. When I read the Efficiency section of https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings I see that characters with code points with 8 bits i.e. U+0080 to U+00FF are stored in UTF-8 in 2 bytes and not one. And characters with code points with 16 bits are stored with 3 bytes.
Teradata LATIN CHAR(n) needs 8 bits per n hence in SAS UTF-8 the length becomes 2n, and Teradata UNICODE CHAR(n) needs 16 bits per n hence in SAS UTF-8 the length becomes 3n.
Now I store English characters in Teradata LATIN which only need 7 bits so I can use dbsastype to change Teradata LATIN CHAR(n) to SAS UTF-8 length n.
I use Teradata UNICODE only for Arabic characters, which fit in UTF-8 with 16 bits so I can use dbsastype to change Teradata UNICODE CHAR(n) to SAS UTF-8 length 2n.
It appears using dbsastype, I will have to do column by column, is there a universal SAS OPTION to do all columns together? Else I write a SAS macro leveraging Teradata dictionary under DB DBC.* and dbsastype.
Thank You Kurt.
There are settings that do help tell SAS how to automatically adjust lengths to allow room for transcoding of characters.
But if you want a specific behavior then you will need to code your own logic.
Note that just because your Teradata column is using LATIN1 it does not mean that your SAS variable can store the value in the same number of bytes. In SAS you cannot have one variable using one encoding and another using a different encoding. So if you want to read your UTF8 data then SAS will need to use UTF8 for your dataset. That means that any non-standard (non 7bit) ASCII codes in your LATIN1 data will take more than one byte to store in the SAS variable.
Dear Tom,
Thank you for sharing this option with me.
I will play around with it. I was wondering if this will apply to Teradata as the documentation says
Data source: | Oracle |
I agree with you on "any non-standard (non 7bit) ASCII codes in your LATIN1 data will take more than one byte to store in the SAS variable". At least for the columns that I create using LATIN I know the variable is ASCII and can fit in 7 bits.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.