BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Shakir_Juolay
Obsidian | Level 7

I have created a Teradata table as below

create multiset table d_test.test_char_length (
    contract_id int
   ,var1 varchar(5) character set latin
   ,var2 varchar(5) character set unicode
)unique primary index(contract_id)
;

Inserted a dummy row as below

insert into d_test.test_char_length values (123,'home','home');

Below is the SAS code to read the Teradata table and PROC CONTENTS on the table in Teradata and in SAS.

data test_char_length;
    set d_test.test_char_length;
run;

proc contents data=d_test.test_char_length;
run;

proc contents data=test_char_length;
run;

Below is the output

                                                                                                 10:14 Tuesday, October 16, 2018   1

                                                       The CONTENTS Procedure

                              Data Set Name        D_TEST.test_char_length    Observations          . 
                              Member Type          DATA                       Variables             3 
                              Engine               TERADATA                   Indexes               0 
                              Created              .                          Observation Length    0 
                              Last Modified        .                          Deleted Observations  0 
                              Protection                                      Compressed            NO
                              Data Set Type                                   Sorted                NO
                              Label                                                                   
                              Data Representation  Default                                            
                              Encoding             Default                                            


                                             Alphabetic List of Variables and Attributes
 
                                #    Variable       Type    Len    Format    Informat    Label

                                1    contract_id    Num       8    11.       11.         contract_id
                                2    var1           Char     10    $10.      $10.        var1       
                                3    var2           Char     15    $15.      $15.        var2       
                                                                                                 10:14 Tuesday, October 16, 2018   2

                                                       The CONTENTS Procedure

               Data Set Name        WORK.TEST_CHAR_LENGTH                                    Observations          1 
               Member Type          DATA                                                     Variables             3 
               Engine               V9                                                       Indexes               0 
               Created              10/16/2018 10:22:00                                      Observation Length    40
               Last Modified        10/16/2018 10:22:00                                      Deleted Observations  0 
               Protection                                                                    Compressed            NO
               Data Set Type                                                                 Sorted                NO
               Label                                                                                                 
               Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64                            
               Encoding             utf-8  Unicode (UTF-8)                                                           


                                                 Engine/Host Dependent Information

  Data Set Page Size          65536                                                                                               
  Number of Data Set Pages    1                                                                                                   
  First Data Page             1                                                                                                   
  Max Obs per Page            1632                                                                                                
  Obs in First Data Page      1                                                                                                   
  Number of Data Set Repairs  0                                                                                                   
  Filename                    /saswork/SAS_work0E0F00005108_sas-compute/SAS_workD74B00005108_sas-compute/test_char_length.sas7bdat
  Release Created             9.0401M4                                                                                            
  Host Created                Linux                                                                                               
  Inode Number                4063249                                                                                             
  Access Permission           rw-r--r--                                                                                           
  Owner Name                  unxsrv                                                                                              
  File Size                   128KB                                                                                               
  File Size (bytes)           131072                                                                                              


                                             Alphabetic List of Variables and Attributes
 
                                #    Variable       Type    Len    Format    Informat    Label

                                1    contract_id    Num       8    11.       11.         contract_id
                                2    var1           Char     10    $10.      $10.        var1       
                                3    var2           Char     15    $15.      $15.        var2       

My question is why the length of var1 (LATIN encoding in Teradata) in SAS is 10 instead of 5 and var2 (UNICODE encoding in Teradata) is 15 instead of 5. I was expecting var1 to be 5 and var2 to be 10 in SAS.

 

Regards,

Shakir

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

So you have UTF-8 encoding all the way through.

 

Mind that SAS keeps encoding on a file/dataset and not a column level. You can see that in your proc contents output.

 

My final guess: SAS tries to "play it safe" with regards to defined lengths; if that does not work for you, you will have to override that, either by setting lengths explicitly, or by trying to work with dbsastype options.

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

Look at the contents of var1 and var2 with $hex20. and $hex30., respectively. You might find surprising things in there.

My guess:

SAS recodes the latin to UTF-16 for var1, and keeps UTF-8 (which may need more than 2 bytes per character) for var2.

 

Shakir_Juolay
Obsidian | Level 7

This is what I did

data test_char_length;
    set d_dmin.test_char_length;
    put var1 $10.;
    put var1 $hex20.;
    put var2 $15.;
    put var2 $hex30.;
run;

This is the log

15         data test_char_length;
16             set d_dmin.test_char_length;
17             put var1 $10.;
18             put var1 $hex20.;
19             put var2 $15.;
20             put var2 $hex30.;
21         run;

home      
686F6D65202020202020
home           
686F6D652020202020202020202020
NOTE: There were 1 observations read from the data set D_DMIN.test_char_length.
NOTE: The data set WORK.TEST_CHAR_LENGTH has 1 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.16 seconds
      cpu time            0.03 seconds

Also this post https://community.teradata.com/t5/Database/UTF-8-or-UTF-16/m-p/61680 says unicode in Teradata is always UTF-16.

 

Is it possible in SAS for one column to be LATIN and another one to be UTF8 or UTF16 in the same Table.

Shakir_Juolay
Obsidian | Level 7

Also can we somehow configure SAS and/or Teradata Client on the SAS server, so that for var1 the length in SAS is 5, since its LATIN and 1 byte per character is enough for LATIN.

Kurt_Bremser
Super User

So you have UTF-8 encoding all the way through.

 

Mind that SAS keeps encoding on a file/dataset and not a column level. You can see that in your proc contents output.

 

My final guess: SAS tries to "play it safe" with regards to defined lengths; if that does not work for you, you will have to override that, either by setting lengths explicitly, or by trying to work with dbsastype options.

Shakir_Juolay
Obsidian | Level 7

Dear Kurt,

 

I see what you mean by SAS tries to "play it safe" with regards to defined lengths. There was a gap in my understanding of UTF-8 and how many bytes are used by it for storage. When I read the Efficiency section of https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings I see that characters with code points with 8 bits i.e. U+0080 to U+00FF are stored in UTF-8 in 2 bytes and not one. And characters with code points with 16 bits are stored with 3 bytes.

 

Teradata LATIN CHAR(n) needs 8 bits per n hence in SAS UTF-8 the length becomes 2n, and Teradata UNICODE CHAR(n) needs 16 bits per n hence in SAS UTF-8 the length becomes 3n.

 

Now I store English characters in Teradata LATIN which only need 7 bits so I can use dbsastype to change Teradata LATIN CHAR(n) to SAS UTF-8 length n.

 

I use Teradata UNICODE only for Arabic characters, which fit in UTF-8 with 16 bits so I can use dbsastype to change Teradata UNICODE CHAR(n) to SAS UTF-8 length 2n.

 

It appears using dbsastype, I will have to do column by column, is there a universal SAS OPTION to do all columns together? Else I write a SAS macro leveraging Teradata dictionary under DB DBC.* and dbsastype.

 

Thank You Kurt.

Tom
Super User Tom
Super User

There are settings that do help tell SAS how to automatically adjust lengths to allow room for transcoding of characters.

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=acreldb&docsetTarget=n1n4...

 

But if you want a specific behavior then you will need to code your own logic.

 

Note that just because your Teradata column is using LATIN1 it does not mean that your SAS variable can store the value in the same number of bytes.  In SAS you cannot have one variable using one encoding and another using a different encoding. So if you want to read your UTF8 data then SAS will need to use UTF8 for your dataset. That means that any non-standard (non 7bit) ASCII codes in your LATIN1 data will take more than one byte to store in the SAS variable.

Shakir_Juolay
Obsidian | Level 7

Dear Tom,

 

Thank you for sharing this option with me.

 

I will play around with it. I was wondering if this will apply to Teradata as the documentation says

Data source:Oracle

 

I agree with you on "any non-standard (non 7bit) ASCII codes in your LATIN1 data will take more than one byte to store in the SAS variable". At least for the columns that I create using LATIN I know the variable is ASCII and can fit in 7 bits.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 3826 views
  • 0 likes
  • 3 in conversation