Juletip #21 - Check the lengths of CharVars for UTF8 migration

JussiV · Posted 12-21-2023 02:25 AM

When migrating from SAS 9 environment to SAS Viya we Nordic people have one extra worry to consider. As the supported and default character set is changed from Latin in SAS 9 to UTF-8 in SAS Viya we may need to change the lengths of the character variables in our datasets. The reason is that scandic letters are not stored with 1 byte / character in UTF-8 as they are in Latin character sets.
For example, the text "Börje bor i Åland!" requires 20 bytes in UTF-8 even though in Latin it fits in 18 bytes.
So, the lengths of the character variables may need to be increased when moving from SAS 9 to SAS Viya.
BUT, there is a lot of text fields in datasets that do not need to or they shouldn't be changed. For example, the IBAN country codes are all presented in such values that they use 1 byte / char both in Latin and UTF-8. So, the length of the variable field may work as the data quality check. In these cases it is not wise to start changing the lengths of char fields in SAS Viya.
And also, if text field contains varying lengths of characters, say customer address for instance, it is typical to use some extra bytes to store these values. So, in these cases it is not necessary either to increase the length of the text field when migrating from SAS 9 to SAS Viya.
I have created a SAS macro that scans all of the character variables for a given dataset and checks if the column length may need to be changed when moving from SAS 9 to SAS Viya. It utilizes the observation that all 126 first characters in both character set are similar and they are stored with only 1 byte. If all of the characters in the column consist only from these 126 first characters there is no need to increase the length of the variable when migrating to SAS Viya.

%macro UTF8_Conversion_issues(_DS_NAME_);
/* This SAS Macro scans all character variables in a dataset and checks if 
the lenght of those variables does not need to be changed when moving from
Latin to UTF-8. 
The logic is based on the observation that the first 127 characters in Latin
and UTF-8 character sets are exactly the same, and presented with only one 
byte in UTF-8, too. 
The Macro is planned to be executed in the Latin environment.
*/

/* Example usage:
%UTF8_CONVERSION_ISSUES(SASHELP.CARS);
*/

/* We extract all the character variables form the input dataset.
*/
proc contents 
   data = &_DS_NAME_.
   out = WORK._CONTENTS_(
      where =(Type eq 2))
   noprint;
run;

/* We create a numbered Macro -variable for each character variable. 
*/
data _null_;
   set WORK._CONTENTS_;
   N = left(put(_N_, best12.));
   call symput("_CHAR_VAR_" || N, compress(Name));
   call symput("_VAR_LENGTH_" || N, put(Length, best12.));
   call symput("_N_VARS_", N);
run;

/* Then we loop through each numbered Macro -variable. */
%do i = 1 %to (&_N_VARS_.);

/* We scan through each character of each character variable and check the 
ASCII code value for them. If that ASCII value is bigger then 127 it may need
two or more bytes to be presented in UTF-8. We classify these chracters as 
"Big Ascii"- values.*/
data WORK._SCAN_CHAR_VARS_;
   set &_DS_NAME_.(keep=&&_CHAR_VAR_&i.);
   Value_length = length(&&_CHAR_VAR_&i.);
   do i = 1 to Value_length;
      Curr_Character = substr(&&_CHAR_VAR_&i., i, 1);
      Curr_Ascii_Nro = rank(Curr_Character);
      Max_Ascii_Nro = max(Max_Ascii_Nro, Curr_Ascii_Nro);
      Big_Ascii_Flg = (Curr_Ascii_Nro gt 127);
      Big_Ascii_Cnt = sum(Big_Ascii_Cnt, Big_Ascii_Flg);
   end;
run;

/* Finally, we print the maximum values of the checked values. */
/* AND, we make an extremely conservative assumption that the maximum number
of "Big Ascii" characters would happen to be the lenghtiest character string.
If that value would not fit into the current variable, we warn that the 
length of the variable should to be changed. */
proc sql;
   select 
      "&&_CHAR_VAR_&i." as Variable,
      input("&&_VAR_LENGTH_&i.", best12.) as Var_Length label="Length of the Variable",
      max(Value_length) as Max_Val_Length label="Maximum Value Length",
      max(Max_Ascii_nro) as Max_Ascii_Code label="Maximum ASCII code Value",
      max(Big_Ascii_Cnt) as Big_Ascii_Cnt label="Big-ASCII code Count",
      max(Big_Ascii_Cnt) + max(Value_length) as Recom_Length label="Recommended Length",
      case 
         when ((max(Big_Ascii_Cnt) + max(Value_length)) gt &&_VAR_LENGTH_&i.) then "YES"
         else "NO"
      end as Change_Needed label="Change Needed"
   from WORK._SCAN_CHAR_VARS_;
quit;

%end;

/* Some housekeeping...*/
proc datasets lib= WORK nodetails nolist;
   delete _SCAN_CHAR_VARS_;
   delete _CONTENTS_;
run;

%mend;