Content Assessment: SAS 9 Code Check for Internationalization
In SAS Viya, the default session encoding is UTF-8. UTF-8 is a good choice because it includes all of the characters needed to process data in multiple languages. In SAS 9.4, the default encoding was usually a legacy encoding that supported a single language, for example, Latin1/WLatin1 for English or Cyrillic/WCyrillic for Cyrillic-based languages. This encoding difference can cause problems for SAS programs that are migrated from SAS 9 to Viya, particularly for SAS 9.4 environments that use different or multiple languages. The December 2021 release of SAS 9 Content Assessment included a new application, SAS 9 Code Check for Internationalization. In this post, I will introduce that application and show how it can help identify and resolve issues that result from the change in the default encoding.
Character Encoding
Before we look at the application, let's consider internationalization and character encoding. Internationalization, or i18n, refers to the process of preparing software so that it can support local languages and cultural settings. Character encoding is a big part of this preparation. Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, so that they can be stored, transmitted, and transformed using digital computers.
The best definition of encoding I could find comes from TechTerms.com:
"While we view text documents as lines of text, computers actually see them as binary data or a series of ones and zeros. Therefore, the characters within a text document must be represented by numeric codes. In order to accomplish this, the text is saved using one of several types of character encoding."
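To make the idea of "numeric codes" concrete, here is a minimal sketch outside of SAS, written in Python purely for illustration. The sample characters and the helper name describe are my own choices, not something taken from the application or the SAS documentation. It prints the Unicode number assigned to each character and the bytes that UTF-8 uses to store it.

```python
import unicodedata

# Illustration only: how a character maps to a number (code point)
# and then to bytes in one particular encoding (UTF-8).
samples = ["A", "é", "Я"]

def describe(ch):
    """Return the Unicode code point and the UTF-8 bytes for one character."""
    return ord(ch), ch.encode("utf-8")

for ch in samples:
    code_point, utf8_bytes = describe(ch)
    print(f"{unicodedata.name(ch)}: code point U+{code_point:04X}, "
          f"UTF-8 bytes {utf8_bytes.hex(' ')}")
```

The code point is the same in every Unicode encoding; only the byte sequence depends on the encoding chosen.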
Encoding and SAS
With SAS Viya, the default encoding for SAS sessions changed to UTF-8. What is UTF-8? UTF-8 is a Unicode encoding. Unicode is the universal character encoding standard that includes the characters in most of the world’s writing systems and the rules for mapping those characters to a number. UTF-8 is a multibyte character set (MBCS) encoding that represents all of the characters available in Unicode. This is different from the single-byte character set (SBCS) encodings used in SAS 9.4, which represent each character in a single byte.
The 128 characters that make up the ASCII character set are each represented as one byte in UTF-8. Therefore, when the ASCII characters are converted to UTF-8, the size of those characters does not change. All of the other characters available in UTF-8 require 2, 3, or 4 bytes in memory. This includes many characters that are represented with a single byte of memory in the SBCS character encodings.
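The size difference is easy to check outside of SAS. The Python sketch below is an illustration under my own assumptions: the Python codecs latin-1, cp1251, and cp1252 stand in for typical single-byte session encodings, and the sample characters are arbitrary. It counts the bytes each character needs in a single-byte encoding and in UTF-8.

```python
# Illustration only: bytes needed per character in a single-byte
# encoding versus UTF-8. The codec names are Python's, chosen here
# as stand-ins for SBCS session encodings.
def byte_width(ch, encoding):
    """Number of bytes the character occupies in the given encoding."""
    return len(ch.encode(encoding))

pairs = [("A", "latin-1"), ("é", "latin-1"), ("Я", "cp1251"), ("€", "cp1252")]

for ch, sbcs in pairs:
    print(f"U+{ord(ch):04X}: {byte_width(ch, sbcs)} byte in {sbcs}, "
          f"{byte_width(ch, 'utf-8')} byte(s) in UTF-8")
```

Only the ASCII letter keeps its size. The accented and Cyrillic letters grow to 2 bytes and the euro sign to 3, which is why a value that fit its column in an SBCS session may no longer fit in UTF-8.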
The following example, from SAS Help Center: Internationalization Compatibility for SAS String Functions, shows what can happen in your code when you switch from a SAS session that uses an SBCS encoding to one that uses UTF-8, which is an MBCS encoding, and how you can fix it.
If your SAS session in SAS 9.4 used an SBCS encoding (which is likely), then the transition to UTF-8 in Viya can result in this and other common problems in SAS programs that process data. That is where the new SAS 9 Code Check for Internationalization application in Content Assessment comes in. This application is built to help you identify these problems in your SAS code and ease the process of migrating the code to run on SAS Viya.
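The kind of problem described above can be simulated outside of SAS. The Python sketch below is not SAS output. It assumes a substring operation that counts positions in bytes, which is the behavior this post attributes to the byte-oriented string functions, and compares it with one that counts characters. The function names byte_substr and char_substr and the sample value are hypothetical.

```python
def byte_substr(text, start, length, encoding):
    """Mimic a byte-oriented substring: 1-based start, length in bytes."""
    raw = text.encode(encoding)
    return raw[start - 1:start - 1 + length]

def char_substr(text, start, length):
    """Character-oriented substring: 1-based start, length in characters."""
    return text[start - 1:start - 1 + length]

city = "Zürich"

# In a single-byte encoding, 2 bytes are exactly 2 characters.
print(byte_substr(city, 1, 2, "latin-1"))
# In UTF-8, the same request stops in the middle of the 2-byte "ü".
print(byte_substr(city, 1, 2, "utf-8"))
# Counting characters returns both characters intact (3 bytes in UTF-8).
print(char_substr(city, 1, 2).encode("utf-8"))
```

The second result ends with an incomplete byte sequence, so it can no longer be decoded as valid UTF-8 text. That is the type of silent corruption the code check is meant to help you find before migration.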
SAS 9 Code Check for Internationalization
The SAS 9 Code Check for Internationalization application discovers known internationalization issues in SAS programs. The application searches SAS programs for patterns that have been identified as not i18n-compatible in SAS Viya, including:
- embedded strings
- concatenated text strings
- locale-sensitive SAS formats/informats
- SAS string functions
- SAS string macros
- column pointer
Most of these issues relate to cases where the software assumes that one character is always represented by one byte in memory, as is the case in an SBCS encoding. When this is not the case, as in UTF-8, code that manipulates strings or uses SAS formats can produce data truncation and unexpected results. For a detailed examination of the issues around encoding, I encourage you to check out this excellent paper: SAS® and UTF-8: Ultimately the Finest. Your Data and Applications Will Thank You!
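To give a feel for what a pattern search of this kind involves, here is a deliberately small Python sketch. It is not the application's implementation, and the real tool checks far more patterns. The four function names and their K counterparts (KSUBSTR, KSCAN, KLENGTH, KINDEX) are SAS functions, but the mapping table, the regular expression, and the sample program are my own assumptions for illustration.

```python
import re

# Hypothetical, partial mapping used only for this illustration.
K_EQUIVALENTS = {
    "substr": "ksubstr",
    "scan": "kscan",
    "length": "klength",
    "index": "kindex",
}

# Match a function call such as "substr(" but not "ksubstr(" or the
# macro form "%substr(" (macro functions are left out of this sketch).
CALL_PATTERN = re.compile(
    r"(?<![\w%])(" + "|".join(K_EQUIVALENTS) + r")\s*\(",
    re.IGNORECASE,
)

def flag_byte_functions(sas_code):
    """Return (line number, function, suggested K function) per call found."""
    findings = []
    for line_no, line in enumerate(sas_code.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            name = match.group(1).lower()
            findings.append((line_no, name, K_EQUIVALENTS[name]))
    return findings

sample = """data work.names;
  set work.customers;
  initial = substr(name, 1, 1);
  first = kscan(name, 1, ' ');
  n = length(name);
run;"""

for finding in flag_byte_functions(sample):
    print(finding)
```

A text search like this cannot tell whether a flagged call is actually a problem, for example when the data is known to be pure ASCII, so findings still need a human review.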
Running the Application
Content Assessment is available for download from support.sas.com. The application can be run using the same steps as other Content Assessment applications:
- install and configure Content Assessment
- execute application
- publish assessed content
- load data and reports to Viya
- view reports
In this post, I am going to skip the steps that install and configure Content Assessment. The details are available in the documentation.
To run the SAS 9 Code Check for Internationalization, you use the i18nCodeCheck executable. The following command runs the application against the Base SAS code in the d:\workshop\gelcorp directory. The application looks for SAS programs in any sub-directory under the directory referenced in the --source-location parameter. The --scan-tag value helps identify the results. You should use a unique scan tag for each set of directories you process.
i18nCodeCheck.exe --scan-tag basecode --source-location "d:\\workshop\\gelcorp"
It's worth noting that the Code Check for Internationalization has an additional feature that is not common to other Content Assessment applications. As the log notes, in addition to the Visual Analytics reports, it outputs an HTML report on the file system.
The next step for Content Assessment applications is to publish the content. The publish step aggregates the data and makes it available for reporting.
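If you have several directory trees to scan, the advice about unique scan tags could be scripted. The Python sketch below only builds the command lines and does not run anything. It uses just the two parameters shown in this post, and deriving the tag from the last folder name is my own convention, not a requirement of the tool. The second directory is a made-up example.

```python
from pathlib import PureWindowsPath

def build_scan_commands(source_dirs, executable="i18nCodeCheck.exe"):
    """Return one argument list per directory, each with a unique scan tag."""
    commands = []
    seen_tags = set()
    for directory in source_dirs:
        # Assumption: use the last folder name, lower-cased, as the scan tag.
        tag = PureWindowsPath(directory).name.lower()
        if tag in seen_tags:
            raise ValueError(f"scan tag {tag!r} would be used twice")
        seen_tags.add(tag)
        commands.append(
            [executable, "--scan-tag", tag, "--source-location", directory]
        )
    return commands

# The second path is hypothetical.
dirs = [r"d:\workshop\gelcorp", r"d:\workshop\finance"]
for command in build_scan_commands(dirs):
    print(" ".join(command))
```

On a machine where Content Assessment is installed, each argument list could be passed to subprocess.run from the directory that contains the executable.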
publishAssessedContent.exe --datamart-type i18n
Two final steps are required, and they are common to all Content Assessment applications. First, load the data to the Public CAS library. You can perform this step via the SAS Environment Manager UI or, as shown below, using the cas plugin for the sas-viya command-line interface.
sas-viya.exe cas tables import sas7bdat --replace --force --source-file D:\\workshop\\SAS9ContentAssessment\\assessment\\datamart\\i18n\\i18n_results.sas7bdat
Second, import the relevant report to SAS Viya. You can perform this step via the SAS Environment Manager UI or, as shown below, using the transfer plugin for the sas-viya command-line interface.
sas-viya.exe --output json transfer upload -f D:\\workshop\\SAS9ContentAssessment\\assessment\\packages\\i18n\\SAS_9_Code_Check_for_I18N.json
# using id output from the previous step
sas-viya.exe --output text transfer import --id "23f35d48-758c-42b6-a906-751c3f0adb48"
View the SAS Code Check for Internationalization Report
With the application run, the resulting data loaded to CAS, and the report imported, we can check out the results. In SAS Drive, open Public / SAS Content Assessment / Code Check / SAS 9 Code Check for Internationalization. The Overview page of the report summarizes the status of the code check for internationalization, displaying the number of programs compatible with UTF-8 encoding and the number of programs with issues.
It can be useful to review the About This Report page before getting deeper into the results. This page helps you understand the issues flagged, their implications, and what you might change to prevent problems when the code runs in a UTF-8 encoded SAS session in Viya. Another great resource that helps in understanding the issues and provides suggestions for actions is this Confluence Page.
The detailed findings for each SAS program scanned are on the Incompatible Internationalizations page. The table on the page has a line for each program that has problems identified. A program is flagged as incompatible if the code contains programming elements that have possible internationalization issues documented on the About This Report page. The table in the bottom left lists a count of the types of problems identified in the programs checked and can be used to subset the report.
For details on an individual program, double-click on that SAS program in the table. In this case, we are looking at the detailed results for one SAS program.
Sometimes incompatibilities are identified that need to be reviewed. For example, the details for this program show that we have some embedded strings, which may or may not be a problem. They also show that we have some string functions, specifically substr and scan. For the string functions, the recommendation is to change the code from the current functions, which assume that a single character is always one byte, to the corresponding K string functions, which make no assumption about the size of a single character. Lastly, the report shows we have some locale-sensitive formats.
As mentioned earlier, the Code Check for Internationalization application also generates a basic HTML report. This report has a page for each program with issues. The benefit of this report is that it does not require a Viya environment in order to surface the results. The report is stored in the Content Assessment results folder.
Summary
The Code Check for Internationalization application attempts to find known internationalization issues in SAS programs. The application is available with the 2020.2.2 release of SAS Viya. The Code Check for Internationalization application will be invaluable for checking SAS programs for any internationalization issues that may occur when they are migrated to SAS Viya and run under the default UTF-8 encoding.
Additional Resources
SAS Downloads: SAS 9 Content Assessment 2021.2.3
SAS National Language Support Reference Guide
SAS® and UTF-8: Ultimately the Finest. Your Data and Applications Will Thank You!
The SAS® Encoding Journey: A Byte at a Time