Can documents written in different languages be analyzed using Visual Text Analytics?

What if you have to process documents written in different languages? The purpose of this post is to present some tips on using SAS Viya Visual Text Analytics to ensure you get reliable results.

Multilingual text appears in document collections in two main ways:

Some documents combine multiple languages.
Each document is written in a specific language.

The first category did not seem to pose a problem in a simple case where, for example, an English document includes the French phrase “c’est la vie” and the Spanish phrase “Muchas felicidades a los dos.” The Concepts node of my text analysis project detected these phrases. In general, though, the results might not be accurate when a single input document contains multiple languages.

Just to illustrate, here is a Visual Text Analytics Concepts node test of the two previous sentences using the predefined nlpNounGroup concept in an English language-based project. (See my post on creating Visual Text Analytics projects for background on how to use the interface.)

There are 5 matched noun groups spanning 3 languages highlighted in the Test Sample Text document. So, we see that Noun Groups for at least some languages are detected!

Now consider the case where each document is written in a specific language. Multiple languages may appear in document collections related to health care, physicians’ notes, survey responses, social media comments, world news, tourism, or insurance. Documents written in different languages add complexity to the analysis of the document collection because different languages use different linguistic structures. For example, certain languages follow S-O-V (subject-object-verb) word order while others use S-V-O or V-S-O order. (Imagine a Yoda voice saying “Hmm... away put your weapon.”)

Not all languages follow the structure of English. English has tenses and verb conjugations, but Chinese and Polynesian languages, for example, do not. Subject-verb agreement exists in English but not in Chinese. Processing a mix of languages together may therefore present challenges. If you have to analyze many documents in different languages, it makes sense to process each language separately for the best results.

If you are processing text data that has documents in multiple languages, SAS Viya Visual Text Analytics easily identifies languages present in the corpus and helps you create separate pipelines for each language in the collection. You use the Data node to select documents for your project. Starting with SAS Viya 2021.2.4, automatic language detection is available in the Data node!

I created a test document collection of about 90 documents by gathering several paragraphs in six different languages on topics such as opera houses, NLP research, and news articles. I put these ‘documents’ into a spreadsheet with one document per row and imported it into Viya.

There is a drop-down selection to activate automatic language detection in Model Studio. Click the "Detect Languages" button.

This launches a process that reads the corpus, identifies languages, and flags each document with a language code.

Notice that a new language identifier variable _language_ is added to the original data. A best practice is to set this new _language_ variable as a display variable to see it in subsequent nodes of your project.

By examining the data table, we can check that the detected language code is correctly assigned for each document.

Within a pipeline, you can choose one language from the list of identified languages to function as the primary language for your project. I do not recommend doing this unless your documents contain only occasional phrases in other languages.

By selecting the English language in this example, the entire multi-language document collection is processed from the perspective of English.

If I run this pipeline, the standard predefined NLP concepts (like noun groups) find matches for several languages, as shown in the first screen capture.

The same appears true initially for the Text Parsing and Topics nodes. A closer look at the system-generated topics for the English pipeline reveals that the discovered topics are composed of terms that would be considered "stop words" in their own languages. These topics were based only on the frequency of terms in the small test document collection. From the English perspective, such terms look no different from unfamiliar content words, such as industry-specific terminology or internal product categories, so an English stop list would not remove them.

Selecting the third topic from the top and displaying the matched documents reveals that although the topic did identify French documents, it was based on uninformative terms (de, la, des, les).

So, using English to process documents from multiple languages is clearly not a best practice!

Now let’s build a language-specific pipeline for just the French documents using the properties of the Data node.

The topics generated for the French language contain actual French words rather than the uninformative words detected in the previous English pipeline. A best practice is to create additional pipelines and subset the documents by language for further processing.

An alternate approach:

If you have to process multiple languages, you may want to first translate all documents to English and then run the NLP (Natural Language Processing). This approach requires that you pre-process the documents with one of the many available commercial translation products, but it may yield good overall results. (Make sure that you trust the accuracy of the translation, especially if the documents contain domain-specific terms and phrases.)

There is a CAS action that can programmatically add the language identifier to an output table. Use this output table to feed selected languages into your translation software. Then combine the translated results into a table for your text analytics project.

The following code can be run from SAS Studio. It creates an output table named out_language that contains a language identifier for each document.

proc cas;                                         /*1*/
   session casauto;

   builtins.loadActionSet /                       /*2*/
      actionSet="textManagement";
   run;

   textManagement.identifyLanguage /              /*3*/
      casOut={caslib="casuser",name="out_language", replace=TRUE}
      docId="row"
      table={name="query"}
      text="document";
   run;

   table.fetch /                                  /*4*/ 
      table={name="out_language"};
   run;

quit;

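Section 1 starts PROC CAS and directs the actions to the CAS session named casauto.

Section 2 loads the textManagement action set.

Section 3 runs the identifyLanguage action, which writes the ISO 639-1 language code for each document to the output table.

Section 4 fetches rows from the out_language table to display the language identification information.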
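As a quick programmatic check on the output of section 3, the detected codes can be tallied per language. The following is a minimal sketch, assuming the out_language table from the code above still exists in the casuser caslib and that the casauto session is active. It uses the freq action from the simple action set.

```sas
proc cas;
   session casauto;

   /* Assumption: out_language is the table written by identifyLanguage above */
   simple.freq /
      table={caslib="casuser", name="out_language"}
      inputs={"_language_"};   /* one frequency row per detected language code */
   run;

quit;
```

A language with only one or two documents, or a code you did not expect, is a prompt to inspect those rows before building per-language pipelines.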

The _language_ variable in the following table was created by section 3 (the identifyLanguage call) of the PROC CAS example:

More details are available in the product documentation.
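To hand one language to a translation tool or to a language-specific pipeline, the language codes can be joined back to the source documents. This is a sketch under assumptions, not the only way to do it: it assumes that the query and out_language tables both live in the casuser caslib, that out_language carries the document ID (row) alongside _language_, and that French is reported as fr. The libref mycas and the table name docs_fr are hypothetical.

```sas
/* Assumption: both tables are in the casuser caslib of session casauto */
libname mycas cas caslib="casuser" sessref=casauto;

data mycas.docs_fr;                       /* hypothetical output table name */
   /* Match each document to its detected language by the document ID */
   merge mycas.query mycas.out_language;
   by row;
   /* Keep French rows only; lowcase guards against case differences */
   if lowcase(_language_) = "fr";
run;
```

Changing the language code in the subsetting IF statement produces one table per language, each of which can then feed its own pipeline or translation step.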

I hope the automatic language detection capability proves useful for those of you working with multiple languages. Thanks for reading!

Find more articles from SAS Global Enablement and Learning here.
