
Profiling Text with SAS Viya


The increasing popularity of Artificial Intelligence (AI) and the consequent hype have led to a misconception that AI can learn on its own, and all we need to do is sit back and relax!  Nothing could be further from the truth, especially in Natural Language Processing (NLP), which deals largely with subjective text data.

 

What is possible is that data scientists and text analysts can make better-informed decisions about their AI approach.  For this, they need a thorough understanding of their text data's structure and quality, expressed through consumable insights.  Such insights can be woven into data preparation, widely acknowledged as one of the most time-consuming tasks in analytics and AI.

 

This article discusses a recent low-code analytics component which automates text profiling, providing data scientists and decision makers faster time to value!

 

This component is a SAS Studio Custom Step called Natural Language Processing (NLP) - Profile Text.  NLP - Profile Text is intended to be used within a data preparation pipeline, after text data has been ingested and resides in a table.  An example Visual Analytics report (included) also shows how to automatically review text profiles and decide upon downstream analysis.  Join us in a walkthrough of how to use this custom step.

 

[Screenshot: example text profile report in Visual Analytics]

Accessing the Custom Step

 

Download the custom step from this GitHub repository.  We recommend downloading the entire folder, as it contains other useful files such as instructions, screenshots and an example Visual Analytics report.  Indicative commands are given below.  Note that the location of this repository will change in the future, once the step is included in the main SAS Studio Custom Steps GitHub repository.

 

# Clone the repository
git clone https://github.com/SundareshSankaran/sas-studio-custom-steps.git

# Change to the relevant folder inside the clone
cd "sas-studio-custom-steps/Natural Language Processing (NLP) - Profile Text"

# List contents
ls -lh

 

[Diagram: general idea and flow of the custom step]

 

Preparing your Environment

 

You'll notice a file named "NLP - Profile Text.step".  Upload this to your SAS Viya 4 environment using instructions provided in the README (part of the repository).  

 

Let's also import the example Visual Analytics report.  This report shows some possible ways to visualise the custom step output.  Again, instructions are provided in the repository.  To access the transfer package containing the VA report:

 

# From the "Natural Language Processing (NLP) - Profile Text" directory
cd extras

# The json file listed below is the VA report transfer package
ls -lh

# Output:
# total 15472
# -rw-r--r--  1 ---  staff   7.5M 18 Nov 09:19 Instructions - Import VA report.mp4
# -rw-r--r--  1 ---  staff    60K 18 Nov 09:19 NLP - Text Profile Package.json
# -rw-r--r--  1 ---  staff   1.2K 18 Nov 09:19 README.md
# drwxr-xr-x  3 ---  staff    96B 18 Nov 09:19 img

 

Finally, let's arrange for some data.  You have two choices:

- Use the suggested NEWS example dataset; instructions to load it are provided here.

- If you're more adventurous and wish to use a dataset of your choice, please go ahead! Ensure that the data is loaded in CAS and has one column containing the text to be analysed, as well as a column with a unique ID.  A sketch of loading your own table into CAS follows.
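The following is a minimal sketch only, assuming a hypothetical WORK.MY_TEXT_DATA table and the Public caslib; substitute your own dataset, caslib and table names.

/* Connect to CAS and assign librefs to available caslibs */
cas mysess;
caslib _all_ assign;

proc casutil;
    /* Load the (hypothetical) table into CAS and promote it so it
       persists beyond the current session */
    load data=work.my_text_data outcaslib="public" casout="my_text_data" promote;
quit;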

 

 

Using the Step

 

The following video is a visual aid for running the Custom Step.

 

 

 

In brief, all you need to do is:

 

1. Open a new SAS Studio Flow.

2. Drag a SAS program onto the Flow and run the following to connect to CAS.

 

/* ss refers to a session name of your choice */
cas ss;

caslib _all_ assign;

 

3. Attach a table which refers to your input data.

4. Drag the NLP - Profile Text custom step onto the Flow and attach the input data table to it. Fill in the required input parameters; the input parameters are explained here.

5. Attach output tables to your Custom Step.  I strongly encourage you to use the exact table names shown here, so that the Visual Analytics report updates automatically.  If you provide different names, bear in mind that you will have to change the table references inside VA to point to your output tables.

 

And that's it!  Run the Flow and (if all goes well) you will soon see a green tick mark indicating that the step ran without errors.

 

 

Analysing the Results

 

If you like to examine raw tabular output, the tables are already available in the Flow for you to inspect.  However, most of us prefer to analyse through visualisation.  This Custom Step contains code which promotes its output so that it automatically appears in a Visual Analytics report (hence the earlier recommendation to stick to the suggested output table names).
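For the curious, the promotion the step performs is along these lines.  This is a hedged sketch: the table name below is an assumption, and the step takes care of this for you.

proc casutil;
    /* Promote a session-scoped output table so Visual Analytics can see it */
    promote casdata="TEXT_PROFILE_DOCUMENT_LEVEL" incaslib="public" outcaslib="public";
quit;

Here's how you analyse the results.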

 

1. Open Visual Analytics (from the main menu -> Explore and Visualize)

2. Browse to the following location and open the following report.

 

All Reports > Public > NLP - Text Profile > reports

Report Name: Text_Profile

 

You'll notice that the data has been updated after the Custom Step run.  A quick description of the pages:

 

Text Profile Report

This provides a quick summary.  The table analysed is named at the top right for reference.  At a glance, you obtain a high-level overview of the dataset, along with Natural Language Generation (NLG)-facilitated text at the bottom right.

 

For example, from the NEWS dataset, I notice some articles are pretty long: 394 sentences, compared with an average of 12 sentences per article.  Could this be an outlier?  Or could it be a quality issue, i.e. some sentences lack punctuation and are therefore counted as a single sentence?  As an analyst, I would want to go back to my data prep and correct this so that my documents are standardised.
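One quick check is the token-to-sentence ratio at the document level: a very high ratio suggests sentences running together due to missing punctuation, rather than a genuinely long article.  Below is a hedged sketch; the table and column names are assumptions, so point them at your own document-level output.

proc sql;
    select document_id,
           sentence_count,
           token_count,
           token_count / sentence_count as tokens_per_sentence
    from public.text_profile_document_level
    where calculated tokens_per_sentence > 100  /* arbitrary threshold */
    order by tokens_per_sentence desc;
quit;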

 

Tokens

Here, I find the custom step neatly arranges terms from my corpus into the following categories.

 


 

1. Stop-words: commonly used terms such as the, and, a, etc.

2. Numeric tokens: self-explanatory

3. Punctuation tokens: punctuation symbols, including sentence and clause separators

4. Content tokens: after removal of stop-words, numbers and punctuation, these represent terms which relate to the context of the dataset.  They indicate what the corpus is all about.

 

The tokens page helps me determine both my downstream analytics approach and quality corrections.  For example, understanding the content helps me plan the constituents of my taxonomy: I may require a category for sports or technology (going by some of the content words).  I can also spot existing patterns: the prominence of the word NOT, which is usually followed by a verb or an adjective, suggests I might extract information by writing concept rules around it.
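For instance, concept rules in SAS Visual Text Analytics' LITI syntax could capture the verb or adjective that follows NOT.  A minimal sketch, illustrative only:

# Capture the verb (:V) or adjective (:A) immediately following "not"
C_CONCEPT: not _c{:V}
C_CONCEPT: not _c{:A}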

 

 

A brief explanation of the other pages in this report:

 

Long Sentences and Short Sentences

Helps you look at the distribution of token lengths per sentence.  This way, you can determine any extremely long or short sentences, which could indicate outliers, quality issues, or even stylistic quirks (monosyllabic replies like yes/no/okay, for example).  You can even flag cases where the token length per sentence is above a threshold; I took the mean as the threshold, but you can change this to anything you prefer (a sketch of this idea follows the screenshot below).

 

[Screenshot: Long Sentences and Short Sentences page]
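As a hedged sketch of the alerting idea (the table and column names are assumptions; point them at your sentence-level output table):

proc sql noprint;
    /* Compute the mean token count per sentence to use as the threshold */
    select mean(token_count) into :mean_tokens
    from public.text_profile_sentence_level;
quit;

data work.long_sentence_alerts;
    /* Keep only sentences longer than the mean */
    set public.text_profile_sentence_level;
    if token_count > &mean_tokens then output;
run;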

 

Long Documents and Short Documents

Similar to the page above, but with the objective of identifying very long or very short documents (based on the number of sentences).  Additional questions explorable through this page: are outliers really outliers, or are there multiple document segments that need different treatment?  For example, a collection of very long documents (a lot of Tolstoys with their War and Peace compositions :)) may exist alongside a collection of very short documents (tweets and social media posts, perhaps?).  These two document types may warrant different approaches downstream.

 

Document-level Analysis

This view includes box plots which make it easier to identify outliers.

 

 

Further action

 

Now that you've understood your corpus better, what next?  Much of the earlier analysis concerns outliers, possible segmentation of the corpus, and preparation for complex NLP tasks (such as creating a categorisation taxonomy).  As these profiles are triggered within a SAS Studio Flow, it's very easy to extend the Flow with additional tasks such as filtering and outputting multiple tables.  For example, if we decide that long documents (say, 20+ sentences) are sufficiently different from shorter documents, we can filter on the "Document-level" output table and merge the segments back with the original table to create two final datasets.  These datasets can then feed separate "Long Documents" and "Short Documents" Visual Text Analytics (VTA) projects.
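A hedged sketch of such an extension follows; all table and column names are assumptions, so substitute your own input table and document-level output.

proc sql;
    /* Long documents: 20 or more sentences */
    create table public.long_documents as
    select a.*
    from public.news as a
         inner join public.text_profile_document_level as b
            on a.document_id = b.document_id
    where b.sentence_count >= 20;

    /* Short documents: everything else */
    create table public.short_documents as
    select a.*
    from public.news as a
         inner join public.text_profile_document_level as b
            on a.document_id = b.document_id
    where b.sentence_count < 20;
quit;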

 

Outliers - either very long/short sentences or very long/short documents - could be routed to exception queues (or just another table) where they can be examined and corrected.

 

 

In Summary

 

A better understanding of text data helps us perform relevant and more effective NLP.  Before plunging headlong into complex NLP tasks, we would all benefit from using the NLP - Profile Text Custom Step: it lets us carry out focussed, relevant and accurate analysis, and maximises our chances of achieving organisational objectives.

 

Have fun with the NLP - Profile Text Custom Step!  Drop in a comment or email me with any thoughts.
