
The Natural Language Processing (NLP) - Sentence Splitter SAS Studio Custom Step

Started 11-07-2023
Modified 11-08-2023
Text data varies in structure, size and content.  Some documents are short and to the point (responses to pointed questions in a survey), while others are longer and address multiple areas of interest (for example, the review you post after spending the night at an inexpensive rat-infested motel with a window overlooking the sea).

Natural Language Processing (NLP) applications tend to analyse documents in their entirety.  In certain cases, however, analysis at a more granular level, such as at the sentence or paragraph-level, may be considered because:

1. It can improve the quality of analysis, because each sentence or paragraph carries its own localised context.

2. It can make processing more efficient, because noise can be removed and each payload processed against a set of rules becomes smaller.

A recent open-source contribution, the NLP - Sentence Splitter SAS Studio Custom Step, helps data scientists and data engineers split text observations into their constituent sentences while retaining identifiers and other metadata.  It is an easy-to-use pre-processing component which can be applied prior to running NLP.
 
This is part of the SAS Studio Custom Steps GitHub repository, a collection of low-code components providing a productive and enjoyable developer experience.  These steps provide a user interface for entering parameters, abstract away the underlying code, and enable code reuse across many analytics tasks and programs.

 

 

Link to the repository folder: NLP - Sentence Splitter
Link to the README: README

 

 

Access & Use

 

We start with the question of how to access these steps from within a SAS Viya environment.  One recommendation is to follow the instructions to upload a selected custom step to SAS Viya.  An alternative is to use the Git integration functionality already available in SAS Studio: clone the SAS Studio Custom Steps GitHub repository and copy the required custom steps to your SAS Content folders.  Refer to this post for some useful tips.

 

NLP - Sentence Splitter - Short Video.gif

 

Using this Custom Step is easy.  According to the README, you need:

 

1. A table loaded to SAS Cloud Analytic Services (CAS)

2. A column containing text (which may comprise one or more sentences)

3. A column to be used as a document ID

 

After connecting the input table, assigning the required parameters and specifying an output table, you will notice the following:


1. The output table will now contain a column called _match_text_.  This column contains sentences as separate observations.
 
2. Each sentence is identified by an ID variable, the _sentence_id_.  Note that this sentence ID is only unique within its original document, i.e. Doc 1 will contain sentence IDs 1, 2, 3, ..., n (for the n sentences within it) and Doc 2 will likewise contain sentence IDs 1, 2, 3, ..., k.
 
3. To ensure a unique ID for each combination of sentence and document ID, a new ID column called Obs_ID concatenates the Document ID with the Sentence ID (with some padding applied).
 
4. Individual offsets (_start_ and _end_) denoting where sentences start and end are also provided.
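
To make the output structure concrete, here is a minimal Python sketch that mimics the columns described above.  It is not the custom step's implementation: the regular-expression splitter is only a stand-in, the Obs_ID format (underscore separator, four-digit zero padding) is an assumption, and the offsets follow Python's convention (0-based start, exclusive end), which may differ from what the step reports.

```python
import re

# Stand-in splitter: a sentence runs up to '.', '!' or '?' followed by
# whitespace or the end of the text. The custom step's own logic may differ.
SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.S)


def split_documents(rows, pad=4):
    """rows: iterable of (doc_id, text). Returns one dict per sentence."""
    out = []
    for doc_id, text in rows:
        # Sentence IDs restart at 1 for every document.
        for sentence_id, m in enumerate(SENTENCE.finditer(text), start=1):
            out.append({
                "doc_id": doc_id,
                "_sentence_id_": sentence_id,
                # Hypothetical format: document ID + zero-padded sentence ID.
                "Obs_ID": f"{doc_id}_{sentence_id:0{pad}d}",
                "_start_": m.start(),  # 0-based offset (Python convention)
                "_end_": m.end(),      # exclusive end (Python convention)
                "_match_text_": m.group(),
            })
    return out


docs = [
    (1, "Alas! The room was tiny. It ended where it began!"),
    (2, "The food was insipid. The waiters were rude."),
]
table = split_documents(docs)
for row in table:
    print(row["Obs_ID"], row["_start_"], row["_end_"], row["_match_text_"])
```

Because the sentence ID restarts for every document, only the combined Obs_ID is unique across the whole table, which is why the step provides it.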

An Application


Let's now consider a quick application showing the benefit of sentence-level analysis: feature-level sentiment!  Consider this delightful review of an establishment I recently had the honour of staying at (with their name changed since they don't like fame, naturally):

 

"I felt wonderful when I noticed the pretty facade of the Jolly Rodent motel. The kindly lady at the counter was gracious enough to let us know we were early, but she would still make a room ready in five minutes. Passing through the courtyard, I admired the ornate water fountain in the middle of a floral garden. Alas! The scene proves completely different when I took in our room. Not that there was much to take in, it seemed to end right where it began! Gritting my teeth and reassuring myself that a bathroom door located next to the cooking range was a novelty, not an hazard, I tried to freshen up. However, the moldy nature of the bathroom and the rust stains on the basin freaked me out. The restaurant food was middling and insipid, and the rude service of the waiters made us wish we were out of there as soon as possible."
 
Note that I've colour-coded the passage above as per differing levels of sentiment (green indicates positive sentiment, red indicates negative, and the brown is somewhat negative to neutral).  Read as a whole, the review seems to express a negative sentiment.  But this isn't really fair to the proprietors of the Jolly Rodent, who should be complimented for their beautiful building, front desk service and water fountain.  This is an ideal opportunity for the Sentence Splitter to uncover insights that might otherwise have been obscured.  Another benefit is efficiency: working with individual sentences lets developers weed out noise prior to analytics.  Finally, sentence-level analysis supports feature-level sentiment analysis and allows a categorisation or inference to be extracted from a smaller, localised context.
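
As a rough illustration of weeding out noise at the sentence level, the Python sketch below keeps only those sentences that mention a feature of interest.  The feature vocabulary is invented for this example, the sentence IDs are simply sequential within the excerpt, and plain keyword matching is a deliberately crude stand-in for a proper set of rules.

```python
import re

# Hypothetical feature vocabulary; a real project would define its own rules.
FEATURES = {
    "room": {"room", "bathroom", "basin"},
    "food": {"restaurant", "food"},
    "service": {"lady", "counter", "waiters", "service"},
}


def tag_features(sentences):
    """sentences: list of (sentence_id, text). Returns {feature: [ids]}."""
    tagged = {name: [] for name in FEATURES}
    for sentence_id, text in sentences:
        words = set(re.findall(r"[a-z]+", text.lower()))
        for name, keywords in FEATURES.items():
            if words & keywords:  # at least one keyword is present
                tagged[name].append(sentence_id)
    return tagged


# An excerpt of the review, already split into sentences.
sentences = [
    (1, "I felt wonderful when I noticed the pretty facade of the Jolly Rodent motel."),
    (2, "The kindly lady at the counter was gracious enough to let us know we were "
        "early, but she would still make a room ready in five minutes."),
    (3, "However, the moldy nature of the bathroom and the rust stains on the basin "
        "freaked me out."),
    (4, "The restaurant food was middling and insipid, and the rude service of the "
        "waiters made us wish we were out of there as soon as possible."),
]

tagged = tag_features(sentences)
# Sentences that mention no feature are treated as noise and dropped.
kept = sorted({sid for ids in tagged.values() for sid in ids})
print(tagged)
print(kept)
```

Only the sentences in `kept` would need to go forward to the more expensive analysis, and each one already carries the features it refers to.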
 
Having run the Sentence Splitter on a table containing the above text, we can then apply an NLP task (such as sentiment analysis) on individual sentences and obtain the following:
 
Screenshot 2023-11-07 at 17.11.15.png
 
Notice how the Sentence Splitter has now enabled individual (positive) sentences to stand out on their own from within an overall negative review?  This use case can be extended and modified for feature-sentiment association, categorisation, tracking the progression of sentiment, and other applications.
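
The same effect can be reproduced in a few lines of Python.  This is only a sketch: the word lists are a toy lexicon invented for this review, and counting words is a stand-in for a real sentiment model, not what produced the screenshot above.

```python
import re

# Toy lexicon, invented for illustration only; not a real sentiment model.
POSITIVE = {"wonderful", "pretty", "kindly", "gracious", "admired", "ornate"}
NEGATIVE = {"moldy", "rust", "freaked", "middling", "insipid", "rude"}


def score(text):
    """Number of positive words minus number of negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(value):
    """Map a numeric score to a sentiment label."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"


# An excerpt of the review, already split into sentences.
sentences = [
    "I felt wonderful when I noticed the pretty facade of the Jolly Rodent motel.",
    "Passing through the courtyard, I admired the ornate water fountain in the "
    "middle of a floral garden.",
    "However, the moldy nature of the bathroom and the rust stains on the basin "
    "freaked me out.",
    "The restaurant food was middling and insipid, and the rude service of the "
    "waiters made us wish we were out of there as soon as possible.",
]

document_label = label(score(" ".join(sentences)))
sentence_labels = [label(score(s)) for s in sentences]

print("document:", document_label)
for text, sentiment in zip(sentences, sentence_labels):
    print(f"{sentiment:>8}  {text[:50]}")
```

Scored as one document, the negative words outweigh the positive ones, yet the first two sentences come out positive once each sentence is scored on its own.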
 
Have fun using the NLP - Sentence Splitter step.  Feel free to get in touch in case of any questions or comments.  
 
