
The Natural Language Processing (NLP) - Sentence Splitter SAS Studio Custom Step

Text data varies in structure, size and content. Some documents are brief (responses to pointed questions in a survey), while others are longer and address multiple areas of interest (for example, the review you post after spending the night at an inexpensive rat-infested motel with a window overlooking the sea).

Natural Language Processing (NLP) applications tend to analyse documents in their entirety. In certain cases, however, analysis at a more granular level, such as the sentence or paragraph level, may be worth considering because:

1. In some situations, it improves the quality of analysis by localising context, so that a result can be tied to a specific sentence rather than to the document as a whole.

2. In other situations, it may contribute towards more efficient processing by providing opportunities to remove noise and reduce the size of the individual payload processed against a set of rules.

A recent open-source contribution, the NLP - Sentence Splitter SAS Studio Custom Step, helps data scientists and data engineers split text observations into their constituent sentences while retaining identifiers and other metadata. It is an easy-to-use pre-processing component which can be applied before running NLP tasks.
This step is part of the SAS Studio Custom Steps GitHub repository, a collection of low-code components providing a productive and enjoyable developer experience. Custom steps provide a user interface for entering parameters, abstract away the underlying code, and enable code reuse across many analytics tasks and programs.



Link to the repository folder: NLP - Sentence Splitter
Link to the README: README



Access & Use


Let's start with how to access these steps from within a SAS Viya environment. One recommended approach is to follow the instructions to upload a selected custom step to SAS Viya. An alternative is to use the Git integration functionality already available in SAS Studio: clone the SAS Studio Custom Steps GitHub repository and copy the required custom steps to your SAS Content folders. Refer to this post for some useful tips.


NLP - Sentence Splitter - Short Video.gif


Using this Custom Step is easy. As per the README, you need:


1. A table loaded to SAS Cloud Analytic Services (CAS)

2. A column containing text (which may comprise one or more sentences)

3. A column to be used as a document ID


After you connect the input table, assign the required parameters and specify an output table, you will notice the following:

1. The output table will now contain a column called _match_text_.  This column contains sentences as separate observations.
2. Each sentence is identified by an ID variable, the _sentence_id_. Note that this sentence ID is assigned within the context of the original document, i.e. Doc 1 will contain sentence IDs of 1, 2, 3, ..., n (where n is the number of sentences within it) and Doc 2 will likewise contain sentence IDs of 1, 2, 3, ..., k.
3. To ensure a unique ID for each combination of sentence and document ID, a new ID column called Obs_ID concatenates the Document ID with the Sentence ID (with some padding applied).
4. Individual offsets (_start_ and _end_) denoting where sentences start and end are also provided.
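
To make the shape of this output concrete, here is a minimal Python sketch that mimics the four points above. It is not the custom step's actual implementation: the regex-based splitting rule, the `doc_id` column name, the underscore separator and the five-digit padding in `Obs_ID`, and the offset convention (0-based start, exclusive end) are all assumptions made for illustration, and the real step may differ on each of them.

```python
import re

# Naive rule (an assumption, not the step's logic): a sentence ends at
# '.', '!' or '?' followed by whitespace or the end of the text.
_SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|\Z)|\Z)", flags=re.S)


def split_sentences(doc_id, text, pad=5):
    """Return one row per sentence, keeping the document ID alongside it."""
    rows = []
    for sentence_id, match in enumerate(_SENTENCE.finditer(text), start=1):
        rows.append({
            "doc_id": doc_id,                    # hypothetical column name
            "_sentence_id_": sentence_id,        # restarts at 1 for each document
            # Hypothetical padding scheme: document ID + zero-padded sentence ID.
            "Obs_ID": f"{doc_id}_{sentence_id:0{pad}d}",
            "_match_text_": match.group(),
            "_start_": match.start(),            # 0-based offset (assumption)
            "_end_": match.end(),                # exclusive end offset (assumption)
        })
    return rows


docs = {1: "Nice facade. Alas! Room was tiny.", 2: "Food was middling."}
table = [row for doc_id, text in docs.items() for row in split_sentences(doc_id, text)]

for row in table:
    print(row)
```

Because the sentence ID restarts for each document, it is the combined `Obs_ID` that stays unique across the whole table, which is the role the step's Obs_ID column plays.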

An Application

Let's now consider a quick application showing the benefit of sentence-level analysis: feature-level sentiment! Consider this delightful review of an establishment I recently had the honour of staying at (with their name changed since they don't like fame, naturally):


"I felt wonderful when I noticed the pretty facade of the Jolly Rodent motel. The kindly lady at the counter was gracious enough to let us know we were early, but she would still make a room ready in five minutes. Passing through the courtyard, I admired the ornate water fountain in the middle of a floral garden. Alas! The scene proves completely different when I took in our room. Not that there was much to take in, it seemed to end right where it began! Gritting my teeth and reassuring myself that a bathroom door located next to the cooking range was a novelty, not an hazard, I tried to freshen up. However, the moldy nature of the bathroom and the rust stains on the basin freaked me out. The restaurant food was middling and insipid, and the rude service of the waiters made us wish we were out of there as soon as possible."
Note that I've colour-coded the passage above according to differing levels of sentiment (green indicates positive sentiment, red indicates negative, and brown is somewhat negative to neutral). Read as a whole, the review seems to express a generally negative sentiment. But this isn't really fair to the proprietors of the Jolly Rodent, who should be complimented for their beautiful building, front desk service and water fountain. This is an ideal opportunity for the Sentence Splitter to uncover insights that might otherwise have been obscured. Another benefit is that working with individual sentences may improve efficiency by allowing developers to weed out noise prior to analytics. Finally, sentence-level analysis supports feature-level sentiment analysis and allows a categorisation or inference to be extracted from a smaller, localised context.
Having run the Sentence Splitter on a table containing the above text, we can then apply an NLP task (such as sentiment analysis) on individual sentences and obtain the following:
[Screenshot: sentence-level results for the review above]
Notice how the Sentence Splitter has now enabled individual (positive) sentences to stand out on their own from within an overall negative review? This use case can be extended and modified for feature-sentiment association, categorisation, tracking the progression of sentiment, and other applications.
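
The effect can be illustrated with a toy Python sketch. It does not use SAS sentiment analysis: the tiny word lexicon and the `score` and `label` helpers are hypothetical, chosen only to show how a single document-level score can hide positive sentences that a sentence-level pass brings out.

```python
import re

# Hypothetical toy lexicon (for illustration only, not a real sentiment model).
LEXICON = {
    "wonderful": 1, "pretty": 1, "admired": 1,
    "moldy": -1, "rust": -1, "freaked": -1, "insipid": -1, "rude": -1,
}


def score(text):
    """Sum the lexicon scores of the words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def label(value):
    """Map a numeric score to a coarse sentiment label."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"


# A few sentences from the review, as they might appear after splitting.
sentences = [
    "I felt wonderful when I noticed the pretty facade of the Jolly Rodent motel.",
    "Passing through the courtyard, I admired the ornate water fountain in the middle of a floral garden.",
    "However, the moldy nature of the bathroom and the rust stains on the basin freaked me out.",
    "The restaurant food was middling and insipid, and the rude service of the waiters made us wish we were out of there as soon as possible.",
]

# Document-level view: one score for the whole text.
document_label = label(score(" ".join(sentences)))

# Sentence-level view: one score per sentence.
sentence_labels = [label(score(sentence)) for sentence in sentences]

print("document:", document_label)
for sentence_label, sentence in zip(sentence_labels, sentences):
    print(f"{sentence_label:>8}: {sentence[:50]}...")
```

In this sketch the document as a whole scores negative, while the first two sentences score positive on their own. That is the kind of distinction sentence-level output makes available to downstream analysis.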
Have fun using the NLP - Sentence Splitter step.  Feel free to get in touch in case of any questions or comments.  