The purpose of this post is to introduce text analytics to anyone new to the topic who wants to understand what it is and what it does. After reading this post you will have a better idea of what analysts are saying when you hear them talk about text analytics concepts. After all, who doesn’t want to participate intelligently in discussions about interesting analytic topics?
This post is suitable reading for Business Leaders, Analysts from other disciplines, Sales Executives, Systems Engineers and anyone with general interest in this topic. Subsequent posts in this series will describe and illustrate applications of text analysis. The concepts presented here provide a foundation that will help you get full value from those posts and from your software.
SAS® Visual Text Analytics analyzes unstructured data, such as text containing customer reviews, social media posts, survey responses, city planning documents, medical records, news feeds, research abstracts, tweets, and transcribed phone calls. Well, you get the idea!
Text is considered “unstructured” in the world of data processing since it doesn’t come in the structured “rows and columns” data format that computers process. It’s easy to analyze basic data in a spreadsheet, but hard to analyze annual reports of all companies in a particular industry sector. You can’t do advanced analytics of free-form text data on a computer without it being represented in a structured format first. This is one of the things that NLP (Natural Language Processing) does. Text represented in this structured format (one example: as a term/document frequency matrix) becomes a building block used by machines to analyze documents.
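To make the term/document frequency matrix idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept, not how SAS Visual Text Analytics builds its internal representation; the function names and the sample documents are made up for this example, and the tokenizer is deliberately simple.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[i] in docs[j]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts)) if counts else []
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix

# Hypothetical two-document collection, for illustration only.
docs = [
    "The delivery was late and the box was damaged.",
    "Great product, fast delivery.",
]
vocabulary, matrix = build_term_document_matrix(docs)
for term, row in zip(vocabulary, matrix):
    print(f"{term:10s} {row}")
```

Each row is a term and each column is a document, so free-form text ends up in exactly the kind of rows-and-columns layout that analytic procedures expect.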
Information retrieval, exploratory analysis, topic derivation, research, concept creation, categorization, and predictive modeling are some aspects of analyzing text.
You can build text models using your choice of a graphical interface or code. This post shows output of text models generated by the SAS® Model Studio graphical interface.
Let’s get started
First, the raw readable text data (including combinations of PDF document collections, spreadsheets, RTF text, etc.) needs to be imported into the software. Viya® streamlines the work of importing your documents from a single data source or from directories into the cloud for processing. Some of the data import options are shown here:
Following this, relevant information is extracted from the text documents by parsing and applying Natural Language Processing (NLP). To understand this idea, imagine that you want to parse this paragraph: what do you notice? It is made up of characters and punctuation marks separated by spaces, and we think of these groups of characters as words. In text analytics, groupings of characters between spaces are called ‘tokens’. It may also make sense to group several tokens together to represent a term that has specific meaning (e.g. body mass index, or northern lights). These are referred to as Noun Groups in the software.
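As a rough sketch of tokens and multi-word terms, the Python below splits a sentence on spaces and then merges token runs that match a known phrase. The phrase list is supplied by hand here purely for illustration; the software identifies Noun Groups with NLP rather than with a fixed lookup like this, and the function names are hypothetical.

```python
def split_tokens(text):
    """Split on whitespace, then strip surrounding punctuation from each token."""
    tokens = [piece.strip(".,;:!?()\"'") for piece in text.lower().split()]
    return [token for token in tokens if token]

def group_terms(tokens, phrases):
    """Merge runs of tokens that match a known multi-word phrase into one term."""
    # Try longer phrases first so the most specific match wins.
    phrase_tuples = sorted(
        (tuple(p.split()) for p in phrases if p.split()), key=len, reverse=True
    )
    terms, i = [], 0
    while i < len(tokens):
        for phrase in phrase_tuples:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                terms.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase starts here, so keep the single token.
            terms.append(tokens[i])
            i += 1
    return terms

tokens = split_tokens("Her body mass index was normal.")
print(tokens)
print(group_terms(tokens, ["body mass index", "northern lights"]))
```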
Singular and plural versions of nouns and verbs are grouped together as synonyms along with any misspelled words if that option is enabled.
In text analytics, after tokens are extracted, more sophisticated NLP techniques continue to build the analysis and extract terms, concepts, entities and atomic facts that will be discussed later. Not every word found in the document collection should be included in the analysis. For example, if a word occurs in every document, it provides no useful insight into the document collection and should be dropped. Examples of these kinds of common words are “and”, “the”, “a” and “some”, but the words that are dropped will differ depending on each document collection. If a word occurs in only one document, it should not be used for analysis, but it can still be found in a keyword search of documents. You can add terms to a stop list, which is used to exclude uninformative terms from analysis.
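The filtering described above can be sketched in a few lines of Python. This is an assumption-laden toy, not the software's logic: it drops terms found in every document, terms found in only one document, and anything on a user-supplied stop list. The function names and sample documents are hypothetical.

```python
import re

def document_frequency(docs):
    """Map each term to the number of documents that contain it at least once."""
    df = {}
    for doc in docs:
        for term in set(re.findall(r"[a-z']+", doc.lower())):
            df[term] = df.get(term, 0) + 1
    return df

def keep_terms(docs, stop_list=()):
    """Keep terms that appear in more than one document, but not in all of them."""
    df = document_frequency(docs)
    n_docs = len(docs)
    return sorted(
        term for term, count in df.items()
        if 1 < count < n_docs and term not in stop_list
    )

# Hypothetical maintenance notes, for illustration only.
docs = ["the pump failed", "the pump leaked oil", "the valve leaked"]
print(keep_terms(docs))
print(keep_terms(docs, stop_list={"pump"}))
```

Here “the” is dropped because it appears everywhere, while “failed”, “oil” and “valve” are dropped because each appears in a single document.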
Some SAS® Visual Text Analytics capabilities and ideas to be aware of are:
Next, we’ll see some examples of results commonly generated by the text analysis application.
The bar chart above shows document frequency for the text parsing results for a collection of documents processed by the Text Parsing node. It shows the Natural Language concepts, parts of speech, and the number of terms kept or dropped from analysis.
Text Analysis results are focused and apply to a specific document collection. This is different from running a general web or ChatGPT search, where the content is not always accurate or relevant to the subject matter being explored.
The term map above is from the Text Parsing node. It shows the relationship among relevant terms that are connected to the central term you want to explore (in this case the term selected for the map is “medication”).
Information Gain and document frequency are shown. Information Gain is the additional information obtained by adding a conjoined term in the term map to a current rule. The size of the term node indicates the relative number of documents that include that combination of terms. The darker the term node, the more reliable the rule is for predicting that the term of interest will appear in a document. The relative line thickness indicates the strength of the association between terms.
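For readers who want a feel for the number behind the map, here is a sketch of the textbook definition of information gain: how much knowing that one term is present reduces uncertainty about whether the term of interest is present. This is an assumption about the general idea only; the exact rule-based calculation the software uses may differ. Documents are represented as hypothetical sets of terms.

```python
from math import log2

def entropy(positives, total):
    """Binary entropy, in bits, of a yes/no outcome seen `positives` times in `total`."""
    if total == 0 or positives in (0, total):
        return 0.0
    p = positives / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(docs, target, candidate):
    """Entropy of 'target present' minus its entropy after splitting on 'candidate present'."""
    n = len(docs)
    if n == 0:
        return 0.0
    with_candidate = [d for d in docs if candidate in d]
    without_candidate = [d for d in docs if candidate not in d]
    base = entropy(sum(target in d for d in docs), n)
    conditional = sum(
        len(group) / n * entropy(sum(target in d for d in group), len(group))
        for group in (with_candidate, without_candidate)
    )
    return base - conditional

# Hypothetical documents, each reduced to its set of terms.
docs = [
    {"medication", "dose"},
    {"medication", "dose"},
    {"appointment"},
    {"appointment", "dose"},
]
print(information_gain(docs, target="medication", candidate="dose"))
```

A value near zero means the candidate term says little about the term of interest; larger values mean it is more informative.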
Clicking on a term in the map displays the relevant documents containing the term on the right of the screen.
The chart above shows the number of documents and Sentiment for each topic in this document collection. It looks like there is a high proportion of Negative sentiment in these patients’ comments on the effectiveness of their prescription drugs.
System-generated and user-generated topics can also be created. The topics shown below are from a collection of City Planning PDF documents for several North Carolina cities that I decided to explore. I could select the matching documents for a topic to retrieve only those city plans that I’m interested in exploring further.
You can also create your own custom concepts to identify specific structures in your documents. For example, the terminology found in doctors’ reports on patients is probably different from the phrases customers use in product complaints on social media. Document collections have their own characteristics that would not likely occur in the same way in a different document collection.
Custom concepts return selected text that is meaningful in specific contexts. For example, searching for the single word ‘bad’ may be interpreted out of context. It has a different meaning if the word ‘not’ precedes it, or even if ‘not’ occurs within a certain distance of ‘bad’. A custom concept can easily be built to take this context into account. Our SAS Visual Text Analytics training class gets into a lot more detail on creating custom concepts! You can find more information about this class here: SAS Training in the United States -- SAS® Visual Text Analytics in SAS® Viya®
Custom user concepts are written using the LITI (Language Interpretation for Textual Information) language. It is a powerful way to flex your text analytic muscles and fine tune your analysis. Look for more about the LITI language in future posts and in our SAS Visual Text Analytics class.
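The ‘not’ near ‘bad’ example can be sketched in plain Python to show what “taking context into account” means. To be clear, this is not LITI syntax and not how the software evaluates concepts; the function name, the default window of three tokens, and the sample sentences are all assumptions made for illustration.

```python
def negated_mentions(tokens, target="bad", negator="not", max_distance=3):
    """Return (index, is_negated) for each occurrence of `target`.

    An occurrence counts as negated when `negator` appears among the
    `max_distance` tokens immediately before it.
    """
    results = []
    for i, token in enumerate(tokens):
        if token == target:
            window = tokens[max(0, i - max_distance):i]
            results.append((i, negator in window))
    return results

print(negated_mentions("the service was not really bad".split()))
print(negated_mentions("the service was bad".split()))
```

The first sentence is flagged as negated because ‘not’ falls inside the window before ‘bad’, while the second is not.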
The Categories node creates Boolean models, another widely used tool in text analytics. They identify documents that fit into specific categories. The categories can be built using existing categorical input variables in your data source, generated topics, and even special category rules you can write. Score code generated from the Categories node can be used on new documents to automatically place them into a category. Once again, this topic is covered in our training class.
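To illustrate what a Boolean category model does, here is a small Python sketch that assigns documents to categories using all/any/none term rules. The rule format, category names, and terms are hypothetical and are not the rule syntax or score code that the Categories node produces.

```python
import re

# Hypothetical rules: terms that must all appear, terms of which at least
# one must appear, and terms that must not appear.
RULES = {
    "billing": {
        "all": set(),
        "any": {"invoice", "refund", "charge"},
        "none": set(),
    },
    "side_effects": {
        "all": {"medication"},
        "any": {"nausea", "headache"},
        "none": {"placebo"},
    },
}

def categorize(text, rules=RULES):
    """Return the names of every category whose Boolean rule the text satisfies."""
    terms = set(re.findall(r"[a-z']+", text.lower()))
    matched = []
    for name, rule in rules.items():
        has_all = rule["all"] <= terms
        has_any = not rule["any"] or bool(rule["any"] & terms)
        has_none = not (rule["none"] & terms)
        if has_all and has_any and has_none:
            matched.append(name)
    return matched

print(categorize("The medication gave me a headache."))
print(categorize("Please refund the charge on my invoice."))
```

Once rules like these are fixed, running them against a new document is the toy equivalent of scoring it.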
After you are satisfied with your Text Analytics results, they can be used to automatically assign (or score) new documents.
Model score code created by the concepts, sentiment, topics and categories nodes in a SAS Model Studio Visual Text Analytics pipeline can be run against new documents to make predictions in SAS Studio or in batch code. Here are some common useful applications of a completed text analytics project:
Now that we’ve defined some terms and described some uses of text analysis, let's take a peek at a Text Analytics project that uses the graphical user interface in SAS Model Studio. In subsequent posts in this series, we’ll illustrate performing specific applications of text analyses using this tool.
With the user interface, once the documents are available, you can create and run a pipeline of text analytics tasks from the menu of nodes on the left.
The center section shows one possible arrangement of nodes connected together in a text analytics pipeline.
The nodes have options that control what they do and how they behave. On the right, the options for the Text Parsing node are shown.
The SAS Visual Text Analytics Solutions page on the SAS website contains a short 5-minute demo of SAS Visual Text Analytics near the bottom of the page. Look for “Get to Know SAS Visual Text Analytics” to find it quickly.
Output tables created using Text Analytics models are often combined with traditional structured data sources to improve predictions of existing models that do not currently use text scores. Adding these features (input variables) to those models can surface previously unknown insights and provide a substantial boost to model precision.
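As a sketch of what “combining” looks like in practice, the Python below appends text-derived scores to structured customer rows by a shared key. The column names, the `customer_id` key, and the score values are all hypothetical; in a real project the scores would come from the output tables of your text models.

```python
def add_text_features(customers, text_scores, default=0.0):
    """Return new rows with the text-derived columns appended to each customer row."""
    feature_names = sorted(
        {name for scores in text_scores.values() for name in scores}
    )
    merged = []
    for row in customers:
        scores = text_scores.get(row["customer_id"], {})
        # Customers with no scored text get the default value for every feature.
        extra = {name: scores.get(name, default) for name in feature_names}
        merged.append({**row, **extra})
    return merged

# Hypothetical structured data and text scores, for illustration only.
customers = [
    {"customer_id": 1, "tenure_months": 12},
    {"customer_id": 2, "tenure_months": 3},
]
text_scores = {1: {"negative_sentiment": 0.8, "topic_billing": 1.0}}

for row in add_text_features(customers, text_scores):
    print(row)
```

The merged rows can then be fed to an existing predictive model as additional input variables.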
For those who prefer to code, keep on the lookout for another post on how to run text analytics actions using code.
Please leave a comment with additional topics you would like our future posts to highlight.
Thank you for reading this post and be sure to enjoy the video showing SAS Visual Text Analytics in action!