This article showcases how to use the Text Mining node in SAS Model Studio and how the node is used automatically in Model Studio’s Automated Pipeline Creation when a variable with a “text” role is detected.
Overview
Text mining is the process of analyzing unstructured data (for example, free-form text from the web, comment fields, and other text sources) to reveal patterns and insights. By transforming the data into a structured format, a user can apply that information in further analysis, such as building predictive models.
The SAS Model Studio Text Mining node lets users process text information and create quantitative representations of the text (a singular value decomposition, or SVD, matrix) for further use, such as modeling. By using the additional information in the text field, you can often improve the predictive ability of your models.
One variable must contain the text. If multiple text variables are available in the data, the Text Mining node uses the one with the largest length by default. A user can override this default by rejecting the other variables with the “text” role in the Data tab, which ensures that only the desired variable is used as an input for the Text Mining node.
Examples
In this demonstration, I'll show two examples of how the Text Mining node can be used in SAS Model Studio to create new quantitative features for further analysis. The first example uses the Text Mining node in a Model Studio pipeline, and the second shows how the node is incorporated in Automated Pipeline Creation.
There are currently three text variables in the “Wine_Quality_Reviews” dataset, and the Text Mining node will by default use the “Description” variable for analysis because it has the largest length. The Description variable contains free-form, unstructured wine reviews. As mentioned above, you can override this default in the Data tab by rejecting the other two text variables and keeping only the one you want to use. For this example, I will go with the default.
Once the target (Varietal – which represents the variety of wine) is assigned, our next step is to add a Text Mining node under the Pipelines tab by right-clicking on the data node and selecting the Text Mining node from the “Data Mining Preprocessing” group.
When you select the Text Mining node, the Options pane appears on the right.
There are several parsing options available, and the image above shows the defaults. By selecting “Include parts of speech,” a user is telling the node to use part-of-speech tagging in the parsing process. The “Extract noun groups” option lets users extract noun groups from the chosen text variable during parsing. Similarly, the “Extract entities” option lets users extract entities. The “Stem Terms” option specifies whether to treat different terms that share the same root as equivalent terms.
The “Minimum number of documents” property, as the name suggests, specifies the minimum number of documents a term must appear in to be kept.
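To make the idea concrete, here is a toy Python sketch of document-frequency filtering. This is an illustration only, not SAS code or the node's actual implementation; the function name and threshold are mine.

```python
# Toy illustration of the "keep terms that appear in at least N documents"
# idea behind the node's minimum-document property. (Hypothetical helper,
# not the node's algorithm.)

def kept_terms(documents, min_docs=2):
    """Return the set of terms that appear in at least `min_docs` documents."""
    doc_freq = {}
    for doc in documents:
        # Count each term at most once per document.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, n in doc_freq.items() if n >= min_docs}

reviews = [
    "crisp apple and citrus notes",
    "ripe apple with honey",
    "bold tannins and dark fruit",
]

# Only terms shared by at least two reviews survive.
print(sorted(kept_terms(reviews, min_docs=2)))  # ['and', 'apple']
```

Terms that occur in only a single document carry little statistical signal, which is why filtering them out is a common preprocessing step.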
Starting with SAS Viya 4 2021.1.3, a user has the option to upload their own lists, such as a stop list or a start list, under the “Custom Lists” property. In releases before SAS Viya 4 2021.1.3, the default lists are automatically included and applied for all supported languages in the Text Mining node.
Lastly, the “Topic Discovery” property lets the user either automatically determine the number of topics generated or lets the user manually specify the number of topics to be discovered.
For this example, I am using the default settings to run the Text Mining node. The Results include several windows, such as the table of topics generated, Kept Terms, and Dropped Terms. The Kept Terms and Dropped Terms tables list the terms that were used and ignored, respectively, during the text analysis.
Notice that several terms include a plus sign, which indicates stemming: reducing words to their word stem or root.
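As a rough illustration of what stemming does, here is a naive suffix-stripping function in Python. The real stemmer used by the node is language-aware and far more sophisticated; this sketch only shows the idea that related word forms collapse to one term.

```python
# Toy suffix-stripping stemmer (illustration only; NOT the Text Mining
# node's actual stemming algorithm).

def naive_stem(word):
    """Strip a few common English suffixes so related terms share a stem."""
    for suffix in ("ing", "es", "ed", "s"):
        # Require a reasonably long remaining stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "finish", "finishes", and "finished" all collapse to the stem "finish".
for w in ("finish", "finishes", "finished"):
    print(w, "->", naive_stem(w))
```

After stemming, the frequency counts for all the variants accumulate on a single term, which is what makes the “+ term” entries in the Kept Terms table more statistically useful.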
Let’s take a look at the Topics table output. There are 25 topics, created from groups of terms that frequently occur together. Each term-document pair is assigned a score for every topic, and thresholds are then used to determine whether the association is strong enough to consider that term or document as belonging to the topic. Because of this, terms and documents can belong to multiple topics.
These 25 topics are essentially 25 new columns in the output table. The new columns contain singular value decomposition (SVD) scores, which are numeric (interval) variables, and they will be used as inputs in subsequent nodes. You can also preview the output data and variables table via the Output Data tab, where COL1 to COL25 represent the new features generated by the Text Mining node.
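To see how an SVD turns text into numeric features, here is a minimal NumPy sketch: a small term-document count matrix is decomposed, and each document gets k topic scores, analogous in spirit to the COL1 to COL25 features above. This is an assumption-laden illustration (raw counts, no weighting or thresholds), not the node's actual algorithm.

```python
# Minimal sketch: SVD of a term-document matrix yields numeric per-document
# "topic" scores. Illustration only, not the Text Mining node's method.
import numpy as np

docs = [
    "crisp apple citrus",
    "ripe apple honey",
    "bold tannins oak",
    "dark fruit oak tannins",
]

# Build a simple term-document count matrix (documents in rows).
vocab = sorted({t for d in docs for t in d.split()})
counts = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float)

# Truncated SVD: keep the first k singular vectors as "topics".
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_scores = U[:, :k] * s[:k]  # one row per document, one column per topic

print(doc_scores.shape)  # (4, 2): each document now has 2 interval features
```

Each document is thus summarized by a handful of interval variables that downstream modeling nodes can consume just like any other numeric input.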
Now, just for illustration purposes, let’s add two Decision Tree modeling nodes: one following the Text Mining node and another directly after the Data node. In practice, you would add multiple models for comparison, but for this example I will compare these two Decision Trees to show how a subsequent node uses the new features generated by the Text Mining node.
Once the run is complete, let’s take a look at the Model Comparison node. You'll notice that the Decision Tree node that follows the Text Mining node was the champion model in this case. Adding these new text features does not always guarantee a performance increase, but it is worth extracting meaningful information from unstructured data to try to boost the model’s performance.
In our case, it clearly has helped improve the Decision Tree’s performance.
Looking at the results from the Decision Tree node that follows the Text Mining node, notice how some of the new features generated by the Text Mining node made it to the top of the variable importance list.
Automated Pipeline Creation
In this example, let's look at how SAS Model Studio can be used to dynamically build a pipeline for you that is based on the data.
When you create a new pipeline, choose the “Automatically generate the pipeline” option.
In our case, Automated Pipeline Creation automatically detects the “text” role assigned to one of the variables and performs data preparation, model building, hyperparameter tuning, model comparison, and model selection on your data to create a pipeline.
Let's take a look at the final pipeline generated. Note that you can always unlock the pipeline, change a node’s properties to the desired values, and then run the pipeline again for updated results.
Notice how the Data node is followed by the Text Mining node because the data contain a variable with a text role. In this case, Automated Pipeline Creation ran several models, including an Ensemble node, to determine the best model. The Model Comparison results show that the Gradient Boosting model (which follows the Imputation node in the final pipeline) is the champion, so let's take a peek at its results.
Looking at the Gradient Boosting node (champion model) results, we can see that many of the features generated by the Text Mining node (and transformed by the Transformation and Imputation nodes in the flow) have made it to the top of the variable importance list here as well.
Lastly, a note for when the data contain only one input variable in addition to the target. If that variable happens to have a “text” role, Automated Pipeline Creation uses the newly generated interval features (SVD scores) from the Text Mining node as the inputs for all subsequent nodes to create the final pipeline.
Summary
This article showed how to gain useful information from unstructured data using SAS Model Studio’s Text Mining node and gave an overview of its properties. It also covered an example of SAS Model Studio Automated Pipeline Creation, which automatically detects “text” variables and extracts new features for subsequent use.
This article offers a glimpse into this handy node available in Model Studio. Hopefully, it will encourage users to take advantage of it to further enhance their data mining and model-building process.
Additional Resources
Video tutorial: Gradient boosting explained
Video tutorial: How to compare models in SAS