Hello – I’ve been using SAS Enterprise Miner (EM) for a long time and am currently attempting to transition over to SAS Viya for Learners (Release 8.5, V.03.05). I need some help with something that I can’t seem to figure out! This is going to be a long post to provide enough details…
I’ve set up a Data Mining and Machine Learning (DM&ML) project in SAS Model Studio within SAS Viya for Learners. My pipeline has a Text Mining Node to process the text data and discover text topics. Then I’m feeding that into a Logistic Regression node. The model performs as expected and the text topics are included in the model without any issues. The pipeline is shown below.
Next, I’d like to review the documents that are grouped into each topic. In SAS EM, this is very easy to do using the Interactive Filter Viewer (which is available in the Text Topic Node). However, that does not seem to exist in Model Studio. As far as I can tell, the results don’t even tell you the number of documents that were assigned to each topic.
As a workaround, I can look at the Node Output Data (in either the Text Mining or Logistic Regression node). The Node Output Data is shown below too. It contains columns labeled as “scores” for each text topic. I believe these are similar to what SAS EM calls “document weights”. So, if I sort to show the largest scores first, then the documents near the top of my dataset should be associated with that topic. I’ve done this for the first topic in the list in the screenshot.
However, the problem is that I can’t find anything in the results that tells me what the document cutoffs are for each topic. I believe there should be document cutoffs that are used to determine whether the association is strong enough to consider that the document belongs to the topic. Without knowing the document cutoff values, I can only say that the first few documents in the sorted column of scores are included in the topic, but I don’t know where the end of the topic is. I also don’t know how many documents are assigned to each topic.
Does anyone know how to find the document cutoff values and/or a better way to identify the documents associated with each topic in a DM&ML project?
I’ve already checked the documentation, and there isn’t much detail provided for the Text Mining node. Any help would be greatly appreciated!
[Screenshot: Pipeline]
[Screenshot: Node Output Data]
Thanks for your reply! To give you more details about my problem setup, I have airline customer comments data. I’d like to run a binary logistic regression (and possibly some other models too). My target is a binary variable where customers have indicated whether they would recommend the airline or not. I have a text variable that’s a long string of free-form text that the customer wrote about their trip experience. For my input variables, I’d like to use the text variable in addition to a few nominal variables about the trip type (for example, the class they flew in, which is economy, premium economy, or first class). Ideally, I’d like to take the route of what was available in SAS EM, where I partition the data, run several different models, and then use a Model Comparison node to identify a champion.
Yes, I do realize there are more text nodes available if you create a Project with Type = Text Analytics. Thanks for mentioning that. I did initially go that route to learn more about Viya and how it compares to EM. The Topics node in that type of project does have a nice interface that does exactly what I’m looking for. However, it has several other limitations that I couldn’t find a workaround for. For example, it doesn’t look like any predictive modeling nodes are available (with the exception of the Categories node). Also, I don’t see a way to partition the data. So, it doesn’t seem like there’s a way in that type of project to do what I’m looking to do, unless I’m missing something.
Hope that gives you more context. Let me know if you have more questions. I appreciate your help!
My understanding of the Text Mining node in DMML is that it just assigns SVD scores to each observation (document). So technically, a particular document is not assigned to a topic; instead, it receives a score for each topic.
Example:
Let's say the TM node identified 3 topics - 3 new variables will be added (COL1, COL2, COL3 with labels identifying the topic) & each observation will receive a score for each topic - so it is possible for a document to get scores like:
COL1 = .33 COL2 = .33 COL3 = .33
You could write some SAS code to "Assign" the document to the topic that has the highest score:
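data topics_assigned;  /* hypothetical output table name */
  set node_output;     /* hypothetical name for the node output data */
  if COL1 > COL2 and COL1 > COL3 then Topic = 1;
  else if COL2 > COL1 and COL2 > COL3 then Topic = 2;
  else if COL3 > COL1 and COL3 > COL2 then Topic = 3;
  else Topic = .;      /* ties for the largest score leave Topic missing */
run;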
(probably a slicker way to do this)
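One possibly slicker version, as a rough sketch: the MAX and WHICHN functions can pick the position of the largest score in one statement, which also scales to more than three topics. The table names node_output and topics_assigned are made up for illustration, and COL1-COL3 are the example score columns described above. Unlike the IF/ELSE chain, a tie resolves to the first column holding the maximum instead of a missing Topic.

```sas
/* node_output and topics_assigned are hypothetical table names. */
data topics_assigned;
  set node_output;
  /* WHICHN returns the position of the first score equal to the row maximum. */
  /* If every score is missing, MAX returns missing and so does WHICHN.       */
  Topic = whichn(max(of COL1-COL3), of COL1-COL3);
run;
```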
Maybe this is no help.
Yes, I believe you are correct about the Text Mining node in DMML assigning SVD scores to each observation (document). I don’t think assigning a document to the topic where it has the largest score will give me what I’m looking for. The software documentation notes that documents can belong to multiple topics, so it seems like I need a different method. I do really appreciate your input and ideas though!
The process was fairly transparent in SAS EM. It would take the scores and use a document cutoff value to determine if a score was high enough to say that a document belonged to a particular topic. The cutoff values were automatically calculated in the software but shown to the user in the Interactive Filter Viewer. I’m not sure if it helps, but I’m attaching a screenshot of what I’m talking about in SAS EM. In the screenshot, if the Topic Weight for a document (shown in the Documents window) is larger than the Document Cutoff value (shown in the Topics window) then the association is considered strong enough to say that the document belongs to the topic. Being able to review the documents associated with a topic is really helpful for understanding the main idea behind a topic. Otherwise, you’re depending on the 5 terms shown in the topic name to try to understand what the topic is about, and that’s not much to go on sometimes.
I was thinking that SAS Model Studio was doing something similar based on what it says in the software documentation, “thresholds are then used to determine whether the association is strong enough to consider if that document or term belongs in the topic. As a result of this, terms and documents can belong to multiple topics”. However, I can’t seem to find those thresholds that it mentions anywhere.
Sorry for all the long posts! I’ve been puzzling over this for days.
I’m posting a final reply to answer my original question. I hope this might help someone else in the future, so I’m going to include a lot of details…
Background: In my original post, I described a limitation of using a Data Mining and Machine Learning (DM&ML) project in SAS Model Studio for text analytics. The Text Mining Node does not provide a way to review the documents (observations) that are associated with each topic. I also previously described a workaround using the Node Output Data from my project. The output data contains columns labeled as “scores” for each text topic. If I sort to show the largest scores first, then the documents near the top of the dataset are associated with that topic (I showed a screenshot of this in my original post). The problem is that I didn’t know where documents associated with a topic end in the sorted list. Based on what I knew from SAS Enterprise Miner, I was looking for a document cutoff value where I could say that all documents with scores greater than the cutoff value could be considered to belong to the topic.
Solution: If you create a Text Analytics project (rather than a DM&ML project) in Model Studio, then there’s a Topics Node. That node does have a nice interface that shows you the documents that are assigned to each topic. Unfortunately, that type of project has some other limitations that I couldn’t get around (no data partition and limited nodes for predictive models), which is why I’m not using it. However, the documentation for the Topics Node describes how it assigns documents to topics. One of its properties is Document Density, which “affects the cutoff for each topic in a way similar to term density. Documents are assigned to a topic if the absolute value of the document weight is above the cutoff. The document density specifies how many standard deviations above the mean of the weights to set the document cutoff.” The default value is one (or one standard deviation above the mean).
So to answer my original question, for each of the columns labeled as “scores” in the node output, I can calculate the mean plus one standard deviation. That will give me a document cutoff value to use for each topic. My solution wasn’t pretty, so I won’t show it here (basically, I downloaded the output data, opened it in Excel, and made the calculations). My results do look reasonable and similar to what I found in SAS EM.
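For anyone who would rather stay in SAS than go through Excel, the same calculation could be sketched roughly as follows. This is not the code I actually ran. The table names (node_output, stats, topic_membership), the three score columns COL1-COL3, and the helper variables are all assumptions for illustration, so adjust them to match your own node output. Comparing the absolute value of the score to the cutoff follows the Topics node documentation quoted above; whether the Text Mining node does the same is an assumption.

```sas
/* Assumed names: node_output is the downloaded node output table; */
/* COL1-COL3 are the topic score columns. Adjust both as needed.   */
%let density = 1;  /* standard deviations above the mean (Topics node default) */

/* Step 1: mean and standard deviation of each score column. */
proc means data=node_output noprint;
  var COL1-COL3;
  output out=stats(drop=_type_ _freq_) mean=mean1-mean3 std=std1-std3;
run;

/* Step 2: attach the statistics to every row, compute the cutoffs, */
/* and flag the documents whose score exceeds each cutoff.          */
data topic_membership;
  if _n_ = 1 then set stats;   /* values are retained for all rows */
  set node_output;
  array score{3}    COL1-COL3;
  array mu{3}       mean1-mean3;
  array sd{3}       std1-std3;
  array cutoff{3}   cutoff1-cutoff3;
  array in_topic{3} in_topic1-in_topic3;
  do k = 1 to 3;
    cutoff{k}   = mu{k} + &density * sd{k};
    in_topic{k} = (abs(score{k}) > cutoff{k});  /* 1 = belongs to topic k */
  end;
  drop k mean1-mean3 std1-std3;
run;

/* Step 3: number of documents assigned to each topic. */
proc means data=topic_membership sum maxdec=0;
  var in_topic1-in_topic3;
run;
```

Because each topic gets its own 0/1 flag, a document can belong to several topics or to none, which matches the documentation’s note that documents can belong to multiple topics.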
Of course, I’m assuming that the Text Mining node in a DM&ML project works in a similar way to the Topics node in a Text Analytics project. But that’s the best I can do for now. So, there you have it. I’d love to hear from others in the future if anyone else tries this out. Thanks to @tom_grant for giving me the idea to consider calculations for assigning documents to topics using their scores!