I’m posting a final reply to answer my original question. I hope it helps someone else in the future, so I’m going to include a lot of detail.

Background: In my original post, I described a limitation of using a Data Mining and Machine Learning (DM&ML) project in SAS Model Studio for text analytics: the Text Mining Node does not provide a way to review the documents (observations) associated with each topic. I also described a workaround using the Node Output Data from my project. The output data contains columns labeled as “scores” for each text topic. If I sort a topic’s score column to show the largest scores first, then the documents near the top of the dataset are associated with that topic (I showed a screenshot of this in my original post). The problem is that I didn’t know where the documents associated with a topic end in the sorted list. Based on what I knew from SAS Enterprise Miner, I was looking for a document cutoff value, so I could say that all documents with scores greater than the cutoff belong to the topic.

Solution: If you create a Text Analytics project (rather than a DM&ML project) in Model Studio, there’s a Topics Node, which does have a nice interface showing the documents assigned to each topic. Unfortunately, that project type has other limitations I couldn’t get around (no data partition and a limited set of nodes for predictive models), which is why I’m not using it. However, the documentation for the Topics Node describes how it assigns documents to topics. One of its properties is Document Density, which “affects the cutoff for each topic in a way similar to term density. Documents are assigned to a topic if the absolute value of the document weight is above the cutoff. The document density specifies how many standard deviations above the mean of the weights to set the document cutoff.” The default value is one (that is, one standard deviation above the mean).
So, to answer my original question: for each of the columns labeled as “scores” in the node output, I can calculate the mean plus one standard deviation, which gives me a document cutoff value for that topic. My solution wasn’t pretty, so I won’t show it here (basically, I downloaded the output data, opened it in Excel, and did the calculations there). My results look reasonable and similar to what I found in SAS EM. Of course, I’m assuming that the Text Mining node in a DM&ML project assigns documents the same way as the Topics node in a Text Analytics project, but that’s the best I can do for now. So, there you have it. I’d love to hear from anyone else who tries this out in the future. Thanks to @tom_grant for giving me the idea to consider calculations for assigning documents to topics using their scores!
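If anyone wants to skip the Excel step, here is a rough sketch of the same calculation in Python with pandas, run on the downloaded output data. The column names and score values below are made up for illustration (your node output will use its own naming), and it assumes the cutoff rule works the way the Topics Node documentation describes: mean of the absolute weights plus Document Density standard deviations.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded node output data;
# real column names will differ.
df = pd.DataFrame({
    "doc_id": [1, 2, 3, 4, 5, 6],
    "_TextTopic_1": [0.90, 0.15, 0.60, 0.05, 0.72, 0.10],
    "_TextTopic_2": [0.05, 0.80, 0.10, 0.65, 0.02, 0.55],
})

density = 1  # Document Density property: std devs above the mean

for col in [c for c in df.columns if c.startswith("_TextTopic_")]:
    scores = df[col].abs()  # docs are assigned on the absolute weight
    cutoff = scores.mean() + density * scores.std()
    assigned = df.loc[scores > cutoff, "doc_id"].tolist()
    print(f"{col}: cutoff={cutoff:.3f}, assigned docs={assigned}")
```

With these made-up scores, only the document whose score stands well above the rest of the column clears each topic's cutoff, which matches the behavior I saw when I sorted the scores by hand.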