JinHong
Calcite | Level 5

Hi all,

 

I am currently using SAS Enterprise Miner 13.2 to conduct some unsupervised machine learning, and I noticed that the Text Topic node failed to classify some of the input documents. May I know if there is a way to get the Text Topic node to classify all of the input documents?

 

Also, I would like to ask if there is a way to have the Text Topic node determine the optimal number of clusters for classifying the documents. If this is not possible, how do you generally choose the number of clusters? For the moment, I'm using the number of clusters obtained by the Text Cluster node, and I'm unsure whether this is recommended.

 

Thank you guys for your help and have a nice day!

3 REPLIES
RussAlbright
SAS Employee

JinHong,

 

If you want complete coverage (every document assigned to a topic), you could look at clustering rather than topics. You do have some control over topics through macro variables that you can set in your start-up code. Take a look at these two, found in the Text Miner documentation under "Macro Variables, Macros, and Functions":

 

TMM_DOCCUTOFF (default: 0.001)
The document cutoff value for any user-created topic. It is used to determine the default document cutoff for user topics (excluding those that are modified multi-term or single-term topics) in the Topic table. Higher values decrease the number of documents assigned to a topic.

TMM_TERM_CUTOFF (no default)
The term cutoff value for any user-created or multi-term topic. It is used to determine the default term cutoff for user topics (excluding those that are modified multi-term or single-term topics) and for multi-term topics in the Topic table. Higher values decrease the number of documents assigned to a topic. If this macro variable is blank or not set, then the mean topic weight + 1 standard deviation is used as the topic cutoff for each topic.
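A minimal sketch of how these could be set in your Enterprise Miner project start code (the values shown are illustrative placeholders, not recommendations):

```sas
/* Illustrative start-up code only: the actual values should be tuned
   for your corpus. */
%let TMM_DOCCUTOFF = 0.0005; /* lower than the 0.001 default, so more
                                documents get assigned to each topic */
%let TMM_TERM_CUTOFF = ;     /* left blank, so mean topic weight + 1
                                standard deviation is used per topic */
```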

 

As far as the optimal number of clusters goes, SAS Text Miner uses a heuristic based on your maximum number of dimensions, taking a certain percentage of variance explained from that. Ideally we would take the percentage from the complete SVD, not the truncated one, but that is computationally infeasible with large text collections. I always treat this value as one to be tuned, typically along with the entries on my stop list. I experiment with changing the number of topics from roughly 5 to 25, and when I find a setting that seems useful, I also look at the descriptive terms for each topic and add terms to the stop list that seem uninformative in context. Repeat until you get some useful insights.
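A rough sketch of that experiment as a macro loop over PROC HPTMINE (the dataset and variable names work.docs, docid, and text are placeholders, and the statement options should be checked against your version's documentation):

```sas
/* Sketch only: fit SVDs of several sizes and keep the term tables
   and V matrices for inspection. All names here are illustrative. */
%macro try_topics;
  %do k = 5 %to 25 %by 5;
    proc hptmine data=work.docs;
      doc_id docid;
      var text;
      parse outterms=work.terms_k&k;
      svd k=&k svdv=work.v_k&k;
    run;
  %end;
%mend try_topics;
%try_topics
```

After each run, review the top descriptive terms per topic, grow the stop list, and re-run before settling on a topic count.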

 

 


Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

eserates
Fluorite | Level 6

In addition to your great explanation, can you please tell me how to make use of the SVD matrices generated by PROC HPTMINE to discover assigned topics? The matrix V has numbers showing the association strength for each topic; I can sort these and find out which ones have the highest values. However, I used the TOPICS option to output topic names, and it has only _termcutoff_ rates for each topic. Where can I find the _documentcutoff_ rates? I guess matrix V (SVDV) can be used to decide whether an ID tied to a particular topic has a membership or not?

Thanks  
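(For what it's worth, the membership check I have in mind would look something like this sketch, where the variable names docid and COL1 and the 0.25 cutoff are assumptions rather than anything from the HPTMINE documentation:)

```sas
/* Illustrative only: rank documents by their topic-1 association
   strength, then flag those above an assumed cutoff as members. */
proc sort data=work.svdv out=work.topic1_ranked;
  by descending COL1;        /* COL1 = assumed topic-1 score column */
run;

data work.topic1_members;
  set work.topic1_ranked;
  member = (COL1 > 0.25);    /* 1 if above the assumed cutoff */
run;
```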

RussAlbright
SAS Employee

See the answer to this follow-up question here:

https://communities.sas.com/t5/SAS-Text-and-Content-Analytics/Proc-Hptmine-What-is-the-formula-behin...

Thanks



