JinHong
Calcite | Level 5

Hi all,

 

I am currently using SAS Enterprise Miner 13.2 for some unsupervised machine learning, and I noticed that the Text Topic node fails to classify some of the input documents. Is there a way to make the Text Topic node classify all of the input documents?

 

Also, is there a way to let the Text Topic node determine the optimal number of clusters for classifying the documents? If this is not possible, how do you generally choose the number of clusters? For the moment, I am using the number of clusters obtained from the Text Cluster node, and I am unsure whether this is recommended.

 

Thank you guys for your help and have a nice day!

3 REPLIES
RussAlbright
SAS Employee

JinHong,

 

If you want complete coverage, that is, every document belonging to a topic, you could look at clustering rather than topics. You do have some control over topics through macro variables that you can set in your startup code. Take a look at these two, found in the Text Miner documentation under "Macro Variables, Macros, and Functions":

 

TMM_DOCCUTOFF (default: 0.001)
Document cutoff value for any user-created topic. It is used to determine the default document cutoff for user topics (excluding those that are modified multi-term or single-term topics) in the Topic table. Higher values decrease the number of documents assigned to a topic.

TMM_TERM_CUTOFF (no default value listed)
Cutoff value for any user-created or multi-term topic. It is used to determine the default term cutoff for user topics (excluding those that are modified multi-term or single-term topics) and for multi-term topics in the Topic table. Higher values decrease the number of documents assigned to a topic. If this macro variable is set to blank or not set, then the mean topic weight + 1 standard deviation is used as the topic cutoff for each topic.
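
To make the effect of a cutoff concrete, here is a minimal Python sketch of the "mean topic weight + 1 standard deviation" rule described above. It is only an illustration, not SAS Text Miner's internal code: the document ids and weights are made up, the helper names are hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def default_cutoff(weights):
    """Mean weight plus one standard deviation.

    Assumption: population standard deviation; the source does not say
    which variant SAS Text Miner uses.
    """
    return mean(weights) + pstdev(weights)

def assign(weights_by_doc, cutoff):
    """Return the ids of documents whose topic weight meets the cutoff."""
    return [doc for doc, w in weights_by_doc.items() if w >= cutoff]

# Hypothetical weights of five documents on a single topic.
topic_weights = {"d1": 0.02, "d2": 0.40, "d3": 0.05, "d4": 0.35, "d5": 0.01}

cutoff = default_cutoff(list(topic_weights.values()))
assigned = assign(topic_weights, cutoff)
unassigned = sorted(set(topic_weights) - set(assigned))

print(f"cutoff={cutoff:.3f} assigned={assigned} unassigned={unassigned}")

# A much lower cutoff, such as the 0.001 default listed for TMM_DOCCUTOFF,
# lets every document in this toy example through.
print(assign(topic_weights, 0.001))
```

In this toy data only two of the five documents clear the mean-plus-one-standard-deviation cutoff, which is the same kind of gap in coverage the original question describes. Lowering the cutoff widens coverage, at the price of weaker document-to-topic associations.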

 

As for the optimal number of clusters, SAS Text Miner uses a heuristic based on your maximum number of dimensions, taking a certain percentage explained from that. Ideally we would take the percentage from the complete SVD rather than the truncated one, but that is not computationally feasible with large text collections. I always treat this value as one to be tuned, typically along with the entries on my stop list. I experiment with changing the number of topics, from 5 to 25 or so, until I find a value that seems useful. I also look at the descriptive terms for the topics and add terms to the stop list that seem uninformative given the context. Repeat until you get some useful insights.
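
A rough Python sketch of a "percentage explained" heuristic of this general kind follows. It is not the actual SAS Text Miner rule: the singular values are invented, the function name is hypothetical, and both the use of squared singular values and the 90% share are assumptions chosen for illustration.

```python
def dims_for_share(singular_values, share):
    """Smallest k whose leading squared singular values reach `share` of the total.

    The total is taken over the truncated set passed in, not the full SVD,
    so the answer depends on the maximum number of dimensions computed.
    """
    energy = [s * s for s in singular_values]
    total = sum(energy)
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running >= share * total:
            return k
    return len(energy)

# Hypothetical singular values from a truncated SVD capped at 6 dimensions.
truncated = [9.0, 6.0, 4.0, 3.0, 2.0, 1.0]

k = dims_for_share(truncated, 0.90)
print(k)
```

Because the total comes from the truncated decomposition, it leaves out whatever variation lies beyond the maximum number of dimensions. That is one reason to treat the resulting number as a starting point for tuning rather than as an optimum.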

 

 



eserates
Fluorite | Level 6

In addition to your great explanation, can you please tell me how to use the SVD matrices generated by PROC HPTMINE to discover the assigned topics? The matrix V contains numbers that show the association strength for each topic. I can sort these and find out which ones have the highest values. However, when I used the TOPICS option to output topic names, the output contained only the _termcutoff_ value for each topic. Where can I find the _documentcutoff_ values? I guess matrix V (SVDV) can be used to decide whether an ID tied to a particular topic is a member of that topic or not?

Thanks  

RussAlbright
SAS Employee

See the answer to this follow-up question here:

https://communities.sas.com/t5/SAS-Text-and-Content-Analytics/Proc-Hptmine-What-is-the-formula-behin...

Thanks

