Fluorite | Level 6

Proc Hptmine: What is the formula behind assigning a topic to a document

I am using PROC HPTMINE. It generates several output tables, such as the SVD matrices, docpro, terms, parent, topics, etc.

I joined some of these tables to find the topic assigned to each document, using the term cutoff rates.

However, I did not get the same results as Text Miner does in E-Miner.

Can anyone tell me how to assign a document to a particular text topic? There must be a formula that uses thresholds to do so.

Thanks

SAS Employee

Re: Proc Hptmine: What is the formula behind assigning a topic to a document

Here is a quick summary:

The U factor is number-of-terms by number-of-topics. For each column (topic), calculate the mean and standard deviation of the absolute values of the entries. I believe the default cutoff is 1 standard deviation above the mean. Set every entry whose absolute value is below that cutoff to zero. Now re-form the document projections from your updated U. Then repeat the procedure on that result; this time you will be applying it to documents.
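The thresholding step described here can be sketched in plain Python. This is only an illustration of the idea, not the code that Text Miner actually runs: the function names `column_cutoffs` and `truncate` are made up, the numbers in `U` are toy values, and the use of the sample standard deviation (n-1) and a `>=` comparison are assumptions.

```python
from statistics import mean, stdev

def column_cutoffs(matrix, n_std=1.0):
    """Per-column cutoff: mean + n_std * std of the absolute values.

    Assumption: sample standard deviation (n-1 denominator).
    """
    cutoffs = []
    for col in zip(*matrix):  # iterate over columns (topics)
        abs_col = [abs(v) for v in col]
        cutoffs.append(mean(abs_col) + n_std * stdev(abs_col))
    return cutoffs

def truncate(matrix, cutoffs):
    """Set to zero every entry whose absolute value is below its column cutoff."""
    return [[v if abs(v) >= c else 0.0 for v, c in zip(row, cutoffs)]
            for row in matrix]

# Toy U factor: 4 terms (rows) by 2 topics (columns); values are made up.
U = [[0.9, 0.1],
     [-0.1, 0.1],
     [0.1, -0.8],
     [0.1, 0.1]]

cutoffs = column_cutoffs(U)      # one cutoff per topic
U_trunc = truncate(U, cutoffs)   # only the dominant term per topic survives here
```

The same two functions could then be applied to the re-formed document projections (documents by topics) to get the document-level cutoffs.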

Russ

sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

Fluorite | Level 6

Re: Proc Hptmine: What is the formula behind assigning a topic to a document

Thank you. I will try this and let you know if it works.

Fluorite | Level 6

Re: Proc Hptmine: What is the formula behind assigning a topic to a document

Hi again, thanks for your last response. It definitely helped me make progress. Although this is a good step towards finding the topics assigned to documents, I still cannot match the topics assigned by Text Miner in E-Miner.

Instead of using the U matrix for the calculations you mentioned above, I used the DOCPRO output from HPTMINE, since this U matrix is the projection of terms onto documents. Do you think it was OK to use it?

My second question: the TOPICS data set lists termcutoff rates. Can I use those rates in conjunction with the V matrix, by comparing them against the values in the V matrix? Or do those cutoff rates need to be compared to some other values?

SAS Employee

Re: Proc Hptmine: What is the formula behind assigning a topic to a document

You have to truncate the U matrix using the technique I described, then re-form the docpro data set. PROC HPTMINE does not do this. The process has quite a few steps and may be a challenge to re-implement. Have you considered just saving out the SAS code from your flow? Depending on what you're trying to accomplish, this code will allow you to submit the whole flow programmatically.

You would apply the termcutoffs to U, not V. U is number of terms by number of topics. Then you re-form docpro and apply the doccutoffs to docpro.
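A minimal Python sketch of that order of operations follows. It is an illustration under stated assumptions, not the Text Miner implementation: the document-by-term weight matrix `doc_term`, the plain matrix product used to re-form docpro (with no normalization), the `>=` comparison, and all function names and numbers are hypothetical. The cutoff lists stand in for the per-topic values from the TOPICS data set.

```python
def apply_cutoffs(matrix, cutoffs):
    """Zero every entry whose absolute value is below its column (topic) cutoff."""
    return [[v if abs(v) >= c else 0.0 for v, c in zip(row, cutoffs)]
            for row in matrix]

def project_documents(doc_term, u_trunc):
    """Re-form docpro: docpro[d][k] = sum over terms t of doc_term[d][t] * u_trunc[t][k].

    Assumption: a plain matrix product; the real procedure may also normalize.
    """
    n_topics = len(u_trunc[0])
    return [[sum(w * u_row[k] for w, u_row in zip(doc_row, u_trunc))
             for k in range(n_topics)]
            for doc_row in doc_term]

def assign_topics(docpro, doc_cutoffs):
    """For each document, list the topic indices whose |projection| meets the cutoff."""
    return [[k for k, (v, c) in enumerate(zip(row, doc_cutoffs)) if abs(v) >= c]
            for row in docpro]

# Toy inputs (all values made up): 3 terms by 2 topics, and 3 documents by 3 terms.
U = [[0.9, 0.1], [0.2, -0.8], [0.05, 0.1]]
doc_term = [[2, 0, 1], [0, 1, 0], [0, 0, 3]]
term_cutoffs = [0.5, 0.5]   # stand-ins for the termcutoff values, one per topic
doc_cutoffs = [1.0, 0.5]    # stand-ins for the document cutoff values, one per topic

U_trunc = apply_cutoffs(U, term_cutoffs)        # step 1: term cutoffs on U, not V
docpro = project_documents(doc_term, U_trunc)   # step 2: re-form docpro
topics = assign_topics(docpro, doc_cutoffs)     # step 3: document cutoffs on docpro
```

In this sketch a document can end up with several topics or with none, because each topic is thresholded independently.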

Fluorite | Level 6

Re: Proc Hptmine: What is the formula behind assigning a topic to a document

Hi,

Last time you mentioned the mprint option, but that did not give me much of an idea about the code used by Miner. There were many macros called. Are you talking about using the SAS Code node in E-Miner, which needs to be connected to the Text Topic node?

I have not tried that before. Can you please tell me how or where to find instructions on saving the flow code?

thanks

SAS Employee

Re: Proc Hptmine: What is the formula behind assigning a topic to a document

I am talking about built-in macros that are called to do the parts of the computation that the procedure does not do. If you right-click on a node in your flow and choose "Export path to sas code", you can save the code that is run when your flow runs. If you look at that code, you will see the names of these macros. Also, if you add

options mprint;

when you run the path, you will see a printout of many of the macros executing. The actual source of these macros is not visible otherwise.

Russ