topic Re: Text Mining small obs but large text in SAS Data Science

Text Mining small obs but large text

jaredp — Thu, 11 Jul 2013 16:14:25 GMT

Does anyone have any tips on using Text Topics to discover themes in a small number of lengthy documents?

I have 25 documents. Each has about 35 subsections across 4 main sections. A subsection is usually a couple of paragraphs. The largest document is 205K characters, the smallest is 20K, and most hover around 70K.

My goal is to look for similarities across documents and/or sections and/or subsections.

Text Topic is likely the node of choice.

Has anyone had any experience using TM on a small data set like this? Is the exercise futile? Does anyone have any suggestions?

Re: Text Mining small obs but large text

sgarla — Thu, 01 Aug 2013 17:34:16 GMT

I do not see any problem with the number of documents; it depends on your objective. If you are trying to identify similarities between subsections, then write a short program to extract each subsection and treat it as a single document. That will give you enough observations (documents): roughly 25*35 = 875.

If all your documents follow a standard template, then you can easily use Perl regular expressions to extract each subsection and create a new data set with subsections as records.
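As a rough illustration of the extraction step, here is a minimal Python sketch. It assumes a hypothetical template in which every subsection starts with a heading line such as "1.2 Title"; the heading pattern, the field names, and the sample text are all made up for the example, and a real template would need its own pattern.

```python
import re

# Assumed (hypothetical) template: a subsection heading looks like "1.2 Title".
HEADING = re.compile(r"^(\d+)\.(\d+)[ \t]+(.+)$", re.MULTILINE)

def split_subsections(doc_id, text):
    """Return one record (dict) per subsection found in text."""
    matches = list(HEADING.finditer(text))
    records = []
    for i, m in enumerate(matches):
        # The body runs from the end of this heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        records.append({
            "doc_id": doc_id,
            "section": int(m.group(1)),
            "subsection": int(m.group(2)),
            "title": m.group(3).strip(),
            "body": text[m.end():end].strip(),
        })
    return records

# Made-up sample document, for illustration only.
sample = """1.1 Background
Some text about background.

1.2 Scope
Scope paragraph one.
Scope paragraph two.

2.1 Methods
Methods text.
"""

rows = split_subsections("doc01", sample)
for row in rows:
    print(row["doc_id"], row["section"], row["subsection"], row["title"])
```

Each record can then be written out as one observation, so the text mining step sees subsections rather than whole documents.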

Re: Text Mining small obs but large text

jaredp — Fri, 02 Aug 2013 16:39:50 GMT

I appreciate the follow-up. What I ended up doing was breaking things down by subsection, and I get much better results this way. You hit the nail on the head with "Depends on your objective". Once I stood back to look at the main objectives, it became much clearer how the data could be reshaped for analysis.

Re: Text Mining small obs but large text

art297 — Fri, 02 Aug 2013 17:03:20 GMT

Looks like you already came up with a way to reshape the data. But do you really want to stop at topics, or do you actually want to run a cluster analysis on the topics?

Re: Text Mining small obs but large text

jaredp — Fri, 02 Aug 2013 20:02:40 GMT

For the time being we are focusing on topics.

Originally, I was unsure if the cluster analysis would be beneficial. At that time my data was wide (35 vars, 25 obs). But when I transposed the dataset to treat each document as a variable, I began thinking that clustering may reveal some common themes across the sections - this is one of the objectives of my analysis.

Truthfully, to answer your question, I'd have to say "I don't know".

Re: Text Mining small obs but large text

BradHaines — Tue, 20 Aug 2013 16:49:00 GMT

Have you ever tried to run a cluster analysis on text topics? I am trying to come up with a way to identify changes in topics over time and think this may be an approach to consider. But, I am having a hard time figuring out where to start.

I have a weekly (could be daily or monthly) collection of documents that I run through SAS Text Miner, including the Text Topic node. The end result is a number of multi-term topics identified by SAS Text Miner running unsupervised. I would like to compare one week to another to see what is changing. The Text Topic results show me each topic with its first 5 terms, but I know that there are additional terms in each topic. Could you use these terms (the first 5 or all of them) in a cluster analysis to see how similar they are to the topics generated in the following week?

Is there a more obvious approach that I am missing?

Thanks.

Re: Text Mining small obs but large text

jaredp — Tue, 20 Aug 2013 17:33:13 GMT

Hmmm... using the first 5 terms. One question that comes to mind: what if one term gives way to another over time, but the two are synonyms? The approach might work with a growing synonym list, but then it is no longer unsupervised.

You can run, in tandem, the Text Topic and Text Cluster nodes. This will give you your Topics as well as generated SVD values. I'm not an expert with Singular Value Decomposition (SVD), but I have a strong sense that if you want to measure changes in your corpus over time, then a solution might be to use the SVD values (i.e., TextCluster_SVD1, TextCluster_SVD2...TextCluster_SVDn).

This paper might have some similarities to what you want to do: http://www.scsug.org/SCSUGProceedings/2009/Liang_Xie1.pdf

One can brush up on SVD here: http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf

and some nice insight here too: http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf

I'd love if you kept us informed about any solutions you apply.

Re: Text Mining small obs but large text

sgarla — Tue, 20 Aug 2013 17:53:11 GMT

Doing cluster analysis on the text topics may not be a good idea if you want to understand the trend in topics. Cluster analysis would group your records and that means solving a different problem.

If I understand your objective correctly, I think you can achieve it by creating user-defined topics rather than trying to track the system-generated topics.

First run the Text Topic node on your first set of comments (Day 1/Week 1/Month 1), then look at the multi-term topics. From this you will get a business sense of what topics are generated. If they do not make sense, modify them and create them as user-defined topics.

Say you end up with user-defined topics like:

Topic 1: +big data, +data, database, high, performance

Topic 2: statistics, +models, visual, analytics, data

Once you have your topics, apply the same user-defined topics to each of the remaining time periods, exactly as defined for Day 1/Week 1/Month 1 (as above).

For each time period just look at the frequency of documents for each topic. That should give you a sense of how the topics trend over time.
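The counting step can be sketched outside SAS as follows. This is only an illustration under several assumptions: the two example topics are reduced to plain term lists (the "+" markers are dropped and no stemming is done), a document counts toward a topic when it contains at least `MIN_TERMS` of the topic's terms (an arbitrary cutoff, not how the Text Topic node scores documents), and the weekly documents are invented.

```python
import re
from collections import Counter

# Example topics from the post, reduced to plain lower-case term lists (assumption).
TOPICS = {
    "Topic 1": ["big data", "data", "database", "high", "performance"],
    "Topic 2": ["statistics", "models", "visual", "analytics", "data"],
}
MIN_TERMS = 2  # assumed cutoff: topic terms a document must contain to count

def has_term(term, text):
    """True if term occurs in text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

def topics_for(doc):
    """Names of the topics that a single document is assigned to."""
    text = doc.lower()
    return [name for name, terms in TOPICS.items()
            if sum(has_term(t, text) for t in terms) >= MIN_TERMS]

def topic_counts(docs_by_period):
    """Number of documents per topic for each time period."""
    return {period: Counter(t for doc in docs for t in topics_for(doc))
            for period, docs in docs_by_period.items()}

# Invented documents, for illustration only.
docs_by_period = {
    "week1": ["Our database performance is poor under high load.",
              "The visual analytics models look good."],
    "week2": ["Big data performance tuning.",
              "New statistics models for analytics.",
              "Visual analytics and statistics dashboards."],
}

counts = topic_counts(docs_by_period)
for period, counter in counts.items():
    print(period, dict(counter))
```

Plotting these counts (or their share of all documents) by period gives the trend line for each topic.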

I guess this is a decent approach to start with.

Re: Text Mining small obs but large text

BradHaines — Tue, 20 Aug 2013 21:21:42 GMT

My primary objective is to find new issues in the data - things that we haven't seen before. So, creating user topics won't work because that would only include things that I have seen before in the data.

I was hoping to use the terms identified in the Text Topics node for the comparison. From one time period to the next you would likely see some topics that are exactly the same (5 terms all the same). You would also see some that changed (2 or 3 terms the same) and then some that are totally new terms. I would like to score each of the topics in the new time period based on how it compares to the prior months.
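One way to make that comparison concrete is to score each new topic by its term overlap with the prior period's topics. The sketch below uses Jaccard similarity on the term sets, which is my choice of measure rather than anything Text Miner provides; the topic names, the terms, and the three labels ("unchanged", "changed", "new") are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score_new_topics(prior, current):
    """For each current topic, find the most similar prior topic and its score."""
    results = {}
    for name, terms in current.items():
        best_name, best_score = None, 0.0
        for prior_name, prior_terms in prior.items():
            score = jaccard(terms, prior_terms)
            if score > best_score:
                best_name, best_score = prior_name, score
        results[name] = (best_name, best_score)
    return results

def label(score):
    """Assumed labelling: identical term sets, partial overlap, or no overlap."""
    if score == 1.0:
        return "unchanged"
    return "changed" if score > 0 else "new"

# Invented top-5 term lists, for illustration only.
prior = {
    "A": ["battery", "charge", "drain", "power", "life"],
    "B": ["screen", "crack", "glass", "replace", "repair"],
}
current = {
    "X": ["power", "battery", "life", "drain", "charge"],
    "Y": ["screen", "crack", "glass", "warranty", "claim"],
    "Z": ["login", "password", "reset", "account", "locked"],
}

result = score_new_topics(prior, current)
for name, (match, score) in result.items():
    print(name, match, round(score, 3), label(score))
```

Topics labelled "new" are the candidates for previously unseen issues. Synonym shifts would still show up as new or changed topics, so the scores are a starting point for review rather than a verdict.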

This is just me coming up with a potential solution. There may be another method that I am missing to identify new issues. Any ideas?