<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Text Mining small obs but large text in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132100#M9347</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Have you ever tried to run a cluster analysis on text topics?&amp;nbsp; I am trying to come up with a way to identify changes in topics over time and think this may be an approach to consider.&amp;nbsp; But I am having a hard time figuring out where to start. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have a weekly (could be daily or monthly) collection of documents that I run through SAS Text Miner, including the Text Topic node.&amp;nbsp; The end result is a number of multi-term topics identified by SAS Text Miner running unsupervised.&amp;nbsp; I would like to compare one week to another to see what is changing.&amp;nbsp; The Text Topic results show me the topics with the first 5 terms, but I know that there are additional terms in each topic.&amp;nbsp; Could you use these terms (the first 5 or all of them) in a cluster analysis to see how similar they are to the topics generated in the following week?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is there a more obvious approach that I am missing?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Tue, 20 Aug 2013 16:49:00 GMT</pubDate>
    <dc:creator>BradHaines</dc:creator>
    <dc:date>2013-08-20T16:49:00Z</dc:date>
    <item>
      <title>Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132095#M9342</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Does anyone have any tips on using Text Topics to discover themes for a small number of lengthy documents?&lt;/P&gt;&lt;P&gt;I have 25 documents.&amp;nbsp; Each has about 35 subsections across 4 main sections.&amp;nbsp; A subsection is a couple of paragraphs on average.&amp;nbsp; One of the documents is 205K characters. The smallest is 20K.&amp;nbsp; Most hover around 70K. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;My goal is to look for similarities across documents and/or sections and/or subsections.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Text Topic is likely the node of choice.&amp;nbsp; &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Has anyone had any experience using TM on a small data set like this?&amp;nbsp; Is the exercise futile?&amp;nbsp; Does anyone have any suggestions?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 11 Jul 2013 16:14:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132095#M9342</guid>
      <dc:creator>jaredp</dc:creator>
      <dc:date>2013-07-11T16:14:25Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132096#M9343</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I do not see any problem with the number of documents.&amp;nbsp; Depends on your objective. If you are trying to identify similarities between sub-sections, then write a short program to extract each sub-section and treat it as a single document. That will give you enough observations (documents): roughly 25*35 = 875.&lt;/P&gt;&lt;P&gt;If all your documents follow a standard template, then you can easily use Perl regular expressions to extract each sub-section and create a new data set with sub-sections as records. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
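      <!--
      The sub-section split suggested in this reply can be sketched as follows. This is a minimal
      illustration in Python rather than SAS, and it assumes a hypothetical template in which every
      sub-section starts on its own line with a numbered heading such as "1.2 Method". The heading
      pattern, the function name split_subsections, and the sample text are all assumptions; the
      real pattern depends on how the 25 documents are laid out.

```python
import re

# Hypothetical heading format: "<section>.<subsection> <title>" on its own line.
# Adjust this pattern to whatever template the real documents follow.
HEADING = re.compile(r"^(\d+\.\d+)[ \t]+(.+)$", re.MULTILINE)


def split_subsections(doc_id, text):
    """Return one record per sub-section: (doc_id, number, title, body)."""
    matches = list(HEADING.finditer(text))
    records = []
    for i, m in enumerate(matches):
        start = m.end()
        # The body runs until the next heading, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[start:end].strip()
        records.append((doc_id, m.group(1), m.group(2).strip(), body))
    return records


# Made-up sample document with three sub-sections.
sample = (
    "1.1 Scope\nFirst body.\n"
    "1.2 Method\nSecond body,\nmore text.\n"
    "2.1 Results\nThird body."
)
rows = split_subsections("doc01", sample)
for row in rows:
    print(row)
```

      Each tuple in rows can then be written out as one observation, so that 25 documents with about
      35 sub-sections each become several hundred short documents for the topic analysis.
      -->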
      <pubDate>Thu, 01 Aug 2013 17:34:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132096#M9343</guid>
      <dc:creator>sgarla</dc:creator>
      <dc:date>2013-08-01T17:34:16Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132097#M9344</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I appreciate the follow-up.&amp;nbsp; That's what I ended up doing: breaking things down by subsections.&amp;nbsp; I get much better results this way.&amp;nbsp; You hit the nail on the head with "Depends on your objective".&amp;nbsp; Once I stood back to look at the main objectives, it became much clearer how the data could be reshaped for analysis.&amp;nbsp; &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 02 Aug 2013 16:39:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132097#M9344</guid>
      <dc:creator>jaredp</dc:creator>
      <dc:date>2013-08-02T16:39:50Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132098#M9345</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Looks like you already came up with a way to reshape the data.&amp;nbsp; But do you really want to stop at topics, or do you actually want to run a cluster analysis on the topics?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 02 Aug 2013 17:03:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132098#M9345</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2013-08-02T17:03:20Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132099#M9346</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;For the time being we are focusing on topics. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Originally, I was unsure if the cluster analysis would be beneficial.&amp;nbsp; At that time my data was wide (35 vars, 25 obs).&amp;nbsp; But when I transposed the dataset to treat each document as a variable, I began thinking that clustering may reveal some common themes across the sections - this is one of the objectives of my analysis.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Truthfully, to answer your question, I'd have to say "I don't know".&amp;nbsp; &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 02 Aug 2013 20:02:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132099#M9346</guid>
      <dc:creator>jaredp</dc:creator>
      <dc:date>2013-08-02T20:02:40Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132100#M9347</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Have you ever tried to run a cluster analysis on text topics?&amp;nbsp; I am trying to come up with a way to identify changes in topics over time and think this may be an approach to consider.&amp;nbsp; But I am having a hard time figuring out where to start. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have a weekly (could be daily or monthly) collection of documents that I run through SAS Text Miner, including the Text Topic node.&amp;nbsp; The end result is a number of multi-term topics identified by SAS Text Miner running unsupervised.&amp;nbsp; I would like to compare one week to another to see what is changing.&amp;nbsp; The Text Topic results show me the topics with the first 5 terms, but I know that there are additional terms in each topic.&amp;nbsp; Could you use these terms (the first 5 or all of them) in a cluster analysis to see how similar they are to the topics generated in the following week?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is there a more obvious approach that I am missing?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 20 Aug 2013 16:49:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132100#M9347</guid>
      <dc:creator>BradHaines</dc:creator>
      <dc:date>2013-08-20T16:49:00Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132101#M9348</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hmmm... using the first 5 terms...&amp;nbsp; One question that comes to mind: what if there is a shift in the use of one term for another, but they are synonyms?&amp;nbsp; The approach might work with a growing synonym list, but then it is no longer unsupervised.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can run, in tandem, the Text Topic and Text Cluster nodes.&amp;nbsp; This will give you your Topics as well as generated SVD values.&amp;nbsp; I'm not an expert on Singular Value Decomposition (SVD), but I have a strong sense that if you want to measure changes in your corpus over time, then a solution might be to use the SVD values (i.e., TextCluster_SVD1, TextCluster_SVD2...TextCluster_SVDn).&amp;nbsp; &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This paper might have some similarities to what you want to do: &lt;A href="http://www.scsug.org/SCSUGProceedings/2009/Liang_Xie1.pdf" title="http://www.scsug.org/SCSUGProceedings/2009/Liang_Xie1.pdf"&gt;http://www.scsug.org/SCSUGProceedings/2009/Liang_Xie1.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;One can brush up on SVD here: &lt;A href="http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf" title="http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf"&gt;http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf&lt;/A&gt; &lt;/P&gt;&lt;P&gt;and some nice insight here too: &lt;A href="http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf" title="http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf"&gt;http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'd love it if you kept us informed about any solutions you apply.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 20 Aug 2013 17:33:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132101#M9348</guid>
      <dc:creator>jaredp</dc:creator>
      <dc:date>2013-08-20T17:33:13Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132102#M9349</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Doing cluster analysis on the text topics may not be a good idea if you want to understand the trend in topics. Cluster analysis would group your records, and that means solving a different problem.&lt;/P&gt;&lt;P&gt;If I understand your objective correctly, I think you can achieve it by creating user-defined topics rather than trying to track the system-generated topics.&lt;/P&gt;&lt;P&gt;First run the Text Topic node on your first set of comments (Day 1/Week 1/Month 1), then look at the multi-term topics. From this you will get a business sense of what topics are generated. If they do not make sense, modify them and create them as user-defined topics.&lt;/P&gt;&lt;P&gt;Say you end up with user-defined topics like:&lt;/P&gt;&lt;P&gt;Topic 1: +big data,&amp;nbsp; +data, database, high, performance&lt;/P&gt;&lt;P&gt;Topic 2: statistics, +models, visual, analytics, data&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Once you have your topics, for the remaining time periods you will have to define the same user-defined topics as in Day 1/Week 1/Month 1 (as above). &lt;/P&gt;&lt;P&gt;For each time period, just look at the frequency of documents for each topic. That should give you a sense of how the topics trend over time.&lt;/P&gt;&lt;P&gt;I think this is a decent approach to start with. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
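      <!--
      The last step in this reply, counting documents per topic in each time period, might look like
      the sketch below. It is written in Python rather than SAS and assumes the topic assignments
      have already been exported as (period, set of topic names) pairs, one pair per document. That
      input shape, the function name topic_frequency, and the sample records are assumptions made
      for illustration only.

```python
from collections import Counter

# Hypothetical export: one (period, assigned topics) pair per document.
# A document may belong to several topics or to none.
docs = [
    ("week1", {"Topic 1"}),
    ("week1", {"Topic 1", "Topic 2"}),
    ("week2", {"Topic 2"}),
    ("week2", {"Topic 2"}),
    ("week2", set()),
]


def topic_frequency(records):
    """Count how many documents fall under each topic in each period."""
    counts = Counter()
    for period, topics in records:
        for topic in topics:
            counts[(period, topic)] += 1
    return counts


freq = topic_frequency(docs)
for (period, topic), n in sorted(freq.items()):
    print(period, topic, n)
```

      Comparing the counts for the same topic across periods gives the trend described above. A
      Counter returns 0 for a (period, topic) pair that never occurred, which is convenient when a
      topic disappears in a later period.
      -->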
      <pubDate>Tue, 20 Aug 2013 17:53:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132102#M9349</guid>
      <dc:creator>sgarla</dc:creator>
      <dc:date>2013-08-20T17:53:11Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining small obs but large text</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132103#M9350</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;My primary objective is to find new issues in the data - things that we haven't seen before.&amp;nbsp; So, creating user topics won't work because that would only include things that I have seen before in the data. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I was hoping to use the terms identified in the Text Topics node for the comparison.&amp;nbsp; From one time period to the next you would likely see some topics that are exactly the same (5 terms all the same).&amp;nbsp; You would also see some that changed (2 or 3 terms the same) and then some that are totally new terms.&amp;nbsp; I would like to score each of the topics in the new time period based on how it compares to the prior months. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is just me coming up with a potential solution.&amp;nbsp; There may be another method that I am missing to identify new issues.&amp;nbsp; Any ideas?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
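      <!--
      One possible way to score each new topic against the prior period, as described in this post,
      is the Jaccard overlap of the topic term sets. This is only a sketch under stated assumptions:
      Jaccard similarity is a choice made here, not something SAS Text Miner reports, and the topic
      names and term lists below are invented examples. It also does nothing about the synonym
      problem raised earlier in the thread.

```python
def jaccard(a, b):
    """Share of terms two topics have in common (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_scores(prior_topics, new_topics):
    """For each new topic, return its highest Jaccard score against any prior topic."""
    scores = {}
    for name, terms in new_topics.items():
        scores[name] = max(
            (jaccard(terms, prior) for prior in prior_topics.values()),
            default=0.0,
        )
    return scores


# Hypothetical top-5 term lists for two consecutive periods.
week1 = {
    "T1": ["big data", "data", "database", "high", "performance"],
    "T2": ["statistics", "models", "visual", "analytics", "data"],
}
week2 = {
    "A": ["big data", "data", "database", "high", "performance"],  # unchanged
    "B": ["statistics", "models", "forecast", "error", "series"],  # partly changed
    "C": ["login", "password", "reset", "account", "lock"],        # entirely new
}

scores = best_match_scores(week1, week2)
for name, score in scores.items():
    print(name, round(score, 2))
```

      A score near 1 suggests a topic that carried over, a middling score a topic that shifted, and a
      score of 0 a candidate new issue. Where to draw the line between those cases is a judgment
      call that would need checking against real topics.
      -->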
      <pubDate>Tue, 20 Aug 2013 21:21:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Mining-small-obs-but-large-text/m-p/132103#M9350</guid>
      <dc:creator>BradHaines</dc:creator>
      <dc:date>2013-08-20T21:21:42Z</dc:date>
    </item>
  </channel>
</rss>

