<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: term-document frequency matrix from textmine is ignoring small documents with one or two words in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/term-document-frequency-matrix-from-textmine-is-ignoring-small/m-p/943140#M10920</link>
    <description>&lt;P&gt;On the PARSE statement, please specify:&lt;/P&gt;
&lt;UL class="lia-list-style-type-square"&gt;
&lt;LI&gt;&lt;STRONG&gt;NONOUNGROUPS&lt;/STRONG&gt; | NONG&amp;nbsp; :&amp;nbsp; Suppresses noun group extraction in parsing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;NOSTEMMING&lt;/STRONG&gt;&amp;nbsp; :&amp;nbsp; Suppresses stemming in parsing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;NOTAGGING&lt;/STRONG&gt;&amp;nbsp; :&amp;nbsp; Suppresses part-of-speech tagging in parsing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;SHOWDROPPEDTERMS&lt;/STRONG&gt;&amp;nbsp; :&amp;nbsp; Includes dropped terms in the OUTTERMS= data table&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Still the same problem?&amp;nbsp;&lt;BR /&gt;Are you sure the one or two words in the small documents (the ones missing from your output tables) are not in the stop list?&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;BR, Koen&lt;/P&gt;</description>
    <pubDate>Mon, 09 Sep 2024 13:43:55 GMT</pubDate>
    <dc:creator>sbxkoenk</dc:creator>
    <dc:date>2024-09-09T13:43:55Z</dc:date>
    <item>
      <title>term-document frequency matrix from textmine is ignoring small documents with one or two words</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/term-document-frequency-matrix-from-textmine-is-ignoring-small/m-p/942419#M10918</link>
      <description>&lt;P&gt;I have used textmine to find the frequencies of various words for each of a set of documents. Each document is a free text field that consists of one or more words; this free text field corresponds to one of the columns. There is also an Index field that indicates the document number, along with various other columns.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I used the following code to output the term-document matrix:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="William29_0-1725423083486.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/99945i087A99B9CC09CC28/image-size/medium?v=v2&amp;amp;px=400" role="button" title="William29_0-1725423083486.png" alt="William29_0-1725423083486.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;When looking at the term-document matrix (the table named in outchild), I noticed that if I search the 'Document' column for the 'Index' value of a document with only one (and sometimes two) words, it often cannot be found at all. It appears that textmine is not even processing these documents.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is this a known feature of textmine (is it supposed to be doing this)?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there a simple option within textmine to stop it from doing this (as opposed to editing the documents and adding a lot of place-filler words to increase the document length)?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am wondering whether this happens because, with very few words, the algorithm cannot identify whether the words are verbs or nouns (which is something it does), and so it ignores the document altogether?&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2024 04:20:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/term-document-frequency-matrix-from-textmine-is-ignoring-small/m-p/942419#M10918</guid>
      <dc:creator>William29</dc:creator>
      <dc:date>2024-09-04T04:20:28Z</dc:date>
    </item>
    <item>
      <title>Re: term-document frequency matrix from textmine is ignoring small documents with one or two words</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/term-document-frequency-matrix-from-textmine-is-ignoring-small/m-p/943140#M10920</link>
      <description>&lt;P&gt;On the PARSE statement, please specify:&lt;/P&gt;
&lt;UL class="lia-list-style-type-square"&gt;
&lt;LI&gt;&lt;STRONG&gt;NONOUNGROUPS&lt;/STRONG&gt; | NONG&amp;nbsp; :&amp;nbsp; Suppresses noun group extraction in parsing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;NOSTEMMING&lt;/STRONG&gt;&amp;nbsp; :&amp;nbsp; Suppresses stemming in parsing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;NOTAGGING&lt;/STRONG&gt;&amp;nbsp; :&amp;nbsp; Suppresses part-of-speech tagging in parsing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;SHOWDROPPEDTERMS&lt;/STRONG&gt;&amp;nbsp; :&amp;nbsp; Includes dropped terms in the OUTTERMS= data table&lt;/LI&gt;
&lt;/UL&gt;
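&lt;P&gt;As a minimal sketch, the options above could be combined on one PARSE statement like this. The CAS table names (mycas.docs, mycas.terms, mycas.parent, mycas.child) and the text variable name text_field are placeholders, not taken from the original post; Index is the document ID column described in the question.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Placeholder names: mycas.docs, text_field, and the output tables are assumptions */
proc textmine data=mycas.docs;
   doc_id Index;             /* document identifier column from the question */
   variables text_field;     /* free-text column (assumed name) */
   parse
      nonoungroups           /* suppress noun group extraction */
      nostemming             /* suppress stemming */
      notagging              /* suppress part-of-speech tagging */
      showdroppedterms       /* include dropped terms in the OUTTERMS= table */
      outterms=mycas.terms
      outparent=mycas.parent
      outchild=mycas.child;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With SHOWDROPPEDTERMS, the OUTTERMS= table should then show whether the words from the short documents were parsed but dropped.&lt;/P&gt;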
&lt;P&gt;Still the same problem?&amp;nbsp;&lt;BR /&gt;Are you sure the one or two words in the small documents (the ones missing from your output tables) are not in the stop list?&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;BR, Koen&lt;/P&gt;</description>
      <pubDate>Mon, 09 Sep 2024 13:43:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/term-document-frequency-matrix-from-textmine-is-ignoring-small/m-p/943140#M10920</guid>
      <dc:creator>sbxkoenk</dc:creator>
      <dc:date>2024-09-09T13:43:55Z</dc:date>
    </item>
  </channel>
</rss>

