<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How can I tokenize a document using Base SAS? in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344441#M9746</link>
    <description>&lt;P&gt;I think the Brown one is considered the best, but I don't think it's free. If you find a free, open source copy, post the link &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 26 Mar 2017 19:16:45 GMT</pubDate>
    <dc:creator>Reeza</dc:creator>
    <dc:date>2017-03-26T19:16:45Z</dc:date>
    <item>
      <title>How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344294#M9735</link>
      <description>&lt;P&gt;Greetings,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Tokenization is a way to split text into tokens. I would like to do 2&amp;nbsp;things:&lt;/P&gt;&lt;P&gt;1.1 &lt;FONT color="#008000"&gt;Tokenize an entire document&lt;/FONT&gt;&amp;nbsp;(the tokens in my case are words and not phrases or letters).&lt;/P&gt;&lt;P&gt;1.2 &lt;FONT color="#008000"&gt;Remove Stop Words&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2. I saw Cynthia's solution for counting the frequency of words in a document: &lt;A href="https://communities.sas.com/t5/SAS-Procedures/Frequency-of-Strings/td-p/41378" target="_blank"&gt;https://communities.sas.com/t5/SAS-Procedures/Frequency-of-Strings/td-p/41378&lt;/A&gt;. &lt;FONT color="#FF0000"&gt;However&lt;/FONT&gt;, I would like to create tokens, not just count the frequency of words.&lt;/P&gt;&lt;P&gt;3. I don't have SAS Text Miner or SAS Contextual Analysis for this. I would need to use Base SAS for this task.&lt;/P&gt;&lt;P&gt;4. Any code examples or ideas would help.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;D&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 13:30:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344294#M9735</guid>
      <dc:creator>DanielDor</dc:creator>
      <dc:date>2017-03-25T13:30:56Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344321#M9736</link>
      <description>&lt;P&gt;FWIW: There are a number of R packages that do what you want. See, e.g.:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://cran.r-project.org/web/packages/NLP/NLP.pdf" target="_blank"&gt;https://cran.r-project.org/web/packages/NLP/NLP.pdf&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.rdocumentation.org/packages/koRpus/versions/0.06-5/topics/tokenize" target="_blank"&gt;https://www.rdocumentation.org/packages/koRpus/versions/0.06-5/topics/tokenize&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;and&lt;/P&gt;
&lt;P&gt;&lt;A href="http://www.mjdenny.com/Text_Processing_In_R.html" target="_blank"&gt;http://www.mjdenny.com/Text_Processing_In_R.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Art, CEO, AnalystFinder.com&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 18:47:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344321#M9736</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2017-03-25T18:47:39Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344326#M9737</link>
      <description>&lt;P&gt;Thanks a lot for the info! But I would like to do this in SAS &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 19:12:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344326#M9737</guid>
      <dc:creator>DanielDor</dc:creator>
      <dc:date>2017-03-25T19:12:58Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344327#M9738</link>
      <description>&lt;P&gt;There is a procedure, no longer documented, that can identify all of the words in a text and produce frequency distributions of them, which would at least provide a Base SAS starting point.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I was going to mention it in my last post, but it no longer worked in SAS 9.4, at least on SAS University Edition.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, I just received a note from someone that it was still working in SAS 9.3.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My only suggestion would be to try it. It's called PROC SPELL. Here is an example of how it can be run:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;options caps;
filename temp temp;
data _null_;
  file temp;
  informat sentence $100.;
  input sentence &amp;amp;;
  put sentence;
  cards;
Let's see if sas spell procdure can be used
to verify whether tha seperate words in
this, uhm, flie are, uhm, valid against a
stantard internal dictionary and let’s see
how versatile it is
;

proc spell in=temp nomaster verify;
run;
&lt;/PRE&gt;
&lt;P&gt;HTH,&lt;/P&gt;
&lt;P&gt;Art, CEO, AnalystFinder.com&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 19:22:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344327#M9738</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2017-03-25T19:22:01Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344329#M9739</link>
      <description>&lt;P&gt;p.s. Just got confirmation from some of my colleagues that PROC SPELL does indeed work on "regular" versions of SAS 9.3 and 9.4.&lt;/P&gt;
&lt;P&gt;For some reason it just isn't working on SAS University Edition (UE).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Art, CEO, AnalystFinder.com&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 19:44:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344329#M9739</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2017-03-25T19:44:32Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344331#M9740</link>
      <description>&lt;P&gt;Here's one way to split a document into words, assuming you've already imported it into SAS.&lt;/P&gt;
&lt;P&gt;You can find corpora online that include parts of speech or sentiment and then use those to help classify the words. As&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13711"&gt;@art297&lt;/a&gt;&amp;nbsp;has indicated, there's an old, unsupported proc that can help with this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I store the code here. It includes a bit more than what's below, and it may be useful if you have a corpus read in. Hope this helps somewhat.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://github.com/statgeek/SAS-Tutorials/blob/master/text_analysis" target="_blank"&gt;https://github.com/statgeek/SAS-Tutorials/blob/master/text_analysis&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;*Create sample data;
data random_sentences;
    infile cards truncover;
    informat sentence $256.;
    input sentence $256.;
    cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;

*Partition into words;
data f1;
    set random_sentences;
    id=_n_;
    nwords=countw(sentence);
    nchar=length(compress(sentence));

    do word_order=1 to nwords;
        word=scan(sentence, word_order);
        output;
    end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
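&lt;P&gt;For the stop word part of the question, one possible next step is to filter f1 against a stop list. This is only a sketch: the stop_words dataset and its handful of entries are placeholders made up for illustration, not a real stop list, and f1_nostop is just a name I picked for the result.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;*Hypothetical stop list, a real one would be much longer;
data stop_words;
    length word $256;
    input word $;
    cards;
this
is
a
an
are
;

*Keep only the tokens that are not in the stop list, comparing in lower case;
proc sql;
    create table f1_nostop as
    select *
    from f1
    where lowcase(word) not in (select lowcase(word) from stop_words)
    order by id, word_order;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Lower-casing both sides means "This" and "this" are treated as the same token. Whether that is what you want depends on your text.&lt;/P&gt;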
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 20:21:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344331#M9740</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-03-25T20:21:53Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344332#M9741</link>
      <description>Hi Art,&lt;BR /&gt;&lt;BR /&gt;Thanks again for the assistance and the fast replies.&lt;BR /&gt;&lt;BR /&gt;I've tried PROC SPELL on SAS 9.4 and it works. However, it writes the results to a report. Do you happen to know whether there is a way to write the results to a dataset instead?&lt;BR /&gt;&lt;BR /&gt;Thanks a lot!&lt;BR /&gt;&lt;BR /&gt;D</description>
      <pubDate>Sat, 25 Mar 2017 20:27:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344332#M9741</guid>
      <dc:creator>DanielDor</dc:creator>
      <dc:date>2017-03-25T20:27:22Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344335#M9742</link>
      <description>Hi Reeza,&lt;BR /&gt;&lt;BR /&gt;Thanks for your reply! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; In your opinion, what would be the best way to take the F1 dataset and tokenize or stem it? For example, I need the words "goes"/"going"/"go"/"went" to fall under the same concept ("go").&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;BR /&gt;&lt;BR /&gt;D</description>
      <pubDate>Sat, 25 Mar 2017 20:30:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344335#M9742</guid>
      <dc:creator>DanielDor</dc:creator>
      <dc:date>2017-03-25T20:30:43Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344336#M9743</link>
      <description>&lt;P&gt;You need a mapping document, like the 'corpus' I mentioned, that maps each word form to the 'root' of the word.&lt;/P&gt;
&lt;P&gt;Once you have those documents set up, you can merge them with your data and assign the word forms to the same group.&lt;/P&gt;
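&lt;P&gt;A rough sketch of that merge, assuming the f1 dataset from the tokenizing code earlier in the thread. The lemma_map table and its four rows are made up for illustration, since a real mapping would come from whatever corpus you end up using, and f1_roots is just a name for the result.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;*Hypothetical mapping table, a real one would come from a corpus or lexicon;
data lemma_map;
    length word root $32;
    input word $ root $;
    cards;
goes go
going go
went go
go go
;

*Attach the root to each token and fall back to the lower-cased token when there is no match;
proc sql;
    create table f1_roots as
    select a.*, coalesce(b.root, lowcase(a.word)) as root
    from f1 as a
    left join lemma_map as b
        on lowcase(a.word) = b.word
    order by a.id, a.word_order;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The left join keeps every token, so words missing from the mapping are still there and can be reviewed later.&lt;/P&gt;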
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Some of these mapping documents are open source but many are not.&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 20:33:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344336#M9743</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-03-25T20:33:31Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344341#M9744</link>
      <description>&lt;P&gt;I don't think SAS would appreciate my posting it here, but I happen to have a copy of the SPELL procedure's documentation.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'd be glad to answer any of your questions off-line. Send me a note to: art@analystfinder.com&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There isn't an output option, but the documentation says to just save the output and use it as input back to SAS or any text editor. Of course, these days you can accomplish that using proc printto.&lt;/P&gt;
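&lt;P&gt;A minimal sketch of the proc printto route, assuming the temp fileref from the PROC SPELL example earlier in the thread is still assigned and that your session sends procedure output to the listing destination. The spellout fileref and the spell_report dataset are names made up for this example. The report layout isn't reproduced here, so each line is read back as raw text that you would still need to parse.&lt;/P&gt;
&lt;PRE&gt;*Route procedure output to a temporary file instead of the listing;
filename spellout temp;
proc printto print=spellout new;
run;

proc spell in=temp nomaster verify;
run;

*Restore the default output destination;
proc printto;
run;

*Read the captured report back in, one raw line per observation;
data spell_report;
    infile spellout truncover;
    input line $char200.;
run;
&lt;/PRE&gt;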
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can also create your own dictionaries, so you could build a dictionary of stop words, tokens, or whatever else you need.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Art, CEO, AnalystFinder.com&lt;/P&gt;</description>
      <pubDate>Sat, 25 Mar 2017 21:18:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344341#M9744</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2017-03-25T21:18:15Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344391#M9745</link>
      <description>Hi Reeza,&lt;BR /&gt;&lt;BR /&gt;Thanks for the information. Do you know of a good, effective corpus that I can use for this tokenization?&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;BR /&gt;&lt;BR /&gt;D</description>
      <pubDate>Sun, 26 Mar 2017 07:17:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344391#M9745</guid>
      <dc:creator>DanielDor</dc:creator>
      <dc:date>2017-03-26T07:17:08Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344441#M9746</link>
      <description>&lt;P&gt;I think the Brown one is considered the best, but I don't think it's free. If you find a free, open source copy, post the link &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 26 Mar 2017 19:16:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344441#M9746</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-03-26T19:16:45Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344985#M9747</link>
      <description>You may try PROC HPTMINE.</description>
      <pubDate>Tue, 28 Mar 2017 14:07:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/344985#M9747</guid>
      <dc:creator>EricwenLiu</dc:creator>
      <dc:date>2017-03-28T14:07:12Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/648750#M9748</link>
      <description>&lt;P&gt;Is there any solution to your problem?&lt;/P&gt;</description>
      <pubDate>Tue, 19 May 2020 03:51:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/648750#M9748</guid>
      <dc:creator>Rajaram</dc:creator>
      <dc:date>2020-05-19T03:51:46Z</dc:date>
    </item>
    <item>
      <title>Re: How can I tokenize a document using Base SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/648751#M9749</link>
      <description>PROC HPTMINE is not part of Base SAS</description>
      <pubDate>Tue, 19 May 2020 03:52:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-can-I-tokenize-a-document-using-Base-SAS/m-p/648751#M9749</guid>
      <dc:creator>Rajaram</dc:creator>
      <dc:date>2020-05-19T03:52:58Z</dc:date>
    </item>
  </channel>
</rss>

