<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Text Mining in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911476#M359414</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16723"&gt;@sss&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;HI Harry,&lt;BR /&gt;&lt;BR /&gt;I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.&lt;BR /&gt;&lt;BR /&gt;from the first 1st row, i highlighted the values i'm looking for. &lt;BR /&gt;For ex:&lt;BR /&gt;if the input data is &lt;BR /&gt;ffreg45646&amp;nbsp;%345^$45e 0005353532342 ertehf%43432- 435-- erte0034344&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;the expected output is &lt;BR /&gt;45646 ; 43432; 34344&lt;BR /&gt;&lt;BR /&gt;in this case the delimiter is space&lt;BR /&gt;&lt;BR /&gt;... stuff deleted ...&lt;BR /&gt;&lt;BR /&gt;for 5th delimited value 435-- erte0034344&amp;nbsp;, i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Actually, given that your text has a blank (one of the 4 delimiters you specified) prior to the "ert",&amp;nbsp; the 5th delimited value is just "435--", and you have a sixth delimited value: "erte0034344", which qualifies "34344" for the result variable since there are 5 contiguous digits after removal of leading zeroes.&lt;/P&gt;</description>
    <pubDate>Sat, 13 Jan 2024 16:34:17 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2024-01-13T16:34:17Z</dc:date>
    <item>
      <title>Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/902024#M356436</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hi All,&lt;/P&gt;
&lt;P&gt;I'm working on a text mining project that involves millions of reports.&amp;nbsp; My problem statement is a little unusual.&amp;nbsp; I have an alphanumeric column that can contain all kinds of values.&amp;nbsp; From the given data I would like to select only the numbers that have exactly 5 digits. Below are the criteria:&lt;/P&gt;
&lt;P&gt;1. Values are separated by comma, colon, semicolon, or space.&lt;/P&gt;
&lt;P&gt;2. The number I want to fetch might have text before and after it.&lt;/P&gt;
&lt;P&gt;3. I want to select the values that are highlighted in bold.&lt;/P&gt;
&lt;P&gt;4. If a number has leading zeros (0) followed by exactly 5 digits, those 5 digits should be selected; an example is given in the 1st row.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I appreciate your help&amp;nbsp;&lt;/P&gt;
&lt;TABLE width="584px"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="415.828px" height="30px"&gt;5 digit column&lt;/TD&gt;
&lt;TD width="167.172px" height="30px"&gt;output&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="415.828px" height="57px"&gt;ffreg&lt;STRONG&gt;45646&lt;/STRONG&gt; %345^$45e 0005353532342 ertehf%&lt;STRONG&gt;43432&lt;/STRONG&gt;- 435-- erte00&lt;STRONG&gt;34344&amp;nbsp;&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD width="167.172px" height="57px"&gt;&amp;nbsp;45646;43432;34344&amp;nbsp;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="415.828px" height="57px"&gt;&amp;nbsp;4235354 sfd4t5345345 erteyye 45634# ye&lt;STRONG&gt;54545&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD width="167.172px" height="57px"&gt;54545&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="415.828px" height="57px"&gt;00055 0545554 54454545 0000000&lt;STRONG&gt;52345&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD width="167.172px" height="57px"&gt;52345&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="415.828px" height="30px"&gt;wer&lt;STRONG&gt;43423&lt;/STRONG&gt;, 445wsfsf4535, 4344%455&lt;/TD&gt;
&lt;TD width="167.172px" height="30px"&gt;43423&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2023 10:57:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/902024#M356436</guid>
      <dc:creator>sss</dc:creator>
      <dc:date>2023-11-08T10:57:28Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/902944#M356841</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16723"&gt;@sss&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I understood your question correctly, you want to parse these 5-digit numbers and then collect them into a comma-separated list?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If that's the case then PRXPARSE and PRXNEXT should do the trick.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This code:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* imported from .csv file */
data have;
set import;
keep text;
run;

/* print data */
title 'Original Data';
proc print data=have;run;

/* parse data using prxparse and prxnext */
data want;
set have;
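length out $5 found $200; /* FOUND must be long enough for every match; CATX returns a blank if the result does not fit */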
re = prxparse("/(\d\d\d\d\d)/");
start = 1;
stop = length(text);
call prxnext(re,start,stop,text,position,length);
do while (position &amp;gt; 0);
out = substr(text,position,length);
found = catx(', ',found,out);
put out;
call prxnext(re,start,stop,text,position,length);

end;
keep text found;
run;

/* print output */
title 'Parsed Data';
proc print data=want;run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Produces this output:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="HarrySnart_0-1699958689540.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/89729i7E774F5A34957C1B/image-size/medium?v=v2&amp;amp;px=400" role="button" title="HarrySnart_0-1699958689540.png" alt="HarrySnart_0-1699958689540.png" /&gt;&lt;/span&gt;&lt;/P&gt;
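&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;One caveat: the pattern above also matches the first 5 digits of any longer run of digits. If a value should qualify only when the whole run of digits, ignoring leading zeros, is exactly 5 digits long, a variation along these lines might be closer. Treat it as an untested sketch: the look-behind/look-ahead pattern, the data set name WANT_EXACT, the $200 length and the ';' separator are my own assumptions.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* sketch: keep only digit runs that are exactly 5 digits after leading zeros */
data want_exact;
set have;
length out $5 found $200;
/* optional leading zeros, then 5 digits starting with 1-9, no digit on either side */
re = prxparse("/(?&amp;lt;!\d)0*[1-9]\d{4}(?!\d)/");
start = 1;
stop = length(text);
call prxnext(re,start,stop,text,position,length);
do while (position gt 0);
  /* the last 5 characters of the match are the digits that follow any leading zeros */
  out = substr(text,position+length-5,5);
  found = catx(';',found,out);
  call prxnext(re,start,stop,text,position,length);
end;
keep text found;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With this pattern a value such as 0005353532342 should be skipped entirely rather than contributing pieces of itself, because the run is still longer than 5 digits once the leading zeros are ignored.&lt;/P&gt;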
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope that helps&lt;/P&gt;
&lt;P&gt;Harry&lt;/P&gt;</description>
      <pubDate>Tue, 14 Nov 2023 10:45:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/902944#M356841</guid>
      <dc:creator>HarrySnart</dc:creator>
      <dc:date>2023-11-14T10:45:55Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/903144#M356889</link>
      <description>Hi Harry,&lt;BR /&gt;&lt;BR /&gt;I appreciate your help. Your code works great; however, my requirement is different from your output. Let me try to simplify it.&lt;BR /&gt;&lt;BR /&gt;In the 1st row, I highlighted the values I'm looking for.&lt;BR /&gt;For example, if the input data is&lt;BR /&gt;ffreg45646&amp;nbsp;%345^$45e 0005353532342 ertehf%43432- 435-- erte0034344&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;the expected output is&lt;BR /&gt;45646 ; 43432; 34344&lt;BR /&gt;&lt;BR /&gt;In this case the delimiter is a space.&lt;BR /&gt;&lt;BR /&gt;From the 1st delimited value, ffreg45646, I want the value 45646, because it has 5 contiguous digits.&lt;BR /&gt;&lt;BR /&gt;The 2nd delimited value, %345^$45e, has special characters between its digits, leaving runs of only 3 and 2 digits, so it is not a valid value for me. A valid value must have 5 contiguous digits.&lt;BR /&gt;&lt;BR /&gt;For the 3rd delimited value, 0005353532342, we need to ignore all leading zeros (0), and the remaining length should be exactly 5; however, it has length 10. This does not meet my requirement, so it should be excluded from the output.&lt;BR /&gt;&lt;BR /&gt;In the 4th delimited value, ertehf%43432, the letters and the special character should both be trimmed; the leftover digits number exactly 5, which meets the requirement, so it should be part of the output.&lt;BR /&gt;&lt;BR /&gt;For the 5th delimited value, 435-- erte0034344&amp;nbsp;, I would like to extract only the last 5 digits, because the first number is shorter than 5 digits, and the special characters and letters should be trimmed.</description>
      <pubDate>Wed, 15 Nov 2023 08:57:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/903144#M356889</guid>
      <dc:creator>sss</dc:creator>
      <dc:date>2023-11-15T08:57:15Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/903145#M356890</link>
      <description>&lt;P&gt;Hi Harry, I appreciate your help. Your code works great; however, my requirement is different from your output. Let me try to simplify it.&lt;/P&gt;
&lt;P&gt;In the 1st row, I highlighted the values I'm looking for.&lt;/P&gt;
&lt;P&gt;For example, if the input data is ffreg&lt;STRONG&gt;45646&lt;/STRONG&gt;&amp;nbsp;%345^$45e 0005353532342 ertehf%&lt;STRONG&gt;43432&lt;/STRONG&gt;- 435-- erte00&lt;STRONG&gt;34344&amp;nbsp;&lt;/STRONG&gt; the expected output is &lt;STRONG&gt;45646 ; 43432; 34344&lt;/STRONG&gt;. In this case the delimiter is a space.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;From the 1st delimited value, ffreg45646, I want the value 45646, because it has 5 contiguous digits.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The 2nd delimited value, %345^$45e, has special characters between its digits, leaving runs of only 3 and 2 digits, so it is not a valid value for me. A valid value must have 5 contiguous digits.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For the 3rd delimited value, 0005353532342, we need to ignore all leading zeros (0), and the remaining length should be exactly 5; however, it has length 10. This does not meet my requirement, so it should be excluded from the output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the 4th delimited value, ertehf%43432, the letters and the special character should both be trimmed; the leftover digits number exactly 5, which meets the requirement, so it should be part of the output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For the 5th delimited value, 435-- erte0034344&amp;nbsp;, I would like to extract only the last 5 digits, because the first number is shorter than 5 digits, and the special characters and letters should be trimmed.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Nov 2023 08:59:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/903145#M356890</guid>
      <dc:creator>sss</dc:creator>
      <dc:date>2023-11-15T08:59:25Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911363#M359386</link>
      <description>&lt;P&gt;Any suggestions, mates? &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jan 2024 10:29:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911363#M359386</guid>
      <dc:creator>sss</dc:creator>
      <dc:date>2024-01-12T10:29:32Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911364#M359387</link>
      <description>&lt;P&gt;Appreciate your help&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jan 2024 10:31:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911364#M359387</guid>
      <dc:creator>sss</dc:creator>
      <dc:date>2024-01-12T10:31:46Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911459#M359406</link>
      <description>&lt;P&gt;Your second row has txt="&lt;SPAN&gt;&amp;nbsp;4235354 sfd4t5345345 erteyye 45634# ye54545&lt;/SPAN&gt;&lt;SPAN&gt;"&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Why is your expected result not&amp;nbsp; 45634;54545?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Assuming the second row result should be 45634;54545, just loop over the line segments:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  input txt $70.  @71 expected_result $70. ;
datalines4;
ffreg45646 %345^$45e 0005353532342 ertehf%43432- 435-- erte0034344    45646;43432;34344 
 4235354 sfd4t5345345 erteyye 45634# ye54545                          54545
00055 0545554 54454545 000000052345                                   52345
wer43423, 445wsfsf4535, 4344%455                                      43423
;;;;
run;

data want (drop=_:);
  set have ;
  length result $70;

  *text segments are separated by comma,colon, semi-colon, space ;
  do _s=1 to countw(txt,',:; ');
    _segment=scan(txt,_s,',:; ');            
    _d=indexc(_segment,'123456789');          /*Leftmost digit, except 0 */
    if _d=0 then continue;
    _segment=substr(_segment,_d);             /*Left justify to leading non-zero digit*/

    /*If leftmost non-digit is position 6, then update RESULT*/
    if notdigit(_segment)=6 then result= catx(';',result,substr(_segment,1,5));
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
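&lt;P&gt;&lt;SPAN&gt;Since HAVE also carries EXPECTED_RESULT, a quick way to see where the two disagree is to keep only the non-matching rows. This is a minimal sketch; the data set name MISMATCH is just a placeholder of mine:&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* hypothetical check: keep rows where RESULT differs from EXPECTED_RESULT */
data mismatch;
  set want;
  if result ne expected_result;
run;

title 'Rows where RESULT differs from EXPECTED_RESULT';
proc print data=mismatch;
run;
title;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;With the four sample rows I would expect only the second row to be listed, since the parsing step picks up 45634 there as well as 54545.&lt;/SPAN&gt;&lt;/P&gt;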
&lt;P&gt;&lt;SPAN&gt;The DATA WANT step above assumes that each segment (i.e. text delimited by blank, colon, semi-colon, or comma) has no more than one candidate for inclusion in the result, and that the first non-zero digit is the start of that candidate.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 13 Jan 2024 16:41:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911459#M359406</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2024-01-13T16:41:20Z</dc:date>
    </item>
    <item>
      <title>Re: Text Mining</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911476#M359414</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16723"&gt;@sss&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;HI Harry,&lt;BR /&gt;&lt;BR /&gt;I appreciate your help, your code works great however my requirement is different from your output. Let me try to simplify it.&lt;BR /&gt;&lt;BR /&gt;from the first 1st row, i highlighted the values i'm looking for. &lt;BR /&gt;For ex:&lt;BR /&gt;if the input data is &lt;BR /&gt;ffreg45646&amp;nbsp;%345^$45e 0005353532342 ertehf%43432- 435-- erte0034344&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;the expected output is &lt;BR /&gt;45646 ; 43432; 34344&lt;BR /&gt;&lt;BR /&gt;in this case the delimiter is space&lt;BR /&gt;&lt;BR /&gt;... stuff deleted ...&lt;BR /&gt;&lt;BR /&gt;for 5th delimited value 435-- erte0034344&amp;nbsp;, i would like to extract only last 5 digit because it had number less then 5 length and specialchar and char should be trimmed&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Actually, given that your text has a blank (one of the 4 delimiters you specified) prior to the "ert",&amp;nbsp; the 5th delimited value is just "435--", and you have a sixth delimited value: "erte0034344", which qualifies "34344" for the result variable since there are 5 contiguous digits after removal of leading zeroes.&lt;/P&gt;</description>
      <pubDate>Sat, 13 Jan 2024 16:34:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Text-Mining/m-p/911476#M359414</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2024-01-13T16:34:17Z</dc:date>
    </item>
  </channel>
</rss>

