<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: data extraction from text data in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681711#M206235</link>
    <description>Hi Again,&lt;BR /&gt;&lt;BR /&gt;I have 280 comments in column. Is there a way to not include them in the data step?&lt;BR /&gt;Is it possible to use the data, set, run commands in place of data, infile, input cards?&lt;BR /&gt;Also, the length of the comments column is 2522. Lastly, Is giffen gaffen guffen your comment:) Pretty innovative, I must say!</description>
    <pubDate>Fri, 04 Sep 2020 17:52:57 GMT</pubDate>
    <dc:creator>shahm</dc:creator>
    <dc:date>2020-09-04T17:52:57Z</dc:date>
    <item>
      <title>data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681635#M206193</link>
      <description>&lt;P&gt;Good Morning,&lt;/P&gt;&lt;P&gt;I have a column with text data (comments) and I would like to extract information from it. e.g. "Several documents for shipments of UN1791 missing date of shipment". I would like to extract UN1791. Also, the comments can have a space eg. UN 1791 and can have more than one UN numbers mentioned in a single comment eg. "please correct the following: UN 1202, UN 1992, UN 2794"&lt;/P&gt;&lt;P&gt;I need to extract only UN numbers from the comments.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;my sample code is basic (I am a new to&amp;nbsp; SAS less than 6 months):&lt;/P&gt;&lt;P&gt;data UNmatches;&lt;BR /&gt;set work.un;&lt;BR /&gt;UN_matches=index(violation_comments_txt, 'UN');&lt;BR /&gt;run;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I understand that this code gives me the position where UN appears in the comments.&lt;/P&gt;&lt;P&gt;But I need to get the numbers.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help is greatly appreciated.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Maitri&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 04 Sep 2020 13:53:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681635#M206193</guid>
      <dc:creator>shahm</dc:creator>
      <dc:date>2020-09-04T13:53:19Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681663#M206205</link>
      <description>&lt;P&gt;One possibility is to use Pearl regular expressions (PRX):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  infile cards truncover;
  input str $100.;
cards;
giffen gaffen guffen UN 3042 pilaffen dyp UN4444 tititi UN 1111
Several documents for shipments of UN1791 missing date of shipment
;run;

data want;
  set have;
  prxid=prxparse('/\bUN\s*\d{4}\b/');
  start=1;
  stop=length(str);
  length UN_no $6;
  do while(1);
    call prxnext(prxid,start,stop,str,pos,len);
    if pos=0 then leave;
    UN_no=compress(substr(str,pos,len));
    output;
    end;
  drop start stop pos len prxid;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;An explanation of the PRX expression (the parameter to PRXPARSE):&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;\b is a word boundary&lt;/LI&gt;
&lt;LI&gt;\s* is zero or more whitespace characters (e.g. blanks)&lt;/LI&gt;
&lt;LI&gt;\d{4} is 4 digits&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 04 Sep 2020 15:13:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681663#M206205</guid>
      <dc:creator>s_lassen</dc:creator>
      <dc:date>2020-09-04T15:13:07Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681665#M206206</link>
      <description>Thank you so much for your reply! I will read up on PRX, apply the provided solution and inform you how it goes.</description>
      <pubDate>Fri, 04 Sep 2020 15:18:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681665#M206206</guid>
      <dc:creator>shahm</dc:creator>
      <dc:date>2020-09-04T15:18:22Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681711#M206235</link>
      <description>Hi Again,&lt;BR /&gt;&lt;BR /&gt;I have 280 comments in column. Is there a way to not include them in the data step?&lt;BR /&gt;Is it possible to use the data, set, run commands in place of data, infile, input cards?&lt;BR /&gt;Also, the length of the comments column is 2522. Lastly, Is giffen gaffen guffen your comment:) Pretty innovative, I must say!</description>
      <pubDate>Fri, 04 Sep 2020 17:52:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681711#M206235</guid>
      <dc:creator>shahm</dc:creator>
      <dc:date>2020-09-04T17:52:57Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681767#M206264</link>
      <description>&lt;P&gt;Try next tested code&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want(keep=comment UN_match);
 set have;
     length UN_match $7;
     text = upcase(compbl(comment));
	 pos = index(text,'UN');
	 if pos = 0 then delete; else
	 if pos &amp;gt; 1 then do; 
	    text = substr(text,pos); 
		pos=1; 
	 end;
	 do until (pos=0);
	    if input(substr(text,pos+2,5),?? best5.) ne . then do;
		    if substr(text,pos+2,1)=' ' 
			   then unlen=7;
			   else unlen=6;
			UN_match = substr(text,pos,unlen);
			output;
			text = substr(text,pos+unlen);
		 end;
		 else text = substr(text,pos+3); * skip UNx *;
		 if length(strip(text)) &amp;lt; 6 then leave;
	     pos = index(text,'UN');
	 end;
run;
		
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;It will be useful if you provide more sample data in the first step.&lt;/P&gt;
&lt;P&gt;The code need be adapted to maximum length of the comment.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you need the comment in the output?&lt;/P&gt;
&lt;P&gt;You may prefer sort the result by UN_match with NODUPKEY&amp;nbsp; option.&lt;/P&gt;
&lt;P&gt;The code will skip any word starting with 'UN' and not followed by numbers (like UNxyzsd / UN abcd )&lt;/P&gt;</description>
      <pubDate>Fri, 04 Sep 2020 22:23:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681767#M206264</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-09-04T22:23:35Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681830#M206301</link>
      <description>&lt;P&gt;Based on&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/76464"&gt;@s_lassen&lt;/a&gt;'s code, using PRX function, you may prefer next code with its output result:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want1;
 set have;
     length UN_match $50 txt1 $6;
     regexid = prxparse('/\bUN\s*\d{4}\b/');
     pos1=1; 
     posend = length(comment);
     UN_match = ' ';
     do while (1);
        call prxnext(regexid,pos1,posend,comment,pos,lent);
        if pos=0 then leave;
        txt1 = compress(substr(comment,pos,lent));
        *put pos= txt1=;
	    UN_match = catx(' ',UN_match, txt1);
	 end;
	 *putlog UN_match= comment=;
	 keep UN_match comment;
run;	 
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If you don't need the comment in the output remove it from the keep statement.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 05 Sep 2020 19:23:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681830#M206301</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-09-05T19:23:33Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681833#M206304</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN&gt;I have 280 comments in column. Is there a way to not include them in the data step?&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN&gt;Are you just talking about the data step in the answer(s) that is being used to create an example source dataset?&amp;nbsp; If so then skip that step and pull in your existing dataset with a SET statement instead.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If instead you mean that you want to exclude observations being read in from your existing dataset please explain the selection logic.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 05 Sep 2020 20:40:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681833#M206304</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2020-09-05T20:40:47Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681835#M206306</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think you can take advantage of the fact that, say,&amp;nbsp; compress("UN 1234") =compress("UN1234 "), so you can always take 7 characters at a time when you find a "UN".&amp;nbsp; Using the sample data set provided by &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/88384"&gt;@Shmuel&lt;/a&gt;, consider this code: &lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want (drop=_:);
  set have;
  length un_match $50;

  _c=findw(comment,'UN',' 0123456789');
  if f^=0 then do until(_c_increment=0);  
	un_match=catx(',',un_match,compress(substr(comment,_c,7)));
	_c_increment=findw(substr(comment,_c+1),'UN',' 0123456789');
	_c=_c+_c_increment;
  end;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The other useful tool here is telling the FINDW function that not only blanks are word separators, but so is any digit.&amp;nbsp; So the findw function above will locate a UN followed by a blank or a digit (and preceded by any of them too).&amp;nbsp; This would allow you to skip a string like "UNITED", while capturing both variation of UN codes that you identify.&lt;/P&gt;</description>
      <pubDate>Sat, 05 Sep 2020 22:27:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/681835#M206306</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2020-09-05T22:27:14Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/682244#M206506</link>
      <description>Hello,&lt;BR /&gt;&lt;BR /&gt;Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.&lt;BR /&gt;&lt;BR /&gt;Sincerely,&lt;BR /&gt;Maitri</description>
      <pubDate>Tue, 08 Sep 2020 13:29:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/682244#M206506</guid>
      <dc:creator>shahm</dc:creator>
      <dc:date>2020-09-08T13:29:46Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/682247#M206508</link>
      <description>Hello,&lt;BR /&gt;What I meant is I don't want copy paste the 280 comments in the example source dataset. So it seems I can put a few comments in the example source dataset and continue. Is that correct?&lt;BR /&gt;&lt;BR /&gt;Sincerely,&lt;BR /&gt;Maitri</description>
      <pubDate>Tue, 08 Sep 2020 13:33:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/682247#M206508</guid>
      <dc:creator>shahm</dc:creator>
      <dc:date>2020-09-08T13:33:53Z</dc:date>
    </item>
    <item>
      <title>Re: data extraction from text data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/682258#M206510</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/345066"&gt;@shahm&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Hello,&lt;BR /&gt;&lt;BR /&gt;Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.&lt;BR /&gt;&lt;BR /&gt;Sincerely,&lt;BR /&gt;Maitri&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Have you tried the solution posted by&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/76464"&gt;@s_lassen&lt;/a&gt;&amp;nbsp;?&lt;/P&gt;
&lt;P&gt;You just need to add the COMMENT to the drop statement at the bottom of the program&lt;/P&gt;</description>
      <pubDate>Tue, 08 Sep 2020 14:29:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-extraction-from-text-data/m-p/682258#M206510</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-09-08T14:29:01Z</dc:date>
    </item>
  </channel>
</rss>

