<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SAS scan(trim) and regex in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/545897#M151092</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/32269"&gt;@daradanye&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The following code works with your data. But there might be other cases where something not covered here should be removed.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The first prxchange keeps everything before the last hyphen,&lt;/P&gt;
&lt;P&gt;the next removes a trailing separate word containing only periods, percent signs or digits,&lt;/P&gt;
&lt;P&gt;the third removes anything within parentheses,&lt;/P&gt;
&lt;P&gt;and the last takes care of a comma left over in the second record.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;data have; 
	infile datalines truncover;
	input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;

data want (drop=w); set have;
	length company w $80.;
	w = prxchange('s/(.*)-(.*$)/$1/',-1,trim(line));
	w = prxchange('s/(.*)\s([\d\.%]*$)/$1/',-1,trim(w));
	w = prxchange('s/\(.*\)/ /',-1,trim(w));
	company = prxchange('s/\s,\s//',-1,trim(w));
run;
&lt;/PRE&gt;</description>
    <pubDate>Mon, 25 Mar 2019 16:58:06 GMT</pubDate>
    <dc:creator>ErikLund_Jensen</dc:creator>
    <dc:date>2019-03-25T16:58:06Z</dc:date>
    <item>
      <title>SAS scan(trim) and regex</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/545881#M151087</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am cleaning a dataset that contains a lot of messy company names like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Name&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Dresser-Rand International B.V. 100.0 - Netherlands&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Becker CPA Review Limited (2), Corporation - Israel&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Union Planters National Bank (a)(1)&amp;nbsp; 99.90% - USA&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;21.&amp;nbsp; Hypercom Horizon, Inc - Missouri, USA&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;El Paso Energy Service Company&amp;nbsp;&amp;nbsp; 100.0000&amp;nbsp; - Delaware, USA&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As can be seen, each value also contains extras such as a percentage, a geography, or a number.&amp;nbsp; The ideal cleaned data looks like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;New&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Dresser-Rand International B.V.&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Becker CPA Review Limited, Corporation&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Union Planters National Bank&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Hypercom Horizon, Inc&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;El Paso Energy Service Company&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I wrote some code that first splits on "-":&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data m1;set m1;
now=trim(scan(sub,1,"-"));
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;It leads to this:&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Now&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Dresser&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Becker CPA Review Limited (2), Corporation&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Union Planters National Bank (a)(1)&amp;nbsp; 99.90%&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;21.&amp;nbsp; Hypercom Horizon, Inc&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;El Paso Energy Service Company&amp;nbsp;&amp;nbsp; 100.0000&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I understand that because I split on "-" and take the first part, I get "Dresser" instead of "Dresser-Rand International B.V.".&amp;nbsp; Is there any way to split on the last "-"?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, I guess I need to use some regex to remove the extra pieces (like 99.90%, 100.0000, (1), (a)).&amp;nbsp; I looked at some regex manuals but still feel very confused.&amp;nbsp; I would appreciate it very much if someone could give me some hints.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2019 16:16:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/545881#M151087</guid>
      <dc:creator>daradanye</dc:creator>
      <dc:date>2019-03-25T16:16:14Z</dc:date>
    </item>
    <item>
      <title>Re: SAS scan(trim) and regex</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/545897#M151092</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/32269"&gt;@daradanye&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The following code works with your data. But there might be other cases where something not covered here should be removed.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The first prxchange keeps everything before the last hyphen,&lt;/P&gt;
&lt;P&gt;the next removes a trailing separate word containing only periods, percent signs or digits,&lt;/P&gt;
&lt;P&gt;the third removes anything within parentheses,&lt;/P&gt;
&lt;P&gt;and the last takes care of a comma left over in the second record.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;data have; 
	infile datalines truncover;
	input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;

data want (drop=w); set have;
	length company w $80.;
	w = prxchange('s/(.*)-(.*$)/$1/',-1,trim(line));
	w = prxchange('s/(.*)\s([\d\.%]*$)/$1/',-1,trim(w));
	w = prxchange('s/\(.*\)/ /',-1,trim(w));
	company = prxchange('s/\s,\s//',-1,trim(w));
run;
&lt;/PRE&gt;</description>
      <pubDate>Mon, 25 Mar 2019 16:58:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/545897#M151092</guid>
      <dc:creator>ErikLund_Jensen</dc:creator>
      <dc:date>2019-03-25T16:58:06Z</dc:date>
    </item>
    <item>
      <title>Re: SAS scan(trim) and regex</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/545911#M151095</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12887"&gt;@ErikLund_Jensen&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks so much!&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried your code and it runs through.&amp;nbsp; But for the fourth record, it still keeps "21.".&amp;nbsp; Is there any way to get rid of it?&amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2019 18:01:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/545911#M151095</guid>
      <dc:creator>daradanye</dc:creator>
      <dc:date>2019-03-25T18:01:41Z</dc:date>
    </item>
    <item>
      <title>Re: SAS scan(trim) and regex</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/546018#M151136</link>
      <description>&lt;P&gt;Like this?&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have; 
  infile datalines truncover;
  input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;     

data want ; set have;
  length w $80;
  w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line));  * remove last substring delimited by hyphen;
  w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w);     * remove detached numbers;     
  w = prxchange('s/\(.*\)//',-1,trim(w));          * remove parentheses blocks;     
  w = prxchange('s/\ ,/,/',-1,strip(w));           * remove spaces before commas;    
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Dresser-Rand International B.V.&lt;BR /&gt;Becker CPA Review Limited, Corporation&lt;BR /&gt;Union Planters National Bank&lt;BR /&gt;Hypercom Horizon, Inc&lt;BR /&gt;El Paso Energy Service Company&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2019 03:22:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/546018#M151136</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-03-26T03:22:10Z</dc:date>
    </item>
    <item>
      <title>Re: SAS scan(trim) and regex</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/546052#M151151</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/32269"&gt;@daradanye&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The number 21. is taken care of by the good modifications to my code by&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I would expect more problems to pop up with a full input data set: either something more that should be removed, or too much cleaning, where a meaningful part of a company name is removed. What would happen if the next company name in your data is&amp;nbsp;&lt;EM&gt;Century 21.&lt;/EM&gt;?&lt;/P&gt;
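&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A quick way to check is to run such a name through the hyphen and detached-number substitutions posted earlier in this thread. This is only a sketch: the data set name check and the test value 'Century 21. - USA' are made up for illustration and do not come from the original data.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data check;
  length line w $80;
  line = 'Century 21. - USA';                      * hypothetical test value;
  w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line));  * remove last substring delimited by hyphen;
  w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w);     * remove detached numbers;
  put w=;                                          * write the result to the log;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The log should show w=Century, so the 21. that belongs to the company name is removed along with the unwanted numbers. A rule based only on the shape of the token cannot tell the two cases apart.&lt;/P&gt;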
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As you will probably need more modifications to the code, you should acquire some regex knowledge. The basics are covered in this very good tip sheet:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf" target="_blank" rel="noopener"&gt;https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Mar 2019 08:21:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/546052#M151151</guid>
      <dc:creator>ErikLund_Jensen</dc:creator>
      <dc:date>2019-03-27T08:21:02Z</dc:date>
    </item>
    <item>
      <title>Re: SAS scan(trim) and regex</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/546312#M151241</link>
      <description>&lt;P&gt;It looks like the data always includes a dot when the number should be removed, so that may be a way to spot unwanted numbers.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2019 21:13:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-scan-trim-and-regex/m-p/546312#M151241</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-03-26T21:13:31Z</dc:date>
    </item>
  </channel>
</rss>

