topic Re: SAS scan(trim) and regex in SAS Programming

SAS scan(trim) and regex

daradanye — Mon, 25 Mar 2019 16:16:14 GMT

Hi,

I am cleaning a dataset that contains a lot of messy company names like this:

Name

Dresser-Rand International B.V. 100.0 - Netherlands

Becker CPA Review Limited (2), Corporation - Israel

Union Planters National Bank (a)(1) 99.90% - USA

21. Hypercom Horizon, Inc - Missouri, USA

El Paso Energy Service Company 100.0000 - Delaware, USA

As can be seen, each column contains the percentage, geography, and number. The ideal cleaned data is like this:

New

Dresser-Rand International B.V.

Becker CPA Review Limited, Corporation

Union Planters National Bank

Hypercom Horizon, Inc

El Paso Energy Service Company

Now I write a code trying to split by "-" first:

data m1;set m1;
now=trim(scan(sub,1,"-"));
run;

It leads to this:

Now

Dresser

Becker CPA Review Limited (2), Corporation

Union Planters National Bank (a)(1) 99.90%

21. Hypercom Horizon, Inc

El Paso Energy Service Company 100.0000

I understand since I split by "-" and take the first part, I get "Dresser" instead of "Dresser-Rand International B.V.". Is there any way to split by the last "-"?

Also, I guess I need to use some regex to replace the special characters (like 99.90%, 100.0000, (1), (a)). I looked some manual on regex but still, feel very confused. I will appreciate it very much if someone can give me some hints in this case.

Thanks a lot!

Re: SAS scan(trim) and regex

ErikLund_Jensen — Mon, 25 Mar 2019 16:58:06 GMT

Hi @daradanye

The following code works with your data. But there might be other cases where something not covered here should be removed.

The first prxchange keeps anything before the last hyphen.

the next removed a separate word containing only period, percentage sign or digits,

the third removes anything within parentheses,

and the last takes care of a period left over in the second record.

data have; 
	infile datalines truncover;
	input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;

data want (drop=w); set have;
	length company w $80.;
	w = prxchange('s/(.*)-(.*$)/$1/',-1,trim(line));
	w = prxchange('s/(.*)\s([\d\.%]*$)/$1/',-1,trim(w));
	w = prxchange('s/\(.*\)/ /',-1,trim(w));
	company = prxchange('s/\s,\s//',-1,trim(w));
run;

Re: SAS scan(trim) and regex

daradanye — Mon, 25 Mar 2019 18:01:41 GMT

Hi @ErikLund_Jensen ,

Thanks so much!

I tried your code and it runs through. But for the fourth record, it still keeps "21.". Is there any way to get rid of it?

Re: SAS scan(trim) and regex

ChrisNZ — Tue, 26 Mar 2019 03:22:10 GMT

Like this?

data have; 
  infile datalines truncover;
  input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;     

data want ; set have;
  length w $80;
  w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line));  * remove last substring delimited by hyphen;
  w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w);     * remove detached numbers;     
  w = prxchange('s/\(.*\)//',-1,trim(w));          * remove parentheses blocks;     
  w = prxchange('s/\ ,/,/',-1,strip(w));           * remove spaces before commas;    
run;

Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company

Re: SAS scan(trim) and regex

ErikLund_Jensen — Wed, 27 Mar 2019 08:21:02 GMT

Hi @daradanye

The number 21. is taken care of by the good modifications to my code by @ChrisNZ.

I would expect more problems to pop up with a full input data set. Either something more that should be removed, or too much cleaning, where a meaningful part of a company name is removed - what would happen if the next company name in your data is Century 21.

As you probably will need more modifications to the code, you should acquire some regex knowledge. The basics are covered in the very good tip sheet

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

Re: SAS scan(trim) and regex

ChrisNZ — Tue, 26 Mar 2019 21:13:31 GMT

It looks like the data always includes a dot when the number should be removed, so that may be a way to spot unwanted numbers.