Solved: SAS scan(trim) and regex

daradanye · Posted 03-25-2019 12:16 PM

Hi,

I am cleaning a dataset that contains a lot of messy company names like this:

Name

Dresser-Rand International B.V. 100.0 - Netherlands

Becker CPA Review Limited (2), Corporation - Israel

Union Planters National Bank (a)(1) 99.90% - USA

21. Hypercom Horizon, Inc - Missouri, USA

El Paso Energy Service Company 100.0000 - Delaware, USA

As can be seen, each column contains the percentage, geography, and number. The ideal cleaned data is like this:

New

Dresser-Rand International B.V.

Becker CPA Review Limited, Corporation

Union Planters National Bank

Hypercom Horizon, Inc

El Paso Energy Service Company

Now I write a code trying to split by "-" first:

data m1;set m1;
now=trim(scan(sub,1,"-"));
run;

It leads to this:

Now

Dresser

Becker CPA Review Limited (2), Corporation

Union Planters National Bank (a)(1) 99.90%

21. Hypercom Horizon, Inc

El Paso Energy Service Company 100.0000

I understand since I split by "-" and take the first part, I get "Dresser" instead of "Dresser-Rand International B.V.". Is there any way to split by the last "-"?

Also, I guess I need to use some regex to replace the special characters (like 99.90%, 100.0000, (1), (a)). I looked some manual on regex but still, feel very confused. I will appreciate it very much if someone can give me some hints in this case.

Thanks a lot!

ChrisNZ · Posted 03-25-2019 11:21 PM

Like this?

data have; 
  infile datalines truncover;
  input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;     

data want ; set have;
  length w $80;
  w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line));  * remove last substring delimited by hyphen;
  w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w);     * remove detached numbers;     
  w = prxchange('s/\(.*\)//',-1,trim(w));          * remove parentheses blocks;     
  w = prxchange('s/\ ,/,/',-1,strip(w));           * remove spaces before commas;    
run;

Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company

High-Performance SAS Coding - Third Edition

View solution in original post

ErikLund_Jensen · Posted 03-25-2019 12:58 PM

Hi @daradanye

The following code works with your data. But there might be other cases where something not covered here should be removed.

The first prxchange keeps anything before the last hyphen.

the next removed a separate word containing only period, percentage sign or digits,

the third removes anything within parentheses,

and the last takes care of a period left over in the second record.

data have; 
	infile datalines truncover;
	input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;

data want (drop=w); set have;
	length company w $80.;
	w = prxchange('s/(.*)-(.*$)/$1/',-1,trim(line));
	w = prxchange('s/(.*)\s([\d\.%]*$)/$1/',-1,trim(w));
	w = prxchange('s/\(.*\)/ /',-1,trim(w));
	company = prxchange('s/\s,\s//',-1,trim(w));
run;

daradanye · Posted 03-25-2019 02:01 PM

Hi @ErikLund_Jensen ,

Thanks so much!

I tried your code and it runs through. But for the fourth record, it still keeps "21.". Is there any way to get rid of it?

ChrisNZ · Posted 03-25-2019 11:21 PM

Like this?

data have; 
  infile datalines truncover;
  input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;     

data want ; set have;
  length w $80;
  w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line));  * remove last substring delimited by hyphen;
  w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w);     * remove detached numbers;     
  w = prxchange('s/\(.*\)//',-1,trim(w));          * remove parentheses blocks;     
  w = prxchange('s/\ ,/,/',-1,strip(w));           * remove spaces before commas;    
run;

Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company

High-Performance SAS Coding - Third Edition

ErikLund_Jensen · Posted 03-26-2019 04:47 AM

Hi @daradanye

The number 21. is taken care of by the good modifications to my code by @ChrisNZ.

I would expect more problems to pop up with a full input data set. Either something more that should be removed, or too much cleaning, where a meaningful part of a company name is removed - what would happen if the next company name in your data is Century 21.

As you probably will need more modifications to the code, you should acquire some regex knowledge. The basics are covered in the very good tip sheet

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

ChrisNZ · Posted 03-26-2019 05:13 PM

It looks like the data always includes a dot when the number should be removed, so that may be a way to spot unwanted numbers.

High-Performance SAS Coding - Third Edition

SAS scan(trim) and regex

Re: SAS scan(trim) and regex

Re: SAS scan(trim) and regex

Re: SAS scan(trim) and regex

Re: SAS scan(trim) and regex

Re: SAS scan(trim) and regex

Re: SAS scan(trim) and regex

Classroom Training Available!