BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
daradanye
Obsidian | Level 7

Hi, 

 

I am cleaning a dataset that contains a lot of messy company names like this:

 

Name
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA

 

As can be seen, each column contains the percentage, geography, and number.  The ideal cleaned data is like this:

 

New
Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company

 

Now I write a code trying to split by "-" first:

 

data m1;set m1;
now=trim(scan(sub,1,"-"));
run;

 It leads to this:

Now
Dresser
Becker CPA Review Limited (2), Corporation
Union Planters National Bank (a)(1)  99.90%
21.  Hypercom Horizon, Inc
El Paso Energy Service Company   100.0000

 

I understand since I split by "-" and take the first part, I get "Dresser" instead of "Dresser-Rand International B.V.".  Is there any way to split by the last "-"? 

 

Also, I guess I need to use some regex to replace the special characters (like 99.90%, 100.0000, (1), (a)).  I looked some manual on regex but still, feel very confused.  I will appreciate it very much if someone can give me some hints in this case.

 

 

Thanks a lot!

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Like this?

data have; 
  infile datalines truncover;
  input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;     

data want ; set have;
  length w $80;
  w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line));  * remove last substring delimited by hyphen;
  w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w);     * remove detached numbers;     
  w = prxchange('s/\(.*\)//',-1,trim(w));          * remove parentheses blocks;     
  w = prxchange('s/\ ,/,/',-1,strip(w));           * remove spaces before commas;    
run;

Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company

View solution in original post

5 REPLIES 5
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @daradanye 

 

The following code works with your data. But there might be other cases where something not covered here should be removed.

 

The first prxchange keeps anything before the last hyphen. 

the next removed a separate word containing only period, percentage sign or digits, 

the third removes anything within parentheses,

and the last takes care of a period left over in the second record.

 

data have; 
	infile datalines truncover;
	input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;

data want (drop=w); set have;
	length company w $80.;
	w = prxchange('s/(.*)-(.*$)/$1/',-1,trim(line));
	w = prxchange('s/(.*)\s([\d\.%]*$)/$1/',-1,trim(w));
	w = prxchange('s/\(.*\)/ /',-1,trim(w));
	company = prxchange('s/\s,\s//',-1,trim(w));
run;
daradanye
Obsidian | Level 7

Hi @ErikLund_Jensen ,

 

Thanks so much!  

 

I tried your code and it runs through.  But for the fourth record, it still keeps "21.".  Is there any way to get rid of it?  

ChrisNZ
Tourmaline | Level 20

Like this?

data have; 
  infile datalines truncover;
  input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1)  99.90% - USA
21.  Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company   100.0000  - Delaware, USA
;
run;     

data want ; set have;
  length w $80;
  w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line));  * remove last substring delimited by hyphen;
  w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w);     * remove detached numbers;     
  w = prxchange('s/\(.*\)//',-1,trim(w));          * remove parentheses blocks;     
  w = prxchange('s/\ ,/,/',-1,strip(w));           * remove spaces before commas;    
run;

Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @daradanye 

 

The number 21. is taken care of by the good modifications to my code by @ChrisNZ

 

I would expect more problems to pop up with a full input data set. Either something more that should be removed, or too much cleaning, where a meaningful part of a company name is removed - what would happen if the next company name in your data is Century 21.

 

As you probably will need more modifications to the code, you should acquire some regex knowledge. The basics are covered in the very good tip sheet 

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

 

ChrisNZ
Tourmaline | Level 20

It looks like the data always includes a dot when the number should be removed, so that may be a way to spot unwanted numbers.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1238 views
  • 5 likes
  • 3 in conversation