Hi,
I am cleaning a dataset that contains a lot of messy company names like this:
Name |
Dresser-Rand International B.V. 100.0 - Netherlands |
Becker CPA Review Limited (2), Corporation - Israel |
Union Planters National Bank (a)(1) 99.90% - USA |
21. Hypercom Horizon, Inc - Missouri, USA |
El Paso Energy Service Company 100.0000 - Delaware, USA |
As can be seen, each column contains the percentage, geography, and number. The ideal cleaned data is like this:
New |
Dresser-Rand International B.V. |
Becker CPA Review Limited, Corporation |
Union Planters National Bank |
Hypercom Horizon, Inc |
El Paso Energy Service Company |
Now I write a code trying to split by "-" first:
data m1;set m1;
now=trim(scan(sub,1,"-"));
run;
It leads to this:
Now |
Dresser |
Becker CPA Review Limited (2), Corporation |
Union Planters National Bank (a)(1) 99.90% |
21. Hypercom Horizon, Inc |
El Paso Energy Service Company 100.0000 |
I understand since I split by "-" and take the first part, I get "Dresser" instead of "Dresser-Rand International B.V.". Is there any way to split by the last "-"?
Also, I guess I need to use some regex to replace the special characters (like 99.90%, 100.0000, (1), (a)). I looked some manual on regex but still, feel very confused. I will appreciate it very much if someone can give me some hints in this case.
Thanks a lot!
Like this?
data have;
infile datalines truncover;
input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1) 99.90% - USA
21. Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company 100.0000 - Delaware, USA
;
run;
data want ; set have;
length w $80;
w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line)); * remove last substring delimited by hyphen;
w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w); * remove detached numbers;
w = prxchange('s/\(.*\)//',-1,trim(w)); * remove parentheses blocks;
w = prxchange('s/\ ,/,/',-1,strip(w)); * remove spaces before commas;
run;
Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company
Hi @daradanye
The following code works with your data. But there might be other cases where something not covered here should be removed.
The first prxchange keeps anything before the last hyphen.
the next removed a separate word containing only period, percentage sign or digits,
the third removes anything within parentheses,
and the last takes care of a period left over in the second record.
data have; infile datalines truncover; input line $char100.; datalines; Dresser-Rand International B.V. 100.0 - Netherlands Becker CPA Review Limited (2), Corporation - Israel Union Planters National Bank (a)(1) 99.90% - USA 21. Hypercom Horizon, Inc - Missouri, USA El Paso Energy Service Company 100.0000 - Delaware, USA ; run; data want (drop=w); set have; length company w $80.; w = prxchange('s/(.*)-(.*$)/$1/',-1,trim(line)); w = prxchange('s/(.*)\s([\d\.%]*$)/$1/',-1,trim(w)); w = prxchange('s/\(.*\)/ /',-1,trim(w)); company = prxchange('s/\s,\s//',-1,trim(w)); run;
Hi @ErikLund_Jensen ,
Thanks so much!
I tried your code and it runs through. But for the fourth record, it still keeps "21.". Is there any way to get rid of it?
Like this?
data have;
infile datalines truncover;
input line $char100.;
datalines;
Dresser-Rand International B.V. 100.0 - Netherlands
Becker CPA Review Limited (2), Corporation - Israel
Union Planters National Bank (a)(1) 99.90% - USA
21. Hypercom Horizon, Inc - Missouri, USA
El Paso Energy Service Company 100.0000 - Delaware, USA
;
run;
data want ; set have;
length w $80;
w = prxchange('s/(.*)(-.*)$/$1/',1,trim(line)); * remove last substring delimited by hyphen;
w = prxchange('s/(( |\A)[%\d\.]+ )//',-1,w); * remove detached numbers;
w = prxchange('s/\(.*\)//',-1,trim(w)); * remove parentheses blocks;
w = prxchange('s/\ ,/,/',-1,strip(w)); * remove spaces before commas;
run;
Dresser-Rand International B.V.
Becker CPA Review Limited, Corporation
Union Planters National Bank
Hypercom Horizon, Inc
El Paso Energy Service Company
Hi @daradanye
The number 21. is taken care of by the good modifications to my code by @ChrisNZ.
I would expect more problems to pop up with a full input data set. Either something more that should be removed, or too much cleaning, where a meaningful part of a company name is removed - what would happen if the next company name in your data is Century 21.
As you probably will need more modifications to the code, you should acquire some regex knowledge. The basics are covered in the very good tip sheet
https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
It looks like the data always includes a dot when the number should be removed, so that may be a way to spot unwanted numbers.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.