BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Agent1592
Pyrite | Level 9

Dear SAS Community:

I have company names but the problem is that some have INC, CORP, US at the end of the company name. How can I remove "INC", "INC.", "CORP",  "INC A D", "CO" at the end of the company name? I tried substr, prxchange and others but I could not find a good solution because it was removing INC when it was part of the company name.

 

data WORK.have;
  infile datalines dsd truncover;
  input COMPANY_NAME:$100.;
datalines4;
3M CO
3M CO
3M CO INC
A H BELO CORP
A O SMITH CORP
A O SMITH ELECTRICAL PRODUCTS
A SCHULMAN INC
AAR CORP
ABBOTT LABORATORIES
ABBOTT LABORATORIES
ABBOTT LABS
ABENGOA BIOENERGY
ACCURIDE CORP
ACETO CORP
ACME UNITED CORP
ACTUANT CORP
ACUTIY LIGHTING GROUP INC A D
ADM
ADOLPH COORS CO
ADVANSIX INC
AEROFLEX INC
AEROJET ROCKETDYNE HOLDINGS
AGCO CORP
AGCO INC
AGERE SYSTEMS
AGRIUM INC
AGRIUM US INC
AIR PRODS & CHEMICALS INC
AIR PRODUCTS & CHEMICALS INC
AK STEEL HOLDING CORP
ALBEMARLE CORP
ALCOA CORP
ALLEGHENY TECHNOLOGIES INC
ALLERGAN INC
ALLIED MOTION TECHNOLOGY
ALLIED-SIGNAL INC
ALON USA ENERGY INC
ALTAIR
ALTRIA GROUP INC
ALUMAX INC
AMCOL INTERNATIONAL CORP
AMERADA HESS CORP
AMERCO
AMERICAN AXLE & MANUFACTURING
AMERICAN BILTRITE INC
AMERICAN BUILDINGS CO
AMERICAN GREETINGS CORP
AMERICAN OUTDOOR BRANDS CORP
AMERICAN PACIFIC CORP
AMERICAN VANGUARD CORP
AMERICAN WOODMARK CORP
AMETEK INC
"AMETEK, INC"
AMPCO-PITTSBURGH CORP
AMPHENOL CORP
AMPHENOL INC
AMREP INC
AMTROL INC
ANACOMP INC
ANALOG DEVICES INC
ANCHOR GLASS CONTAINER CORP
ANDEAVOR
APOGEE ENT
APOGEE ENTERPRISES INC
APPLE COMPUTER INC
APPLIED MATERIALS INC
APPLIED POWER INC
APTARGROUP INC
ARCHER DANIELS MIDLAND
ARCHER DANIELS MIDLAND CO
ARCHER DANIELS MIDLAND CO
ARMSTRONG FLOORING INC
ARMSTRONG FLOORING INC
ARMSTRONG FLOORING INC.
ARMSTRONG WORLD INDL INC
ARMSTRONG WORLD INDUSTRIES
ARMSTRONG WORLD INDUSTRIES IN
ARROW INTERNATIONAL INC
ARVINMERITOR INC
ASH GROVE CEMENT CO
ASSOCIATED MATERIALS INC
ASTEC INDUSTRIES INC
AT&T CORP
AUTOMOTIVE COMPONENTS HOLDING
AUTOMOTIVE SAFETY COMPONENTS
AVAYA INC
AVERY DENNISON CORP
AXIALL CORP
AZZ INC
B/E AEROSPACE INC
BACARDI CORP
BADGER METER INC
BALCHEM CORP
BALL CORP
BANDAG INC
BARNES GROUP INC
BASSETT FURNITURE IND INC
BASSETT FURNITURE INDS INC
BASSETT FURNITURE INDUSTRIES
BAXTER HEALTHCARE CORP
BAXTER INTERNATIONAL INC
BE AEROSPACE
BE AEROSPACE INC
BECTON DICKINSON & CO
BEL FUSE INC
BELDEN CDT
BELDEN CDT INC
BELDEN INC
BELL INDUSTRIES INC
BEMIS CO INC
BENCHMARK ELECTRONICS INC
BERKSHIRE HATHAWAY
BERKSHIRE HATHAWAY
BERKSHIRE HATHAWAY INC
BERKSHIRE HATHAWAY INC
BERKSHIRE HATHAWAY INC
BIC CORP
BIC USA INC
BIO-RAD LABORATORIES INC
BLACK & DECKER CORP
BLYTH INDL INC
BLYTH INDL INC
BLYTH INDUSTRIES INC
BLYTH INDUSTRIES INC
BOEING CO
BOISE CASCADE CO
BOISE CASCADE CORP
BOMBARDIER MOTOR CORP OF AMER
BORGWARNER AUTOMOTIVE INC
BORGWARNER INC
BOSTON SCIENTIFIC CORP
BRADY CORP
BREED TECHS
BREED TECHS INC
BRIGGS & STRATTON CORP
BRISTOL-MYERS SQUIBB CO
BROWN FORMAN CORP
BROWN GROUP INC
BROWN-FORMAN CORP
BRUNSWICK CORP
BRUSH ENGINEERED MATERIALS
BUCKEYE PARTNERS L.P.
BUCKEYE PORT READING TERMINAL
BURKE-PARSONS-BOWLBY CORP
BURNHAM HOLDINGS INC
BWAY CORP
BWAY HOLDING CO
BWX TECHNOLOGIES INC
BWX TECHNOLOGIES INC.
C&D TECHNOLOGIES INC
;;;;
1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi Agent1592

 

Prxchange works fine.

The trick is the $ sign ending the search string, which instructs prxchange to look after any of the strings in group 2 only at the end of the input string, + trimming of input string to make last character = end-of-string.

 

The second prxchange is used to remove CO in the string "3M CO INC", after the first prxchange got rid of INC.

 

data want; set have;
	cn2 = COMPANY_NAME;
	cn2 = prxchange('s/(.*)(INC|INC\.|CORP|INC A D|CO)$/$1/',1,trim(cn2));
	cn2 = prxchange('s/(.*)(CO)$/$1/',1,trim(cn2));
run;

View solution in original post

11 REPLIES 11
ErikLund_Jensen
Rhodochrosite | Level 12

Hi Agent1592

 

Prxchange works fine.

The trick is the $ sign ending the search string, which instructs prxchange to look after any of the strings in group 2 only at the end of the input string, + trimming of input string to make last character = end-of-string.

 

The second prxchange is used to remove CO in the string "3M CO INC", after the first prxchange got rid of INC.

 

data want; set have;
	cn2 = COMPANY_NAME;
	cn2 = prxchange('s/(.*)(INC|INC\.|CORP|INC A D|CO)$/$1/',1,trim(cn2));
	cn2 = prxchange('s/(.*)(CO)$/$1/',1,trim(cn2));
run;
Agent1592
Pyrite | Level 9

Thank you. This code works. However if the company name is "AGCO CORP" after the trimming we are left with just "AG". CO gets cut from the company name although it should not. What is the solution to this?

Patrick
Opal | Level 21

If there is an exception to the rule then you have to specifically code for it. If these are only a few companies then just maintain a list of these names and don't run the RegEx over them.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi Agent1592

 

My fault. It is not specified in the prxchange that the search strings should be words, i.e. with a space character in front, and because of that CO is removed also if it is the end of a longer word. CO is in both prxchange1 and prxchange2, so "ACCO CORP" - change1 -> "AGCO" - change2 -> "AG".

 

Patrick's suggestion - to drop the second prxchange and just add more search strings to catch combinations like "CO CORP" - is a very good advice, but not foolproof, because the sloppy code I posted would still make "AGCO" - change1 -> "AG" and similar, if there is no company type specification after the name.

 

it can be solved by specifying a space character as the first char. of each search string, as \sCO|\sINC | etc. - Do you know the very useful  sas tip sheet: https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

 

 

 

Agent1592
Pyrite | Level 9

Thank you sir. I did not know about this.

Patrick
Opal | Level 21

@ErikLund_Jensen

If searching for words use \b which is the meta-character for a word boundary. This will then also work at the very beginning or end of a string or for cases where a word gets terminated by something like a .?! 

\bCO\b|\bINC\b| etc.

 

With PRXCHANGE() also consider to use the "i" switch so searches become case insensitive.

cn2 = prxchange('s/(.*)(CO)$/$1/i',1,trim(cn2));

 

Adubhai
Obsidian | Level 7

Hi ErikLund

 

Thanks for your code here. I had similar problems as Agent1592 and your code helped me solve my problem. However I still have another problem. Do you know how to modify the prxchange code so that it trim "THE" from the beginning of the company name? I tried 

cn2 = prxchange('s/(THE)(.*)$/$1/',1,trim(cn2));

But this code keeps "THE" and trim the rest of the name. I want it to keep everything else except "THE" in the beginning. 

Patrick
Opal | Level 21

@Adubhai 

 

cn2 = prxchange('s/^THE\b(.*)/$1/oi',1,strip(cn2));

 

Adubhai
Obsidian | Level 7
Thanks Patrick but it didn't work. Is it because I have compressed the names so there are no spaces after "THE" in the name?
Patrick
Opal | Level 21

Then remove the \b from the RegEx and be aware that this will also remove the THE from THEMES, THEATRE and so on.

Adubhai
Obsidian | Level 7
Yes now it worked. Thanks a lot. And yes I am aware about words like THEMES and THEATRE.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 5726 views
  • 10 likes
  • 4 in conversation