BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Agent1592
Pyrite | Level 9

Dear SAS commnunity, I am trying to "clean up" company names by removing strings like CORP, INC, LP and so on from the end of the company name. I am also removing single letters at the end of the company name. The problem with the code is that it removes these strings if they are inside the company name. For example "TEXAS GENCO LP" after the code is applied "TEXAS GENCO" but "LP MINERAL LLC" is reduced incorrectly to "MINERAL". What is wrong with my code?

 

 

data test;
input cname $60.;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
CNAME1=prxchange('s/\s\S{1,2}\s*$//o', 1, CNAME1);
CNAME1 = prxchange("s/\b(INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB)\b//o", -1, CNAME1);
CNAME1=prxchange('s/\&$//',-1,strip(CNAME1));
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
;

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

In my experience PRXCHANGE can be slow in SAS.  I suspect you might be better off by (1) extract the last word, (2) remove it if its length is 1, (3) otherwise remove it if it is in a list of unwanted words stored in a hash object.

 

I suspect this also makes for a easier way to maintain a dynamic list of unwanted last words in the company name:

 

data unwanted_words;
  infile datalines dlm='|';
  length word $8;
  input word @@;
datalines;
INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB
run;

data want(drop=word);
  input cname $60.;
  if _n_=1 then do;
    length word $8;
    declare hash h (dataset:'unwanted_words');
	  h.definekey('word');
	  h.definedata('word');
	  h.definedone();
  end;
  CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
  word=scan(cname1,-1,' ');
  if length(word)=1 then substr(cname1,length(cname1),1)=' ';
  else if h.find()=0 then 
    cname1=tranwrd(cats(cname,'!!'),cats(word,'!!'),' ');
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
run;

 

You might find the use of the SUBSTR function on the left side of an assignment expression unexpected:

     if length(word)=1 then substr(cname1,length(cname1),1)=' ';

 

This is a well-documented but infrequently used feature of substr.  In the above it inserts characters (in this case a single blank) into a specified character position [length(cname) here- i.e. last non-blank position] of the original character value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

3 REPLIES 3
VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

you should allow for a space in front of LP to catch only those at the end of the string.

try this I only included a space for LP.

 

 

data test;
input cname $60.;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
CNAME2=prxchange('s/\s\S{1,2}\s*$//o', 1, CNAME1);
CNAME3 = prxchange("s/\b(INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA| LP|CORP|LTD|LT|CO|NV| LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB)\b//o", -1, CNAME2);
CNAME4=prxchange('s/\&$//',-1,strip(CNAME3));
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
;
mkeintz
PROC Star

In my experience PRXCHANGE can be slow in SAS.  I suspect you might be better off by (1) extract the last word, (2) remove it if its length is 1, (3) otherwise remove it if it is in a list of unwanted words stored in a hash object.

 

I suspect this also makes for a easier way to maintain a dynamic list of unwanted last words in the company name:

 

data unwanted_words;
  infile datalines dlm='|';
  length word $8;
  input word @@;
datalines;
INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB
run;

data want(drop=word);
  input cname $60.;
  if _n_=1 then do;
    length word $8;
    declare hash h (dataset:'unwanted_words');
	  h.definekey('word');
	  h.definedata('word');
	  h.definedone();
  end;
  CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
  word=scan(cname1,-1,' ');
  if length(word)=1 then substr(cname1,length(cname1),1)=' ';
  else if h.find()=0 then 
    cname1=tranwrd(cats(cname,'!!'),cats(word,'!!'),' ');
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
run;

 

You might find the use of the SUBSTR function on the left side of an assignment expression unexpected:

     if length(word)=1 then substr(cname1,length(cname1),1)=' ';

 

This is a well-documented but infrequently used feature of substr.  In the above it inserts characters (in this case a single blank) into a specified character position [length(cname) here- i.e. last non-blank position] of the original character value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 1166 views
  • 3 likes
  • 3 in conversation