Solved: Removing characters at the end of company names prxchange

Agent1592 · Posted 01-22-2019 06:39 PM

Dear SAS commnunity, I am trying to "clean up" company names by removing strings like CORP, INC, LP and so on from the end of the company name. I am also removing single letters at the end of the company name. The problem with the code is that it removes these strings if they are inside the company name. For example "TEXAS GENCO LP" after the code is applied "TEXAS GENCO" but "LP MINERAL LLC" is reduced incorrectly to "MINERAL". What is wrong with my code?

data test;
input cname $60.;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
CNAME1=prxchange('s/\s\S{1,2}\s*$//o', 1, CNAME1);
CNAME1 = prxchange("s/\b(INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB)\b//o", -1, CNAME1);
CNAME1=prxchange('s/\&$//',-1,strip(CNAME1));
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
;

mkeintz · Posted 01-22-2019 08:57 PM

In my experience PRXCHANGE can be slow in SAS. I suspect you might be better off by (1) extract the last word, (2) remove it if its length is 1, (3) otherwise remove it if it is in a list of unwanted words stored in a hash object.

I suspect this also makes for a easier way to maintain a dynamic list of unwanted last words in the company name:

data unwanted_words;
  infile datalines dlm='|';
  length word $8;
  input word @@;
datalines;
INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB
run;

data want(drop=word);
  input cname $60.;
  if _n_=1 then do;
    length word $8;
    declare hash h (dataset:'unwanted_words');
	  h.definekey('word');
	  h.definedata('word');
	  h.definedone();
  end;
  CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
  word=scan(cname1,-1,' ');
  if length(word)=1 then substr(cname1,length(cname1),1)=' ';
  else if h.find()=0 then 
    cname1=tranwrd(cats(cname,'!!'),cats(word,'!!'),' ');
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
run;

You might find the use of the SUBSTR function on the left side of an assignment expression unexpected:

if length(word)=1 then substr(cname1,length(cname1),1)=' ';

This is a well-documented but infrequently used feature of substr. In the above it inserts characters (in this case a single blank) into a specified character position [length(cname) here- i.e. last non-blank position] of the original character value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

VDD · Posted 01-22-2019 07:35 PM

you should allow for a space in front of LP to catch only those at the end of the string.

try this I only included a space for LP.

data test;
input cname $60.;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
CNAME2=prxchange('s/\s\S{1,2}\s*$//o', 1, CNAME1);
CNAME3 = prxchange("s/\b(INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA| LP|CORP|LTD|LT|CO|NV| LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB)\b//o", -1, CNAME2);
CNAME4=prxchange('s/\&$//',-1,strip(CNAME3));
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
;

Agent1592 · Posted 01-22-2019 11:05 PM

Thank you!

mkeintz · Posted 01-22-2019 08:57 PM

In my experience PRXCHANGE can be slow in SAS. I suspect you might be better off by (1) extract the last word, (2) remove it if its length is 1, (3) otherwise remove it if it is in a list of unwanted words stored in a hash object.

I suspect this also makes for a easier way to maintain a dynamic list of unwanted last words in the company name:

data unwanted_words;
  infile datalines dlm='|';
  length word $8;
  input word @@;
datalines;
INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB
run;

data want(drop=word);
  input cname $60.;
  if _n_=1 then do;
    length word $8;
    declare hash h (dataset:'unwanted_words');
	  h.definekey('word');
	  h.definedata('word');
	  h.definedone();
  end;
  CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
  word=scan(cname1,-1,' ');
  if length(word)=1 then substr(cname1,length(cname1),1)=' ';
  else if h.find()=0 then 
    cname1=tranwrd(cats(cname,'!!'),cats(word,'!!'),' ');
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
run;

You might find the use of the SUBSTR function on the left side of an assignment expression unexpected:

if length(word)=1 then substr(cname1,length(cname1),1)=' ';

This is a well-documented but infrequently used feature of substr. In the above it inserts characters (in this case a single blank) into a specified character position [length(cname) here- i.e. last non-blank position] of the original character value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Re: Removing characters at the end of company names prxchange

Click image to register for webinar

Classroom Training Available!