Dear SAS commnunity, I am trying to "clean up" company names by removing strings like CORP, INC, LP and so on from the end of the company name. I am also removing single letters at the end of the company name. The problem with the code is that it removes these strings if they are inside the company name. For example "TEXAS GENCO LP" after the code is applied "TEXAS GENCO" but "LP MINERAL LLC" is reduced incorrectly to "MINERAL". What is wrong with my code?
data test;
input cname $60.;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
CNAME1=prxchange('s/\s\S{1,2}\s*$//o', 1, CNAME1);
CNAME1 = prxchange("s/\b(INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB)\b//o", -1, CNAME1);
CNAME1=prxchange('s/\&$//',-1,strip(CNAME1));
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
;
In my experience PRXCHANGE can be slow in SAS. I suspect you might be better off by (1) extract the last word, (2) remove it if its length is 1, (3) otherwise remove it if it is in a list of unwanted words stored in a hash object.
I suspect this also makes for a easier way to maintain a dynamic list of unwanted last words in the company name:
data unwanted_words;
infile datalines dlm='|';
length word $8;
input word @@;
datalines;
INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB
run;
data want(drop=word);
input cname $60.;
if _n_=1 then do;
length word $8;
declare hash h (dataset:'unwanted_words');
h.definekey('word');
h.definedata('word');
h.definedone();
end;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
word=scan(cname1,-1,' ');
if length(word)=1 then substr(cname1,length(cname1),1)=' ';
else if h.find()=0 then
cname1=tranwrd(cats(cname,'!!'),cats(word,'!!'),' ');
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
run;
You might find the use of the SUBSTR function on the left side of an assignment expression unexpected:
if length(word)=1 then substr(cname1,length(cname1),1)=' ';
This is a well-documented but infrequently used feature of substr. In the above it inserts characters (in this case a single blank) into a specified character position [length(cname) here- i.e. last non-blank position] of the original character value.
you should allow for a space in front of LP to catch only those at the end of the string.
try this I only included a space for LP.
data test;
input cname $60.;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
CNAME2=prxchange('s/\s\S{1,2}\s*$//o', 1, CNAME1);
CNAME3 = prxchange("s/\b(INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA| LP|CORP|LTD|LT|CO|NV| LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB)\b//o", -1, CNAME2);
CNAME4=prxchange('s/\&$//',-1,strip(CNAME3));
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
;
Thank you!
In my experience PRXCHANGE can be slow in SAS. I suspect you might be better off by (1) extract the last word, (2) remove it if its length is 1, (3) otherwise remove it if it is in a list of unwanted words stored in a hash object.
I suspect this also makes for a easier way to maintain a dynamic list of unwanted last words in the company name:
data unwanted_words;
infile datalines dlm='|';
length word $8;
input word @@;
datalines;
INC|PA|PLC|II|III|U S|BV|MI|MICA|1 LLC|B V|GR|SP|LLC|US|USA|LP|CORP|LTD|LT|CO|NV|LP|L P|LL|IN|IND|COR|SA|KG|F K|F K A|AB
run;
data want(drop=word);
input cname $60.;
if _n_=1 then do;
length word $8;
declare hash h (dataset:'unwanted_words');
h.definekey('word');
h.definedata('word');
h.definedone();
end;
CNAME1 = compbl(translate(CNAME, " ", ";.,:()-/"));
word=scan(cname1,-1,' ');
if length(word)=1 then substr(cname1,length(cname1),1)=' ';
else if h.find()=0 then
cname1=tranwrd(cats(cname,'!!'),cats(word,'!!'),' ');
cards;
TEXAS GENCO LP
LP MINERAL LLC
A G SIMPSON AUTOMOTIVE &
A G SIMPSON AUTOMOTIVE A
A G SIMPSON AUTOMOTIVE I
A G SIMPSON AUTOMOTIVE C
A G SIMPSON AUTOMOTIVE U
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE H
A G SIMPSON AUTOMOTIVE PA
A G SIMPSON AUTOMOTIVE INC
run;
You might find the use of the SUBSTR function on the left side of an assignment expression unexpected:
if length(word)=1 then substr(cname1,length(cname1),1)=' ';
This is a well-documented but infrequently used feature of substr. In the above it inserts characters (in this case a single blank) into a specified character position [length(cname) here- i.e. last non-blank position] of the original character value.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.