Hi all,
I have a list of abbreviations, as below:
data abbrev;
infile datalines truncover;
input word $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;
I also have a list of firm names, some of which end with one of the abbreviations above.
For each firm name, I want to iterate through the dataset 'abbrev' and check whether the name ends with any of the abbreviations. If it does, simply remove the abbreviation.
Please help.
Can you post some firm names as data-step using datalines, so that we have something to play with?
To be 100% sure: if one of the abbreviations appears in the middle of a company name, the abbreviation is not removed?
EDIT: I would load the dataset abbrev into a hash object, defining word as the key. Then use
word = scan(firm_name, -1);
to get the last word of each name, check if that word is in the hash-object and finally use prxchange to remove the word from the name:
firm_name = prxchange(cats('s/(.*)\W(', word, ')/$1/'), 1, trim(firm_name));
That should not be too much code to write, and it should run reasonably fast as long as the number of obs in abbrev is not too high.
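The hash lookup described above could look roughly like the sketch below. It is only an illustration under a few assumptions: the firm names live in a dataset called firms with a character variable firm_name (both names are hypothetical), and the abbreviations are stored in upper case as in the question. Two deliberate deviations from the post: scan is called with a blank as the only delimiter so that SA/NV stays one word, and substrn is swapped in for prxchange so that the slash does not need escaping inside a regular expression. A multi-word entry such as V LP cannot be found by a last-word lookup and would need separate handling.

```sas
/* Sketch only: FIRMS and FIRM_NAME are assumed names, not from the thread. */
data firms_clean;
  length word $50;                      /* host variable for the hash key */

  /* Load the abbreviation list into memory once */
  if _n_ = 1 then do;
    declare hash h(dataset: 'abbrev');
    h.defineKey('word');
    h.defineDone();
  end;

  set firms;

  /* Last blank-delimited word, upper-cased to match the lookup table */
  word = upcase(scan(firm_name, -1, ' '));

  /* check() returns 0 when the key exists in the hash object */
  if h.check() = 0 then
    /* Cut the last word off; substrn tolerates a resulting length of 0 */
    firm_name = substrn(firm_name, 1, length(firm_name) - length(word));

  drop word;
run;
```

Because only the last word is looked up, an abbreviation in the middle of a name is left alone, and a name such as GOOGLEBV (no blank before BV) is not changed.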
exactly +1 from me
Here's a brute-force method that will work if both of your datasets are small. It is similar to, but less efficient than, @Andreas_Ids's hash method:
data abbrev;
infile datalines truncover;
input word $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;
data firms;
infile datalines truncover;
input firm $50.;
firm=strip(firm);
datalines;
SAS AG
GOOGLEBV
not a match
BOSCHINC
my company LIMITED
also not a match
this has 2 abbrevs HOLDING INC
;
run;
*get abbreviations into a macro variable - assuming that you only have these 24;
*using a ^ as delimiter since some of your abbrevs have spaces;
proc sql noprint;
select word into :abbrevs separated by '^'
from abbrev;
quit;
%put &abbrevs;
%let to_loop = %eval(%sysfunc(countc(&abbrevs., "^"))+1);
%put &to_loop;
data final_firms;
set firms;
length firm_updated $50;
match='false';
i=0;
*check each firm name against the list of abbreviations;
do while (match='false' and i < &to_loop);
i +1;
*get current abbreviation and length ;
abbrev=scan("&abbrevs", i, "^");
abbrev_len=(length(abbrev));
*check for match at end of firm name;
*substrn tolerates a start position below 1 when the abbreviation is longer than the name;
if strip(upcase(substrn(firm, length(firm) - abbrev_len +1))) = strip(upcase(abbrev)) then do;
firm_updated = substrn(firm,1, length(firm) - abbrev_len);
match='true';
end;
else firm_updated=firm;
end;
keep firm firm_updated;
run;