Solved: Re: Removing abbreviations in firms names

jimmychoi · Posted 02-12-2019 09:39 AM

Hi all,

I have a list of abbreviations, as below:

data abbrev;
infile datalines truncover;
input word $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;

I also have a list of firm names, some of which end with one of the abbreviations above.

For each firm name, I want to iterate through the dataset 'abbrev' and check whether the name ends with any of the abbreviations. If it does, simply remove the abbreviation.

Please help.

andreas_lds · Posted 02-12-2019 10:01 AM

Can you post some firm names as data-step using datalines, so that we have something to play with?

To be 100% sure: if one of the abbreviations appears in the middle of a company name, the abbreviation is not removed?

EDIT: I would load the dataset abbrev into a hash object, defining word as the key. Then use

word = scan(firm_name, -1);

to get the last word of each name, check if that word is in the hash-object and finally use prxchange to remove the word from the name:

firm_name = prxchange(cats('s/(.*)\W(', word, ')/$1/'), 1, trim(firm_name));

That should not be too much code to write, and it should perform reasonably fast as long as the number of obs in abbrev is not too high.
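
A minimal sketch of the hash approach described above, not a tested solution. It assumes the abbrev dataset from the question and a hypothetical dataset firms with a character variable firm_name (neither name appears in the original question). Two deliberate deviations from the description: the last word is taken with a blank as the only delimiter so that SA/NV stays in one piece, and substr is swapped in for prxchange so that the slash in SA/NV does not have to be escaped inside the regular expression.

```sas
/* Sketch only. Assumes the ABBREV dataset from the question and a
   hypothetical dataset FIRMS with a character variable FIRM_NAME. */
data firms_clean;
    /* WORD must exist in the PDV with the same length as in ABBREV */
    length word $50;

    /* Load the abbreviation list into a hash object once */
    if _n_ = 1 then do;
        declare hash h(dataset: 'abbrev');
        h.defineKey('word');
        h.defineDone();
    end;

    set firms;

    /* Last blank-delimited word, upper-cased to match the list */
    word = upcase(scan(firm_name, -1, ' '));

    /* Strip it only if it is a listed abbreviation and not the whole name */
    if countw(firm_name, ' ') > 1 and h.check() = 0 then
        firm_name = substr(firm_name, 1, length(firm_name) - length(word));

    drop word;
run;
```

This removes at most one trailing word per name, so a name ending in HOLDING INC keeps HOLDING, and the two-word entry V LP is only reduced by LP. An abbreviation glued to the name without a blank (for example GOOGLEBV) is left alone.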

novinosrin · Posted 02-12-2019 10:08 AM

exactly +1 from me

noling · Posted 02-12-2019 10:59 AM

Here's a brute-force method that will work if both of your datasets are small. It is similar to, but less efficient than, @andreas_lds's hash method:

result:

data abbrev;
infile datalines truncover;
input word $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;

data firms;
infile datalines truncover;
input firm $50.;
firm=strip(firm);
datalines;
SAS AG
GOOGLEBV
not a match
BOSCHINC
my company LIMITED
also not a match
this has 2 abbrevs HOLDING INC
;
run;

*get abbreviations into a macro variable - assuming that you only have these 24;
*using a ^ as delimiter since some of your abbrevs have spaces;
proc sql noprint;
	select word into :abbrevs separated by '^'
	from abbrev;
quit;
%put &abbrevs;
%let to_loop = %eval(%sysfunc(countc(&abbrevs., "^"))+1);
%put &to_loop;
data final_firms;
	set firms;
	*same length as firm so that long names are not truncated;
	length firm_updated $50;

	*default: keep the name unchanged unless an abbreviation matches;
	firm_updated=firm;
	match='false';
	i=0;
	*check each firm name against the list of abbreviations;
	do while (match='false' and i < &to_loop);
		i +1;

		*get current abbreviation and length;
		abbrev=scan("&abbrevs", i, "^");
		abbrev_len=length(abbrev);

		*compare only when the name is longer than the abbreviation so that substr never gets a start position below 1;
		if length(firm) > abbrev_len then do;
			*check for match at end of firm name;
			if strip(upcase(substr(firm, length(firm) - abbrev_len +1))) = strip(upcase(abbrev)) then do;
				firm_updated = substr(firm,1, length(firm) - abbrev_len);
				match='true';
			end;
		end;
	end;

	keep firm firm_updated;
run;
