Solved: removing abbreviations in firm names

jimmychoi · Posted 02-05-2019 09:36 AM

Hi all,

I'm trying to match two different firm names using COMPGED (SPEDIS or SOUNDEX could be used as alternative methods).

But before that, I want to make the firm names as similar as possible by removing abbreviations at the end

(e.g. CO LTD, PTE LTD, Limited, INC, Incorporated, AG, SpA, Corp).

The simplest way would be the TRANWRD function, but I'm afraid it would replace not only the abbreviations but also letters that are part of the firm names. (For example, if I tried to remove 'Corp' at the end of firm names with TRANWRD, 'Corpastta SpA' would become 'astta SpA'.)

So what is the best way to do this, and has anyone done similar work?

Maybe I should use a regular expression?

SuryaKiran · Posted 02-05-2019 12:04 PM

Hello,

You can use a Perl regular expression for pattern matching.

data have;
infile datalines truncover;
input word $50.;
datalines;
Corpastta AB Crop
Corpastta Crop AB
AB Corpastta Crop
AB Corpastta
Crop AB Corpastta
ABCrop Corpastta
;
run;

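data want;
set have;
length new_word1 new_word2 required_word $50;
/* Find 'Crop' as a whole word: mid-name, at the end, or at the start */
position=prxmatch('m/ Crop | Crop\s*$|^Crop /io',word);
if position=0 then new_word1=word; /* no match: keep the name unchanged */
else do;
  /* text before the match (nothing when the match is at position 1) */
  if position>1 then new_word1=substr(word,1,position-1);
  /* text after the match (nothing when the match ends the name) */
  if position+5<=length(word) then new_word2=substr(word,position+5);
end;
required_word=catx(' ',new_word1,new_word2);
run;

You need to include the blanks around the string that you're looking for.

'm/ Crop | Crop\s*$|^Crop /io'

' Crop '     blank before and after: match in the middle of the name
' Crop\s*$'  blank before, then only trailing blanks up to the end of the value
'^Crop '     ^ (caret) anchors the start of the value, with a blank after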
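
The same whole-word idea can be aimed at the original question, where several abbreviations may trail the name (e.g. CO LTD). The following is only a sketch, not part of the accepted answer: the data set names `names` and `cleaned`, the variable `short_name`, and the sample rows are made up for illustration, and the abbreviation list is just the one from the question. It uses PRXCHANGE to delete one or more trailing abbreviations, each of which must be preceded by a blank.

```sas
/* Illustrative sample data (hypothetical rows, not from the thread) */
data names;
  infile datalines truncover;
  input name $50.;
  datalines;
Corpastta SpA
Something CO LTD
Acme Incorporated
Corp Holdings AG
;
run;

data cleaned;
  set names;
  length short_name $50;
  /* (\s+token\.?)+ : one or more trailing tokens, each after a blank  */
  /* \s*$           : allow the trailing padding of a SAS char value   */
  /* i              : ignore case                                      */
  short_name = prxchange(
    's/(\s+(co|pte|ltd|limited|inc|incorporated|ag|spa|corp)\.?)+\s*$//i',
    1, name);
run;
```

Because every token has to follow a blank and the pattern is anchored at the end, 'Corp' inside 'Corpastta' or at the start of 'Corp Holdings AG' should be left alone. Punctuation other than a trailing period (commas, ampersands) is not handled here and would need extra cleanup first.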

Thanks,
Suryakiran

andreas_lds · Posted 02-05-2019 09:48 AM

Please post example data in a usable form. See https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... for details on how to create usable data.

RW9 · Posted 02-05-2019 09:57 AM

If it has delimiters, then use them, e.g.:

data want;
  length want $200;
  test="Something co";
  do i=1 to countw(test," ");
    if scan(test,i," ") ne "co" then want=catx(" ",want,scan(test,i," "));
  end;
run;

Of course that only shows one removal, with spaces as the delimiter, but you get the idea. Without test data in the form of a data step, I can't take it any further.
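
As a rough sketch of how the loop above might be extended to the whole list from the question: compare each token, upcased, against a set of abbreviations. The data set and variable names and the sample rows are hypothetical, and the list is only the one given in the first post.

```sas
/* Sketch: drop whole-word abbreviations, ignoring case.            */
/* The token list and sample names are illustrative assumptions.    */
data want;
  infile datalines truncover;
  length name cleaned $200 token $50;
  input name $200.;
  cleaned = '';
  do i = 1 to countw(name, ' ');
    token = scan(name, i, ' ');
    /* Whole tokens only, so 'Corpastta' is never shortened */
    if upcase(token) not in ('CO', 'LTD', 'PTE', 'LIMITED', 'INC',
                             'INCORPORATED', 'AG', 'SPA', 'CORP') then
      cleaned = catx(' ', cleaned, token);
  end;
  drop i token;
  datalines;
Corpastta SpA
Something Co Ltd
Acme Pte Ltd
;
run;
```

Note that this removes a listed token wherever it occurs, not only at the end of the name, and a name made up entirely of listed tokens comes out blank. If only trailing abbreviations should go, the loop would need to walk backwards from the last word and stop at the first token that is not in the list.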
