Solved: Removing abbreviations in firms names (more questions)

jimmychoi · Posted 02-14-2019 08:59 AM

Hi all,

I have a list of abbreviations, as below,

data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;

and I also have list of firm names for example:

and the abbreviations have a blank (space) in front of them.

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC
;
run;

desired output:

AXXAM
AXXESS COMPOUNDING
AXXIMA PHARMACEUTICALS
BEAR STEARNS HEALTH INNOVENTURES
BEAUBRIDGE
CANNAVEST
CARESTIA
COMIFAR
DUTALYS
ESSEX CHEMIE
IMAGING3

in my previous post andreas_lds recommended to use hash and regex,

thus I was writing the code as below but now i'm stuck in there 😞

data want;
	if _n_=1 then do;
	if 0 then set abbrev;
	   declare hash H (dataset:'abbrev',multidata:'y') ;
	   h.definekey  ("abb") ;
	   h.definedata ("abb") ;
	   h.definedone () ;
	end;
	set firms;
        firm_abb = scan(firmname, -1);
        do rc=h.find(key:firm_abb) by 0 while(rc=0);
            new_firm_name = prxchange(cats('s/(\w+)(word)/$1/'), 1, trim(firm_name));
        end;
run;

this is first time writing perl regex thus i really don't know if it's the right expression.

KachiM · Posted 02-15-2019 02:06 AM

length one 8;

This statement can be deleted. Originally I wanted to force the Data Part of Hash Item to use only 8 bytes to save memory. The default Data Part when the h.DefineData() is not used, will take as many bytes as the Key Part of hash item. Since Data set is loaded while hash object is defined, it is not possible to make the Data Part to take the variable 'one'. This can be achieved if the we load the hash table one item at a time. Each item (namely ABB) when loaded takes 96 bytes in the following program. In the previous program the item size took 144 bytes - a saving of 48 bytes for each item. However, our aim to Lookup the hash table is not affected.

data want;
   if _n_ = 1 then do;
      if 0 then set abbrev;
      length one 8;
      one = 10;
      declare hash h();
      h.definekey('abb');
      h.definedata('one');
      h.definedone();
      do until(eof);
         set abbrev end = eof;
         rc = h.add();
      end;
      *mem1_used = h.item_size;
      *put mem1_used =;
  end;
  length word $50;
  set firms;
  nw = countw(firmname, ' ');
  word = trim(scan(firmname,nw));
  rc = h.find(key:word);
  if rc = 0 then firmname = tranwrd(firmname,word, ' ');
keep firmname;
run;

Your second query to add an 'else part' is not needed. If the 'word' is found then only TRANWRD() function changes 'firmname', otherwise the 'firmname' remains the same. Note that firmname is not assigned to any other Name.

I hope you used my earlier program. You can use this program given above where 'else clause' is not used. I tested with:

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC
jimmychoi
;
run;

Hope this helps your queries.

Best regards

DATAsp

View solution in original post

KachiM · Posted 02-14-2019 10:57 AM

The following steps be followed:

Save 'abb"s to a hash table.

Read firmname, take the last word.

Lookup the Hash Table. If found translate the word to blanks in firmname.

data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC;
run;

data want;
   if _n_ = 1 then do;
      if 0 then set abbrev;
      length one 8; 
      declare hash h(dataset:'abbrev');
      h.definekey('abb');
      h.definedone();
  end;
  length word $50;
  set firms;
  nw = countw(firmname, ' ');
  word = trim(scan(firmname,nw));
  rc = h.find(key:word);
  if rc = 0 then firmname = tranwrd(firmname,word, ' ');
keep firmname;
run;

proc print data = want;
run;

Obs 	firmname
1 	AXXESS COMPOUNDING
2 	AXXIMA PHARMACEUTICALS
3 	BEAR STEARNS HEALTH INNOVENTURES
4 	BEAUBRIDGE LLP
5 	CANNAVEST CORP
6 	CARESTIA
7 	COMIFAR
8 	DUTALYS
9 	ESSEX CHEMIE

DATAsp

jimmychoi · Posted 02-15-2019 12:32 AM

Hi datasp, two quick questions:

1. what is this line for?

    length one 8;

2. For firm names that do not contain abbreviations, I get blank in the firm names, which means their entire name is deleted. Shall I write something like,

if rc = 0 then firmname = tranwrd(firmname,word, ' ');
else if rc ne 0 then firmname;

to keep the original name?

KachiM · Posted 02-15-2019 02:06 AM

length one 8;

This statement can be deleted. Originally I wanted to force the Data Part of Hash Item to use only 8 bytes to save memory. The default Data Part when the h.DefineData() is not used, will take as many bytes as the Key Part of hash item. Since Data set is loaded while hash object is defined, it is not possible to make the Data Part to take the variable 'one'. This can be achieved if the we load the hash table one item at a time. Each item (namely ABB) when loaded takes 96 bytes in the following program. In the previous program the item size took 144 bytes - a saving of 48 bytes for each item. However, our aim to Lookup the hash table is not affected.

data want;
   if _n_ = 1 then do;
      if 0 then set abbrev;
      length one 8;
      one = 10;
      declare hash h();
      h.definekey('abb');
      h.definedata('one');
      h.definedone();
      do until(eof);
         set abbrev end = eof;
         rc = h.add();
      end;
      *mem1_used = h.item_size;
      *put mem1_used =;
  end;
  length word $50;
  set firms;
  nw = countw(firmname, ' ');
  word = trim(scan(firmname,nw));
  rc = h.find(key:word);
  if rc = 0 then firmname = tranwrd(firmname,word, ' ');
keep firmname;
run;

Your second query to add an 'else part' is not needed. If the 'word' is found then only TRANWRD() function changes 'firmname', otherwise the 'firmname' remains the same. Note that firmname is not assigned to any other Name.

I hope you used my earlier program. You can use this program given above where 'else clause' is not used. I tested with:

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC
jimmychoi
;
run;

Hope this helps your queries.

Best regards

DATAsp

kiranv_ · Posted 02-14-2019 11:28 AM

i am just keeping the changes made. I have to do SA\/NV, as \ is escape character other wise prxchange will have issue as it things / as end of prxchange argument1.

cats("\b[a-zA-Z]*",abb,"[a-zA-Z]*$") into macro variable resolves (showing only partial value.

\b[a-zA-Z]*AG[a-zA-Z]*$|\b[a-zA-Z]*BV[a-zA-Z]*$ | is for or

\b[a-zA-Z]*AG[a-zA-Z]*$

\b=border can be start or end of word

[a-zA-Z]* alphabets can be there or not there

AG is your corporate word

[a-zA-Z]*$ alphabets can be there or not there as last word

in short we are catching last word and seeing whether any alphabets are there infront or back of it. and by using | we are telling to check for all the last words for your corporation(from your lookup table.)

data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
SA\/NV
;
run;

proc sql;
select cats("\b[a-zA-Z]*",abb,"[a-zA-Z]*$") into :allabbs separated by "|"
from abbrev;


data final;
set firms;

new_firmname = compress(prxchange("s/&allabbs//", -1, trim(firmname)),"/");
run;

proc print;

KachiM · Posted 02-14-2019 11:52 AM

In my previous answer, the "ABBREV" Data Set does not include "CORP" and "LLP". So firmnames with these two "abb"s are not removed.

"CORP" and "CORPORATION" is not same for matching. The OP can Add these two to get the correct answer. There is no change in the program.

DATAsp

andreas_lds · Posted 02-15-2019 02:07 AM

Sorry, i am late 😉

You should add

GOOGLE V LP
MICROSOFT SV/NV

to the datalines of firms, so that the difficult things are included.

With blanks in the abbreviations using a hash objects seems to be of no help. Because you can't get the complete abbreviation with

firm_abb = scan(firmname, -1);

you can't use firm_abb as key for the hash-object.

This works without having blanks in variable abb:

data want;
   if 0 then set work.abbrev;
   set work.firms;

   if _n_ = 1 then do;
      declare hash h(dataset: 'work.abbrev');
      h.defineKey('abb');
      h.defineDone();
   end;

   abb = scan(firmname, -1);

   if h.check() = 0 then do;
      firmname = prxchange(cats('s/(.*)\W', abb,'/$1/'), 1, firmname);
   end;
run;

With the new setup i skip hash and use a loop reading work.abbrev with point-option:

data abbrev;
   infile datalines truncover;
   input abb $50.;

/* see comment blow code */
   lenny = length(abb);

   datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;


proc sort data=work.abbrev;
   by descending lenny;
run;


data want;
   set work.firms;

   i = 1;

   do while (i <= nobs);
      set work.abbrev point=i nobs=nobs;

      if prxmatch(cat('/(.*) ', trim(tranwrd(abb, '/', '\/')), '/'), trim(firmname)) then do;
         firmname = prxchange(cat('s/(.*) ', trim(tranwrd(abb, '/', '\/')), '/$1/'), 1, firmname);
         i = nobs + 1;
      end;

      i = i + 1;
   end;

   drop abb;
run;

This is not giving the expected results for "BEAUBRIDGE LLP" because "LLP" is not in the list at all. The sorting is required so that "V LP" is checked before "LP". I am not sure that this will work for abbreviations not on the current list.

Removing abbreviations in firms names (more questions)

Re: Removing abbreviations in firms names (more questions)

Re: Removing abbreviations in firms names (more questions)

Re: Removing abbreviations in firms names (more questions)

Re: Removing abbreviations in firms names (more questions)

Re: Removing abbreviations in firms names (more questions)

Re: Removing abbreviations in firms names (more questions)

Re: Removing abbreviations in firms names (more questions)

Classroom Training Available!