BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jimmychoi
Obsidian | Level 7

Hi all,

 

I have a list of abbreviations, as below,

 

data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;

 


and I also have list of firm names for example:

and the abbreviations have a blank (space) in front of them.

 

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA AXXESS COMPOUNDING LP AXXIMA PHARMACEUTICALS AG BEAR STEARNS HEALTH INNOVENTURES LP BEAUBRIDGE LLP CANNAVEST CORP CARESTIA SA COMIFAR SPA DUTALYS GMBH ESSEX CHEMIE AG IMAGING3 INC
; run;

 

 

desired output:

 

AXXAM
AXXESS COMPOUNDING
AXXIMA PHARMACEUTICALS
BEAR STEARNS HEALTH INNOVENTURES
BEAUBRIDGE
CANNAVEST
CARESTIA
COMIFAR
DUTALYS
ESSEX CHEMIE
IMAGING3

 

in my previous post  recommended to use hash and regex,

thus I was writing the code as below but now i'm stuck in there 😞

 

data want;
	if _n_=1 then do;
	if 0 then set abbrev;
	   declare hash H (dataset:'abbrev',multidata:'y') ;
	   h.definekey  ("abb") ;
	   h.definedata ("abb") ;
	   h.definedone () ;
	end;
	set firms;
firm_abb = scan(firmname, -1);
do rc=h.find(key:firm_abb) by 0 while(rc=0); new_firm_name = prxchange(cats('s/(\w+)(word)/$1/'), 1, trim(firm_name));
end; run;

this is first time writing perl regex thus i really don't know if it's the right expression.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
KachiM
Rhodochrosite | Level 12
length one 8;

This statement can be deleted. Originally I wanted to force the Data Part of Hash Item to use only 8 bytes to save memory. The default Data Part when the h.DefineData() is not used, will take as many bytes as the Key Part of hash item. Since Data set is loaded while hash object is defined, it is not possible to make the Data Part to take the variable 'one'. This can be achieved if the we load the hash table one item at a time. Each item (namely ABB) when loaded takes 96 bytes in the following program. In the previous program the item size took 144 bytes - a saving of 48 bytes for each item. However, our aim to Lookup the hash table is not affected.

 

data want;
   if _n_ = 1 then do;
      if 0 then set abbrev;
      length one 8;
      one = 10;
      declare hash h();
      h.definekey('abb');
      h.definedata('one');
      h.definedone();
      do until(eof);
         set abbrev end = eof;
         rc = h.add();
      end;
      *mem1_used = h.item_size;
      *put mem1_used =;
  end;
  length word $50;
  set firms;
  nw = countw(firmname, ' ');
  word = trim(scan(firmname,nw));
  rc = h.find(key:word);
  if rc = 0 then firmname = tranwrd(firmname,word, ' ');
keep firmname;
run;

 Your second query to add an 'else part' is not needed. If the 'word' is found then only TRANWRD() function changes 'firmname', otherwise the 'firmname' remains the same. Note that firmname is not assigned to any other Name.

 

I hope you used my earlier program. You can use this program given above  where 'else clause' is not used. I tested with:

 

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC
jimmychoi
;
run;

Hope this helps your queries.

 

Best regards

DATAsp

 

View solution in original post

6 REPLIES 6
KachiM
Rhodochrosite | Level 12

The following steps be followed:

Save 'abb"s to a hash table.

Read firmname, take the last word.

Lookup the Hash Table. If found translate the word to blanks in firmname.

 

data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC;
run;

data want;
   if _n_ = 1 then do;
      if 0 then set abbrev;
      length one 8; 
      declare hash h(dataset:'abbrev');
      h.definekey('abb');
      h.definedone();
  end;
  length word $50;
  set firms;
  nw = countw(firmname, ' ');
  word = trim(scan(firmname,nw));
  rc = h.find(key:word);
  if rc = 0 then firmname = tranwrd(firmname,word, ' ');
keep firmname;
run;

proc print data = want;
run;

Obs 	firmname
1 	AXXESS COMPOUNDING
2 	AXXIMA PHARMACEUTICALS
3 	BEAR STEARNS HEALTH INNOVENTURES
4 	BEAUBRIDGE LLP
5 	CANNAVEST CORP
6 	CARESTIA
7 	COMIFAR
8 	DUTALYS
9 	ESSEX CHEMIE

DATAsp

jimmychoi
Obsidian | Level 7

Hi datasp, two quick questions:

1. what is this line for?

    length one 8;

 

2. For firm names that do not contain abbreviations, I get blank in the firm names, which means their entire name is deleted. Shall I write something like, 

if rc = 0 then firmname = tranwrd(firmname,word, ' ');
else if rc ne 0 then firmname;

to keep the original name?

KachiM
Rhodochrosite | Level 12
length one 8;

This statement can be deleted. Originally I wanted to force the Data Part of Hash Item to use only 8 bytes to save memory. The default Data Part when the h.DefineData() is not used, will take as many bytes as the Key Part of hash item. Since Data set is loaded while hash object is defined, it is not possible to make the Data Part to take the variable 'one'. This can be achieved if the we load the hash table one item at a time. Each item (namely ABB) when loaded takes 96 bytes in the following program. In the previous program the item size took 144 bytes - a saving of 48 bytes for each item. However, our aim to Lookup the hash table is not affected.

 

data want;
   if _n_ = 1 then do;
      if 0 then set abbrev;
      length one 8;
      one = 10;
      declare hash h();
      h.definekey('abb');
      h.definedata('one');
      h.definedone();
      do until(eof);
         set abbrev end = eof;
         rc = h.add();
      end;
      *mem1_used = h.item_size;
      *put mem1_used =;
  end;
  length word $50;
  set firms;
  nw = countw(firmname, ' ');
  word = trim(scan(firmname,nw));
  rc = h.find(key:word);
  if rc = 0 then firmname = tranwrd(firmname,word, ' ');
keep firmname;
run;

 Your second query to add an 'else part' is not needed. If the 'word' is found then only TRANWRD() function changes 'firmname', otherwise the 'firmname' remains the same. Note that firmname is not assigned to any other Name.

 

I hope you used my earlier program. You can use this program given above  where 'else clause' is not used. I tested with:

 

data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC
jimmychoi
;
run;

Hope this helps your queries.

 

Best regards

DATAsp

 

kiranv_
Rhodochrosite | Level 12

i am just keeping the changes made. I have to do SA\/NV, as \ is escape character  other wise prxchange will have issue as  it things / as end of prxchange argument1.

 

cats("\b[a-zA-Z]*",abb,"[a-zA-Z]*$") into macro variable resolves (showing only partial value.

 

\b[a-zA-Z]*AG[a-zA-Z]*$|\b[a-zA-Z]*BV[a-zA-Z]*$   | is for or 

 

\b[a-zA-Z]*AG[a-zA-Z]*$     

 

\b=border can be start or end of word 

[a-zA-Z]* alphabets can be there or not there

AG is your corporate word

[a-zA-Z]*$ alphabets can be there or not there as last word

 

in short we are catching last word and seeing whether any alphabets are there infront or back of it. and by using | we are telling to check for all the last words for your corporation(from your lookup table.)

 

 

 

 

 

data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
SA\/NV
;
run;

proc sql;
select cats("\b[a-zA-Z]*",abb,"[a-zA-Z]*$") into :allabbs separated by "|"
from abbrev;


data final;
set firms;

new_firmname = compress(prxchange("s/&allabbs//", -1, trim(firmname)),"/");
run;

proc print;


Screen Shot 2019-02-14 at 11.27.38 AM.png

KachiM
Rhodochrosite | Level 12

In my previous answer, the "ABBREV" Data Set does not include "CORP" and "LLP". So firmnames with these two "abb"s are not removed.

"CORP" and "CORPORATION" is not same for matching. The OP can Add these two to get the correct answer. There is no change in the program.

DATAsp

andreas_lds
Jade | Level 19

Sorry, i am late 😉

 

You should add

GOOGLE V LP
MICROSOFT SV/NV

to the datalines of firms, so that the difficult things are included.

 

With blanks in the abbreviations using a hash objects seems to be of no help. Because you can't get the complete abbreviation with

firm_abb = scan(firmname, -1);

you can't use firm_abb as key for the hash-object.

 

This works without having blanks in variable abb:

data want;
   if 0 then set work.abbrev;
   set work.firms;

   if _n_ = 1 then do;
      declare hash h(dataset: 'work.abbrev');
      h.defineKey('abb');
      h.defineDone();
   end;

   abb = scan(firmname, -1);

   if h.check() = 0 then do;
      firmname = prxchange(cats('s/(.*)\W', abb,'/$1/'), 1, firmname);
   end;
run;

With the new setup i skip hash and use a loop reading work.abbrev with point-option:

data abbrev;
   infile datalines truncover;
   input abb $50.;

/* see comment blow code */
   lenny = length(abb);

   datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;


proc sort data=work.abbrev;
   by descending lenny;
run;


data want;
   set work.firms;

   i = 1;

   do while (i <= nobs);
      set work.abbrev point=i nobs=nobs;

      if prxmatch(cat('/(.*) ', trim(tranwrd(abb, '/', '\/')), '/'), trim(firmname)) then do;
         firmname = prxchange(cat('s/(.*) ', trim(tranwrd(abb, '/', '\/')), '/$1/'), 1, firmname);
         i = nobs + 1;
      end;

      i = i + 1;
   end;

   drop abb;
run;

This is not giving the expected results for "BEAUBRIDGE LLP" because "LLP" is not in the list at all. The sorting is required so that "V LP" is checked before "LP". I am not sure that this will work for abbreviations not on the current list.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1143 views
  • 0 likes
  • 4 in conversation