Hi all,
I have a list of abbreviations, as below,
data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;
and I also have list of firm names for example:
and the abbreviations have a blank (space) in front of them.
data firms;
infile datalines truncover;
input firmname $100.;
datalines;
AXXAM SPA
AXXESS COMPOUNDING LP
AXXIMA PHARMACEUTICALS AG
BEAR STEARNS HEALTH INNOVENTURES LP
BEAUBRIDGE LLP
CANNAVEST CORP
CARESTIA SA
COMIFAR SPA
DUTALYS GMBH
ESSEX CHEMIE AG
IMAGING3 INC
;
run;
desired output:
AXXAM
AXXESS COMPOUNDING
AXXIMA PHARMACEUTICALS
BEAR STEARNS HEALTH INNOVENTURES
BEAUBRIDGE
CANNAVEST
CARESTIA
COMIFAR
DUTALYS
ESSEX CHEMIE
IMAGING3
in my previous post andreas_lds recommended to use hash and regex,
thus I was writing the code as below but now i'm stuck in there 😞
data want;
if _n_=1 then do;
if 0 then set abbrev;
declare hash H (dataset:'abbrev',multidata:'y') ;
h.definekey ("abb") ;
h.definedata ("abb") ;
h.definedone () ;
end;
set firms;
firm_abb = scan(firmname, -1);
do rc=h.find(key:firm_abb) by 0 while(rc=0);
new_firm_name = prxchange(cats('s/(\w+)(word)/$1/'), 1, trim(firm_name));
end;
run;
this is first time writing perl regex thus i really don't know if it's the right expression.
length one 8;
This statement can be deleted. Originally I wanted to force the Data Part of Hash Item to use only 8 bytes to save memory. The default Data Part when the h.DefineData() is not used, will take as many bytes as the Key Part of hash item. Since Data set is loaded while hash object is defined, it is not possible to make the Data Part to take the variable 'one'. This can be achieved if the we load the hash table one item at a time. Each item (namely ABB) when loaded takes 96 bytes in the following program. In the previous program the item size took 144 bytes - a saving of 48 bytes for each item. However, our aim to Lookup the hash table is not affected.
data want; if _n_ = 1 then do; if 0 then set abbrev; length one 8; one = 10; declare hash h(); h.definekey('abb'); h.definedata('one'); h.definedone(); do until(eof); set abbrev end = eof; rc = h.add(); end; *mem1_used = h.item_size; *put mem1_used =; end; length word $50; set firms; nw = countw(firmname, ' '); word = trim(scan(firmname,nw)); rc = h.find(key:word); if rc = 0 then firmname = tranwrd(firmname,word, ' '); keep firmname; run;
Your second query to add an 'else part' is not needed. If the 'word' is found then only TRANWRD() function changes 'firmname', otherwise the 'firmname' remains the same. Note that firmname is not assigned to any other Name.
I hope you used my earlier program. You can use this program given above where 'else clause' is not used. I tested with:
data firms; infile datalines truncover; input firmname $100.; datalines; AXXAM SPA AXXESS COMPOUNDING LP AXXIMA PHARMACEUTICALS AG BEAR STEARNS HEALTH INNOVENTURES LP BEAUBRIDGE LLP CANNAVEST CORP CARESTIA SA COMIFAR SPA DUTALYS GMBH ESSEX CHEMIE AG IMAGING3 INC jimmychoi ; run;
Hope this helps your queries.
Best regards
DATAsp
The following steps be followed:
Save 'abb"s to a hash table.
Read firmname, take the last word.
Lookup the Hash Table. If found translate the word to blanks in firmname.
data abbrev; infile datalines truncover; input abb $50.; datalines; AG BV CORPORATION GMBH INC LIMITED LLC LP LTD PJSC PLC PTE PTY SA SA/NV SL SPA SRL COMPANY V LP CO NV HOLDINGS HOLDING ; run; data firms; infile datalines truncover; input firmname $100.; datalines; AXXAM SPA AXXESS COMPOUNDING LP AXXIMA PHARMACEUTICALS AG BEAR STEARNS HEALTH INNOVENTURES LP BEAUBRIDGE LLP CANNAVEST CORP CARESTIA SA COMIFAR SPA DUTALYS GMBH ESSEX CHEMIE AG IMAGING3 INC; run; data want; if _n_ = 1 then do; if 0 then set abbrev; length one 8; declare hash h(dataset:'abbrev'); h.definekey('abb'); h.definedone(); end; length word $50; set firms; nw = countw(firmname, ' '); word = trim(scan(firmname,nw)); rc = h.find(key:word); if rc = 0 then firmname = tranwrd(firmname,word, ' '); keep firmname; run; proc print data = want; run; Obs firmname 1 AXXESS COMPOUNDING 2 AXXIMA PHARMACEUTICALS 3 BEAR STEARNS HEALTH INNOVENTURES 4 BEAUBRIDGE LLP 5 CANNAVEST CORP 6 CARESTIA 7 COMIFAR 8 DUTALYS 9 ESSEX CHEMIE
DATAsp
Hi datasp, two quick questions:
1. what is this line for?
length one 8;
2. For firm names that do not contain abbreviations, I get blank in the firm names, which means their entire name is deleted. Shall I write something like,
if rc = 0 then firmname = tranwrd(firmname,word, ' ');
else if rc ne 0 then firmname;
to keep the original name?
length one 8;
This statement can be deleted. Originally I wanted to force the Data Part of Hash Item to use only 8 bytes to save memory. The default Data Part when the h.DefineData() is not used, will take as many bytes as the Key Part of hash item. Since Data set is loaded while hash object is defined, it is not possible to make the Data Part to take the variable 'one'. This can be achieved if the we load the hash table one item at a time. Each item (namely ABB) when loaded takes 96 bytes in the following program. In the previous program the item size took 144 bytes - a saving of 48 bytes for each item. However, our aim to Lookup the hash table is not affected.
data want; if _n_ = 1 then do; if 0 then set abbrev; length one 8; one = 10; declare hash h(); h.definekey('abb'); h.definedata('one'); h.definedone(); do until(eof); set abbrev end = eof; rc = h.add(); end; *mem1_used = h.item_size; *put mem1_used =; end; length word $50; set firms; nw = countw(firmname, ' '); word = trim(scan(firmname,nw)); rc = h.find(key:word); if rc = 0 then firmname = tranwrd(firmname,word, ' '); keep firmname; run;
Your second query to add an 'else part' is not needed. If the 'word' is found then only TRANWRD() function changes 'firmname', otherwise the 'firmname' remains the same. Note that firmname is not assigned to any other Name.
I hope you used my earlier program. You can use this program given above where 'else clause' is not used. I tested with:
data firms; infile datalines truncover; input firmname $100.; datalines; AXXAM SPA AXXESS COMPOUNDING LP AXXIMA PHARMACEUTICALS AG BEAR STEARNS HEALTH INNOVENTURES LP BEAUBRIDGE LLP CANNAVEST CORP CARESTIA SA COMIFAR SPA DUTALYS GMBH ESSEX CHEMIE AG IMAGING3 INC jimmychoi ; run;
Hope this helps your queries.
Best regards
DATAsp
i am just keeping the changes made. I have to do SA\/NV, as \ is escape character other wise prxchange will have issue as it things / as end of prxchange argument1.
cats("\b[a-zA-Z]*",abb,"[a-zA-Z]*$") into macro variable resolves (showing only partial value.
\b[a-zA-Z]*AG[a-zA-Z]*$|\b[a-zA-Z]*BV[a-zA-Z]*$ | is for or
\b[a-zA-Z]*AG[a-zA-Z]*$
\b=border can be start or end of word
[a-zA-Z]* alphabets can be there or not there
AG is your corporate word
[a-zA-Z]*$ alphabets can be there or not there as last word
in short we are catching last word and seeing whether any alphabets are there infront or back of it. and by using | we are telling to check for all the last words for your corporation(from your lookup table.)
data abbrev;
infile datalines truncover;
input abb $50.;
datalines;
SA\/NV
;
run;
proc sql;
select cats("\b[a-zA-Z]*",abb,"[a-zA-Z]*$") into :allabbs separated by "|"
from abbrev;
data final;
set firms;
new_firmname = compress(prxchange("s/&allabbs//", -1, trim(firmname)),"/");
run;
proc print;
In my previous answer, the "ABBREV" Data Set does not include "CORP" and "LLP". So firmnames with these two "abb"s are not removed.
"CORP" and "CORPORATION" is not same for matching. The OP can Add these two to get the correct answer. There is no change in the program.
DATAsp
Sorry, i am late 😉
You should add
GOOGLE V LP MICROSOFT SV/NV
to the datalines of firms, so that the difficult things are included.
With blanks in the abbreviations using a hash objects seems to be of no help. Because you can't get the complete abbreviation with
firm_abb = scan(firmname, -1);
you can't use firm_abb as key for the hash-object.
This works without having blanks in variable abb:
data want;
if 0 then set work.abbrev;
set work.firms;
if _n_ = 1 then do;
declare hash h(dataset: 'work.abbrev');
h.defineKey('abb');
h.defineDone();
end;
abb = scan(firmname, -1);
if h.check() = 0 then do;
firmname = prxchange(cats('s/(.*)\W', abb,'/$1/'), 1, firmname);
end;
run;
With the new setup i skip hash and use a loop reading work.abbrev with point-option:
data abbrev;
infile datalines truncover;
input abb $50.;
/* see comment blow code */
lenny = length(abb);
datalines;
AG
BV
CORPORATION
GMBH
INC
LIMITED
LLC
LP
LTD
PJSC
PLC
PTE
PTY
SA
SA/NV
SL
SPA
SRL
COMPANY
V LP
CO
NV
HOLDINGS
HOLDING
;
run;
proc sort data=work.abbrev;
by descending lenny;
run;
data want;
set work.firms;
i = 1;
do while (i <= nobs);
set work.abbrev point=i nobs=nobs;
if prxmatch(cat('/(.*) ', trim(tranwrd(abb, '/', '\/')), '/'), trim(firmname)) then do;
firmname = prxchange(cat('s/(.*) ', trim(tranwrd(abb, '/', '\/')), '/$1/'), 1, firmname);
i = nobs + 1;
end;
i = i + 1;
end;
drop abb;
run;
This is not giving the expected results for "BEAUBRIDGE LLP" because "LLP" is not in the list at all. The sorting is required so that "V LP" is checked before "LP". I am not sure that this will work for abbreviations not on the current list.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.