I have a requirement like this: There are some frequently quoted phrases. If Company_name contains one of these frequently quoted phrases in quotes, then we need to remove the quotes and the characters contained within them.
Can you please help to achieve this? I have 230 million company names and 25 frequently quoted phrases. I have to search 230mm company names for these 25 phrases.
Sample Data is like below.
Company Names:
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
Frequently quoted phrases:
De Peddelaars
De Surdelaars
Expected output:
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging
Hash Table :
data a; input name & $80.; cards; Awolsk Kanovereniging "De Baldelaars" Awolsk Kanovereniging "De Maldelaars" Awolsk Kanoverrriging "De Surdelaars" Awolsk Kanovereniging "De Udadelaars" Awolsk Kanovereniging "De Peddelaars" ; run; data b; input key & $40.; cards; De Peddelaars De Surdelaars ; run; data want; if _n_ eq 1 then do; if 0 then set a; declare hash h(dataset:'b'); h.definekey('key'); h.definedone(); end; set a; key=scan(trim(name),-1,'"'); if h.check()=0 then name=scan(name,1,'"'); drop key; run;
Xia Keshan
Hash Table :
data a; input name & $80.; cards; Awolsk Kanovereniging "De Baldelaars" Awolsk Kanovereniging "De Maldelaars" Awolsk Kanoverrriging "De Surdelaars" Awolsk Kanovereniging "De Udadelaars" Awolsk Kanovereniging "De Peddelaars" ; run; data b; input key & $40.; cards; De Peddelaars De Surdelaars ; run; data want; if _n_ eq 1 then do; if 0 then set a; declare hash h(dataset:'b'); h.definekey('key'); h.definedone(); end; set a; key=scan(trim(name),-1,'"'); if h.check()=0 then name=scan(name,1,'"'); drop key; run;
Xia Keshan
Thank you. We tried Hash instead of SQL. This worked as per the requirement.
I'm sure there's a way to do it with hash tables...but here's a relatively straight forward approach:
data companies;
format Company_Name $50.;
input Company_Name $50.;
datalines;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
run;
data patterns;
format Phrase $50.;
input Phrase $50.;
datalines;
"De Peddelaars"
"De Surdelaars"
;
run;
proc sql;
create table want as
select
tranwrd(c.company_name,trim(p.phrase),'') as test
from
companies c
left outer join patterns p
on c.company_name like cats("%",p.phrase,"%");
quit;
produces
Awolsk Kanovereniging
Awolsk Kanoverrriging
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanovereniging "De Udadelaars"
Hi,
The only thing I would add is a couple of other options to do the code (as 230m records might cause some disk space issues with create table):
The first I thought of was update table, however you may also run into space issues.
The other is to generate a datastep from your source data (which may be more resource efficient):
data _null_;
set patterns end=last;
if _n_=1 then call execute('data companies; set companies;');
call execute(' company_name=tranwrd(company_name,'||"'"||strip(phrase)||"'"||',"");'); /* Note single and double quotes! */
if last then call execute(' run;');
run;
A final option is to move the quote to a new field using " as a delimiter, then its a simple where clause.
Would contains operator be better ?
c.company_name contains strip(p.phrase)
I'm not sure. Might have to test that.
As far as table size, it always depends on where the original data lives. If it is a SAS dataset, then using a hash function in conjunction with a data step modify might be fastest. If the source is some other data management system (i.e. Oracle), then you'll probably have to create a separate dataset anyway.
data want(drop=phrase);
set have;
length phrase $20;
do phrase = 'De Peddelaars', 'De Surdelaars';
company_name = tranwrd(strip(company_name),'"'||strip(phrase)||'"','');
end;
run;
Wouldn't that eliminate the quotes even when they aren't part of the phrase that's being suppressed?
I think it will not eliminate quotes that aren't part of phrase.
Assuming these 25 phrases are not too long using a regular expression could be worth trying. Below example uses the SAS function prxchange().
A lot of data bases have their own implementation of similar functions so if the data is stored in a data base you could implement using explicit pass-through using the similar db function.
data companies;
format Company_Name $50.;
input Company_Name $50.;
datalines;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
run;
data patterns;
format Phrase $50.;
input Phrase $50.;
datalines;
De Peddelaars
De Surdelaars
;
run;
proc sql noprint;
select cats('"',Phrase,'"') into :pattern_list separated by '|'
from patterns
;
quit;
%put xxx: %bquote(&pattern_list);
%let reg_ex=%str(%')s/&pattern_list//oi%str(%');
%put yyy: ®_ex;
proc sql;
create table want as
select
Company_Name,
prxchange(%unquote(®_ex),1,company_name) as Company_Name_Cleansed format=$50.
from companies
;
quit;
Patrick,
As far as I know there is a limitation of the number of matched string for prxchange(.) . I don't know if prxchange() is good enough to hold all of those OP's key string .
Xia Keshan
There will certainly be a limit both for the length and number of elements of a RegEx string as well as some point where the number of OR conditions becomes inefficient. I don't know though where these limits are and could only find out with try and error.
It appears to work with 26 "phrases" though. If this is more or less efficient than the other approaches suggested would need testing with real data and volumes.
And here the code with 26 "phrases". I needed to use "options noquotelenmax;" in order to suppress the warning created by a macro variable string longer than 256 characters used as regular expression.
data companies;
format Company_Name $50.;
input Company_Name $50.;
datalines;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
run;
data patterns(drop=_:);
format Phrase $50.;
input Phrase $50.;
_Phrase=Phrase;
do _i=1 to 12;
Phrase=cats(_Phrase,_i);
output;
end;
Phrase=_Phrase;
output;
datalines;
De Peddelaars
De Surdelaars
;
run;
proc sql noprint;
select cats('"',Phrase,'"') into :pattern_list separated by '|'
from patterns
;
quit;
%put xxx: %bquote(&pattern_list);
%let reg_ex=%str(%')s/&pattern_list//oi%str(%');
%put yyy: ®_ex;
options noquotelenmax;
proc sql;
create table want as
select
Company_Name,
prxchange(%unquote(®_ex),1,company_name) as Company_Name_Cleansed format=$50.
from companies
;
quit;
options quotelenmax;
It might be worth trying to update instead of creating a new copy of the table. Something like:
data a;
input name & $80.;
datalines;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
filename t TEMP;
data _null_;
input key & $40.;
line = cats(
'update a set name = tranwrd(name, """',
key,
'""","") where name contains """',
key,
'""";'
);
file t;
put line;
datalines;
De Peddelaars
De Surdelaars
;
proc sql;
%include t / source2;
quit;
Ideally, SAS/SQL would get the updates done on the DBMS server, if that's where the data resides.
PG
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.