BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gandikk
Obsidian | Level 7

I have a requirement like this: There are some frequently quoted phrases. If Company_name contains one of these frequently quoted phrases in quotes, then we need to remove the quotes and the characters contained within them.

Can you please help to achieve this? I have 230 million company names and 25 frequently quoted phrases. I have to search 230mm company names for these 25 phrases.

Sample Data is like below.

Company Names:

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

Frequently quoted phrases:

De Peddelaars

De Surdelaars

Expected output:

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging

SAS@EMMAUS
1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Hash Table :

data a;
input name & $80.;
cards;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
run;
data b;
input key & $40.;
cards;
De Peddelaars
De Surdelaars
;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set a;
  declare hash h(dataset:'b');
  h.definekey('key');
  h.definedone();
 end;
set a;
 key=scan(trim(name),-1,'"');
 if h.check()=0 then name=scan(name,1,'"');
 drop key;
run;

Xia Keshan

View solution in original post

13 REPLIES 13
Ksharp
Super User

Hash Table :

data a;
input name & $80.;
cards;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
run;
data b;
input key & $40.;
cards;
De Peddelaars
De Surdelaars
;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set a;
  declare hash h(dataset:'b');
  h.definekey('key');
  h.definedone();
 end;
set a;
 key=scan(trim(name),-1,'"');
 if h.check()=0 then name=scan(name,1,'"');
 drop key;
run;

Xia Keshan

gandikk
Obsidian | Level 7

Thank you. We tried Hash instead of SQL. This worked as per the requirement.

SAS@EMMAUS
DBailey
Lapis Lazuli | Level 10

I'm sure there's a way to do it with hash tables...but here's a relatively straight forward approach:

data companies;

format Company_Name $50.;

input Company_Name $50.;

datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

run;

data patterns;

format Phrase $50.;

input Phrase $50.;

datalines;

"De Peddelaars"

"De Surdelaars"

;

run;

proc sql;

create table want as

select

    tranwrd(c.company_name,trim(p.phrase),'') as test

from

    companies c

    left outer join  patterns p

        on c.company_name like cats("%",p.phrase,"%");

quit;

produces

Awolsk Kanovereniging

Awolsk Kanoverrriging

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanovereniging "De Udadelaars"

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi,

The only thing I would add is a couple of other options to do the code (as 230m records might cause some disk space issues with create table):

The first I thought of was update table, however you may also run into space issues.

The other is to generate a datastep from your source data (which may be more resource efficient):

data _null_;

     set patterns end=last;

     if _n_=1 then call execute('data companies; set companies;');

     call execute(' company_name=tranwrd(company_name,'||"'"||strip(phrase)||"'"||',"");');   /* Note single and double quotes! */

     if last then call execute(' run;');

run;

A final option is to move the quote to a new field using " as a delimiter, then its a simple where clause.

Ksharp
Super User

Would contains operator be better ?

c.company_name contains strip(p.phrase)

DBailey
Lapis Lazuli | Level 10

I'm not sure.  Might have to test that.

As far as table size, it always depends on where the original data lives.  If it is a SAS dataset, then using a hash function in conjunction with a data step modify  might be fastest.  If the source is some other data management system (i.e. Oracle), then you'll probably have to create a separate dataset anyway.

stat_sas
Ammonite | Level 13

data want(drop=phrase);

set have;

length phrase $20;

do phrase = 'De Peddelaars', 'De Surdelaars';

  company_name = tranwrd(strip(company_name),'"'||strip(phrase)||'"','');

end;

run;

DBailey
Lapis Lazuli | Level 10

Wouldn't that eliminate the quotes even when they aren't part of the phrase that's being suppressed?

stat_sas
Ammonite | Level 13

I think it will not eliminate quotes that aren't part of phrase.

Patrick
Opal | Level 21

Assuming these 25 phrases are not too long using a regular expression could be worth trying. Below example uses the SAS function prxchange().

A lot of data bases have their own implementation of similar functions so if the data is stored in a data base you could implement using explicit pass-through using the similar db function.

data companies;

  format Company_Name $50.;

  input Company_Name $50.;

  datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

run;

data patterns;

  format Phrase $50.;

  input Phrase $50.;

  datalines;

De Peddelaars

De Surdelaars

;

run;

proc sql noprint;

  select cats('"',Phrase,'"') into :pattern_list separated by '|'

  from patterns

  ;

quit;

%put xxx: %bquote(&pattern_list);

%let reg_ex=%str(%')s/&pattern_list//oi%str(%');

%put yyy: &reg_ex;

proc sql;

  create table want as

    select

      Company_Name,

      prxchange(%unquote(&reg_ex),1,company_name) as Company_Name_Cleansed format=$50.    

    from companies     

  ;

quit;

Ksharp
Super User

Patrick,

As far as I know there is a limitation of the number of matched string for prxchange(.) . I don't know if prxchange() is good enough to hold all of those OP's key string .

Xia Keshan

Patrick
Opal | Level 21

There will certainly be a limit both for the length and number of elements of a RegEx string as well as some point where the number of OR conditions becomes inefficient. I don't know though where these limits are and could only find out with try and error.

It appears to work with 26 "phrases" though. If this is more or less efficient than the other approaches suggested would need testing with real data and volumes.

And here the code with 26 "phrases". I needed to use "options noquotelenmax;" in order to suppress the warning created by a macro variable string longer than 256 characters used as regular expression.

data companies;

  format Company_Name $50.;

  input Company_Name $50.;

  datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

run;

data patterns(drop=_:);

  format Phrase $50.;

  input Phrase $50.;

  _Phrase=Phrase;

  do _i=1 to 12;

    Phrase=cats(_Phrase,_i);

    output;

  end;

  Phrase=_Phrase;

  output;

  datalines;

De Peddelaars

De Surdelaars

;

run;

proc sql noprint;

  select cats('"',Phrase,'"') into :pattern_list separated by '|'

  from patterns

  ;

quit;

%put xxx: %bquote(&pattern_list);

%let reg_ex=%str(%')s/&pattern_list//oi%str(%');

%put yyy: &reg_ex;

options noquotelenmax;

proc sql;

  create table want as

    select

      Company_Name,

      prxchange(%unquote(&reg_ex),1,company_name) as Company_Name_Cleansed format=$50.   

    from companies    

  ;

quit;

options quotelenmax;

PGStats
Opal | Level 21

It might be worth trying to update instead of creating a new copy of the table. Something like:

data a;

input name & $80.;

datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

filename t TEMP;

data _null_;

input key & $40.;

line = cats(

    'update a set name = tranwrd(name, """',

    key,

    '""","") where name contains """',

    key,

    '""";'

    );

file t;

put line;

datalines;

De Peddelaars

De Surdelaars

;

proc sql;

%include t / source2;

quit;

Ideally, SAS/SQL would get the updates done on the DBMS server, if that's where the data resides.

PG

PG

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1517 views
  • 5 likes
  • 7 in conversation