Help using Base SAS procedures

Need to identify a group of strings in a field

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 19
Accepted Solution

Need to identify a group of strings in a field

I have a requirement like this: There are some frequently quoted phrases. If Company_name contains one of these frequently quoted phrases in quotes, then we need to remove the quotes and the characters contained within them.

Can you please help to achieve this? I have 230 million company names and 25 frequently quoted phrases. I have to search 230mm company names for these 25 phrases.

Sample Data is like below.

Company Names:

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

Frequently quoted phrases:

De Peddelaars

De Surdelaars

Expected output:

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging

SAS@EMMAUS

Accepted Solutions
Solution
‎01-05-2015 08:05 AM
Super User
Posts: 10,028

Re: Need to identify a group of strings in a field

Hash Table :

data a;
input name & $80.;
cards;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
run;
data b;
input key & $40.;
cards;
De Peddelaars
De Surdelaars
;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set a;
  declare hash h(dataset:'b');
  h.definekey('key');
  h.definedone();
 end;
set a;
 key=scan(trim(name),-1,'"');
 if h.check()=0 then name=scan(name,1,'"');
 drop key;
run;

Xia Keshan

View solution in original post


All Replies
Solution
‎01-05-2015 08:05 AM
Super User
Posts: 10,028

Re: Need to identify a group of strings in a field

Hash Table :

data a;
input name & $80.;
cards;
Awolsk Kanovereniging "De Baldelaars"
Awolsk Kanovereniging "De Maldelaars"
Awolsk Kanoverrriging "De Surdelaars"
Awolsk Kanovereniging "De Udadelaars"
Awolsk Kanovereniging "De Peddelaars"
;
run;
data b;
input key & $40.;
cards;
De Peddelaars
De Surdelaars
;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set a;
  declare hash h(dataset:'b');
  h.definekey('key');
  h.definedone();
 end;
set a;
 key=scan(trim(name),-1,'"');
 if h.check()=0 then name=scan(name,1,'"');
 drop key;
run;

Xia Keshan

Occasional Contributor
Posts: 19

Re: Need to identify a group of strings in a field

Thank you. We tried Hash instead of SQL. This worked as per the requirement.

SAS@EMMAUS
Super Contributor
Posts: 578

Re: Need to identify a group of strings in a field

I'm sure there's a way to do it with hash tables...but here's a relatively straight forward approach:

data companies;

format Company_Name $50.;

input Company_Name $50.;

datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

run;

data patterns;

format Phrase $50.;

input Phrase $50.;

datalines;

"De Peddelaars"

"De Surdelaars"

;

run;

proc sql;

create table want as

select

    tranwrd(c.company_name,trim(p.phrase),'') as test

from

    companies c

    left outer join  patterns p

        on c.company_name like cats("%",p.phrase,"%");

quit;

produces

Awolsk Kanovereniging

Awolsk Kanoverrriging

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanovereniging "De Udadelaars"

Super User
Super User
Posts: 7,955

Re: Need to identify a group of strings in a field

Hi,

The only thing I would add is a couple of other options to do the code (as 230m records might cause some disk space issues with create table):

The first I thought of was update table, however you may also run into space issues.

The other is to generate a datastep from your source data (which may be more resource efficient):

data _null_;

     set patterns end=last;

     if _n_=1 then call execute('data companies; set companies;');

     call execute(' company_name=tranwrd(company_name,'||"'"||strip(phrase)||"'"||',"");');   /* Note single and double quotes! */

     if last then call execute(' run;');

run;

A final option is to move the quote to a new field using " as a delimiter, then its a simple where clause.

Super User
Posts: 10,028

Re: Need to identify a group of strings in a field

Would contains operator be better ?

c.company_name contains strip(p.phrase)

Super Contributor
Posts: 578

Re: Need to identify a group of strings in a field

I'm not sure.  Might have to test that.

As far as table size, it always depends on where the original data lives.  If it is a SAS dataset, then using a hash function in conjunction with a data step modify  might be fastest.  If the source is some other data management system (i.e. Oracle), then you'll probably have to create a separate dataset anyway.

Trusted Advisor
Posts: 1,228

Re: Need to identify a group of strings in a field

data want(drop=phrase);

set have;

length phrase $20;

do phrase = 'De Peddelaars', 'De Surdelaars';

  company_name = tranwrd(strip(company_name),'"'||strip(phrase)||'"','');

end;

run;

Super Contributor
Posts: 578

Re: Need to identify a group of strings in a field

Wouldn't that eliminate the quotes even when they aren't part of the phrase that's being suppressed?

Trusted Advisor
Posts: 1,228

Re: Need to identify a group of strings in a field

I think it will not eliminate quotes that aren't part of phrase.

Respected Advisor
Posts: 4,173

Re: Need to identify a group of strings in a field

Assuming these 25 phrases are not too long using a regular expression could be worth trying. Below example uses the SAS function prxchange().

A lot of data bases have their own implementation of similar functions so if the data is stored in a data base you could implement using explicit pass-through using the similar db function.

data companies;

  format Company_Name $50.;

  input Company_Name $50.;

  datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

run;

data patterns;

  format Phrase $50.;

  input Phrase $50.;

  datalines;

De Peddelaars

De Surdelaars

;

run;

proc sql noprint;

  select cats('"',Phrase,'"') into Smiley Tongueattern_list separated by '|'

  from patterns

  ;

quit;

%put xxx: %bquote(&pattern_list);

%let reg_ex=%str(%')s/&pattern_list//oi%str(%');

%put yyy: &reg_ex;

proc sql;

  create table want as

    select

      Company_Name,

      prxchange(%unquote(&reg_ex),1,company_name) as Company_Name_Cleansed format=$50.    

    from companies     

  ;

quit;

Super User
Posts: 10,028

Re: Need to identify a group of strings in a field

Patrick,

As far as I know there is a limitation of the number of matched string for prxchange(.) . I don't know if prxchange() is good enough to hold all of those OP's key string .

Xia Keshan

Respected Advisor
Posts: 4,173

Re: Need to identify a group of strings in a field

There will certainly be a limit both for the length and number of elements of a RegEx string as well as some point where the number of OR conditions becomes inefficient. I don't know though where these limits are and could only find out with try and error.

It appears to work with 26 "phrases" though. If this is more or less efficient than the other approaches suggested would need testing with real data and volumes.

And here the code with 26 "phrases". I needed to use "options noquotelenmax;" in order to suppress the warning created by a macro variable string longer than 256 characters used as regular expression.

data companies;

  format Company_Name $50.;

  input Company_Name $50.;

  datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

run;

data patterns(drop=_Smiley Happy;

  format Phrase $50.;

  input Phrase $50.;

  _Phrase=Phrase;

  do _i=1 to 12;

    Phrase=cats(_Phrase,_i);

    output;

  end;

  Phrase=_Phrase;

  output;

  datalines;

De Peddelaars

De Surdelaars

;

run;

proc sql noprint;

  select cats('"',Phrase,'"') into Smiley Tongueattern_list separated by '|'

  from patterns

  ;

quit;

%put xxx: %bquote(&pattern_list);

%let reg_ex=%str(%')s/&pattern_list//oi%str(%');

%put yyy: &reg_ex;

options noquotelenmax;

proc sql;

  create table want as

    select

      Company_Name,

      prxchange(%unquote(&reg_ex),1,company_name) as Company_Name_Cleansed format=$50.   

    from companies    

  ;

quit;

options quotelenmax;

Respected Advisor
Posts: 4,925

Re: Need to identify a group of strings in a field

It might be worth trying to update instead of creating a new copy of the table. Something like:

data a;

input name & $80.;

datalines;

Awolsk Kanovereniging "De Baldelaars"

Awolsk Kanovereniging "De Maldelaars"

Awolsk Kanoverrriging "De Surdelaars"

Awolsk Kanovereniging "De Udadelaars"

Awolsk Kanovereniging "De Peddelaars"

;

filename t TEMP;

data _null_;

input key & $40.;

line = cats(

    'update a set name = tranwrd(name, """',

    key,

    '""","") where name contains """',

    key,

    '""";'

    );

file t;

put line;

datalines;

De Peddelaars

De Surdelaars

;

proc sql;

%include t / source2;

quit;

Ideally, SAS/SQL would get the updates done on the DBMS server, if that's where the data resides.

PG

PG
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 511 views
  • 5 likes
  • 7 in conversation