BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
cheema11
Calcite | Level 5

Hi All,

 

New to SAS, need some help in string matching. I have Table A which have millions of records. Table B has less than 5K records.

I want to match Address field of table A with Keywords field of table B. Currently doing Proc SQL with Like operation which is very slow and query takes ages. There is a cross join between both the tables. Both tables are in SAS.

Wanted to know the efficient way of doing this. Using SAS EG V 8.2.

 

It will be a great help. Thanks in advance.

 

 

SAS Data.PNG

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Thanks @PeterClemmensen for asking all the questions.

@cheema11 Please provide next time sample data via a working SAS data step so we don't have to spend time to create such data in order to provide tested code.

 

I believe below should do what you're asking for. It's still a resource intensive process but certainly better than a cartesian join.

data master;
  id=_n_;
  infile datalines truncover;
  input address $40.;
  datalines;
alias_address: abc@swy.com.au
alias_address: pa@baxa.com
alias_address: baxa@test.com
alias_address: 123456789@baxa.com
;

data lookup;
  infile datalines truncover;
  input keyword $ category $;
  datalines;
A swy
A baxa
B 12345678
C test
;

data lookup;
  set lookup;
  _str_len=length(category);
run;

proc sort data=lookup;
  by descending _str_len;
run;

data want(drop=_:);
  set master;
  _r_len=-1;
  do _i=1 to _nobs;
    set lookup point=_i nobs=_nobs;
    if _r_len>_str_len then leave;
    if find(address,category,'it') then
      do;
        output;
        _r_len=_str_len;
      end;
  end;
run;

proc print data=want;
run;

 

Patrick_0-1664184308051.png

 

View solution in original post

10 REPLIES 10
PeterClemmensen
Tourmaline | Level 20

Are these both SAS Data Sets?

cheema11
Calcite | Level 5
Yes both are sas data sets.
PeterClemmensen
Tourmaline | Level 20

Ok. What if a string in table a matches multiple substrings in table b?

cheema11
Calcite | Level 5
Return the one with longest string length. If string length is same, then return any one of them.
PeterClemmensen
Tourmaline | Level 20

The longest string from table b, right?

cheema11
Calcite | Level 5
Yes longest string from Table B
Patrick
Opal | Level 21

Thanks @PeterClemmensen for asking all the questions.

@cheema11 Please provide next time sample data via a working SAS data step so we don't have to spend time to create such data in order to provide tested code.

 

I believe below should do what you're asking for. It's still a resource intensive process but certainly better than a cartesian join.

data master;
  id=_n_;
  infile datalines truncover;
  input address $40.;
  datalines;
alias_address: abc@swy.com.au
alias_address: pa@baxa.com
alias_address: baxa@test.com
alias_address: 123456789@baxa.com
;

data lookup;
  infile datalines truncover;
  input keyword $ category $;
  datalines;
A swy
A baxa
B 12345678
C test
;

data lookup;
  set lookup;
  _str_len=length(category);
run;

proc sort data=lookup;
  by descending _str_len;
run;

data want(drop=_:);
  set master;
  _r_len=-1;
  do _i=1 to _nobs;
    set lookup point=_i nobs=_nobs;
    if _r_len>_str_len then leave;
    if find(address,category,'it') then
      do;
        output;
        _r_len=_str_len;
      end;
  end;
run;

proc print data=want;
run;

 

Patrick_0-1664184308051.png

 

cheema11
Calcite | Level 5
Thanks for the response on this. Really appreciate for the help. @PeterClemmensen @Patrick
cheema11
Calcite | Level 5
Hi @Patrick.
In response to your query...."What if a string in table A matches multiple substrings in table B?" My answer was "Return the one with longest string length in Table B. If string length is same, then return any one of them."

What do i need to change in the script if I want to return the string with shortest length in Table B?
Patrick
Opal | Level 21

@cheema11 wrote:
Hi @Patrick.
In response to your query...."What if a string in table A matches multiple substrings in table B?" My answer was "Return the one with longest string length in Table B. If string length is same, then return any one of them."

What do i need to change in the script if I want to return the string with shortest length in Table B?

@PeterClemmensen did all the work getting the full requirements from you.

 

Patrick_0-1664781929579.png

 

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2293 views
  • 0 likes
  • 3 in conversation