Hi All,
New to SAS, need some help in string matching. I have Table A which have millions of records. Table B has less than 5K records.
I want to match Address field of table A with Keywords field of table B. Currently doing Proc SQL with Like operation which is very slow and query takes ages. There is a cross join between both the tables. Both tables are in SAS.
Wanted to know the efficient way of doing this. Using SAS EG V 8.2.
It will be a great help. Thanks in advance.
Thanks @PeterClemmensen for asking all the questions.
@cheema11 Please provide next time sample data via a working SAS data step so we don't have to spend time to create such data in order to provide tested code.
I believe below should do what you're asking for. It's still a resource intensive process but certainly better than a cartesian join.
data master;
id=_n_;
infile datalines truncover;
input address $40.;
datalines;
alias_address: abc@swy.com.au
alias_address: pa@baxa.com
alias_address: baxa@test.com
alias_address: 123456789@baxa.com
;
data lookup;
infile datalines truncover;
input keyword $ category $;
datalines;
A swy
A baxa
B 12345678
C test
;
data lookup;
set lookup;
_str_len=length(category);
run;
proc sort data=lookup;
by descending _str_len;
run;
data want(drop=_:);
set master;
_r_len=-1;
do _i=1 to _nobs;
set lookup point=_i nobs=_nobs;
if _r_len>_str_len then leave;
if find(address,category,'it') then
do;
output;
_r_len=_str_len;
end;
end;
run;
proc print data=want;
run;
Are these both SAS Data Sets?
Ok. What if a string in table a matches multiple substrings in table b?
The longest string from table b, right?
Thanks @PeterClemmensen for asking all the questions.
@cheema11 Please provide next time sample data via a working SAS data step so we don't have to spend time to create such data in order to provide tested code.
I believe below should do what you're asking for. It's still a resource intensive process but certainly better than a cartesian join.
data master;
id=_n_;
infile datalines truncover;
input address $40.;
datalines;
alias_address: abc@swy.com.au
alias_address: pa@baxa.com
alias_address: baxa@test.com
alias_address: 123456789@baxa.com
;
data lookup;
infile datalines truncover;
input keyword $ category $;
datalines;
A swy
A baxa
B 12345678
C test
;
data lookup;
set lookup;
_str_len=length(category);
run;
proc sort data=lookup;
by descending _str_len;
run;
data want(drop=_:);
set master;
_r_len=-1;
do _i=1 to _nobs;
set lookup point=_i nobs=_nobs;
if _r_len>_str_len then leave;
if find(address,category,'it') then
do;
output;
_r_len=_str_len;
end;
end;
run;
proc print data=want;
run;
@cheema11 wrote:
Hi @Patrick.
In response to your query...."What if a string in table A matches multiple substrings in table B?" My answer was "Return the one with longest string length in Table B. If string length is same, then return any one of them."
What do i need to change in the script if I want to return the string with shortest length in Table B?
@PeterClemmensen did all the work getting the full requirements from you.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Check out this tutorial series to learn how to build your own steps in SAS Studio.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.