Hi All,
New to SAS, need some help in string matching. I have Table A which have millions of records. Table B has less than 5K records.
I want to match Address field of table A with Keywords field of table B. Currently doing Proc SQL with Like operation which is very slow and query takes ages. There is a cross join between both the tables. Both tables are in SAS.
Wanted to know the efficient way of doing this. Using SAS EG V 8.2.
It will be a great help. Thanks in advance.
Thanks @PeterClemmensen for asking all the questions.
@cheema11 Please provide next time sample data via a working SAS data step so we don't have to spend time to create such data in order to provide tested code.
I believe below should do what you're asking for. It's still a resource intensive process but certainly better than a cartesian join.
data master;
id=_n_;
infile datalines truncover;
input address $40.;
datalines;
alias_address: abc@swy.com.au
alias_address: pa@baxa.com
alias_address: baxa@test.com
alias_address: 123456789@baxa.com
;
data lookup;
infile datalines truncover;
input keyword $ category $;
datalines;
A swy
A baxa
B 12345678
C test
;
data lookup;
set lookup;
_str_len=length(category);
run;
proc sort data=lookup;
by descending _str_len;
run;
data want(drop=_:);
set master;
_r_len=-1;
do _i=1 to _nobs;
set lookup point=_i nobs=_nobs;
if _r_len>_str_len then leave;
if find(address,category,'it') then
do;
output;
_r_len=_str_len;
end;
end;
run;
proc print data=want;
run;
Are these both SAS Data Sets?
Ok. What if a string in table a matches multiple substrings in table b?
The longest string from table b, right?
Thanks @PeterClemmensen for asking all the questions.
@cheema11 Please provide next time sample data via a working SAS data step so we don't have to spend time to create such data in order to provide tested code.
I believe below should do what you're asking for. It's still a resource intensive process but certainly better than a cartesian join.
data master;
id=_n_;
infile datalines truncover;
input address $40.;
datalines;
alias_address: abc@swy.com.au
alias_address: pa@baxa.com
alias_address: baxa@test.com
alias_address: 123456789@baxa.com
;
data lookup;
infile datalines truncover;
input keyword $ category $;
datalines;
A swy
A baxa
B 12345678
C test
;
data lookup;
set lookup;
_str_len=length(category);
run;
proc sort data=lookup;
by descending _str_len;
run;
data want(drop=_:);
set master;
_r_len=-1;
do _i=1 to _nobs;
set lookup point=_i nobs=_nobs;
if _r_len>_str_len then leave;
if find(address,category,'it') then
do;
output;
_r_len=_str_len;
end;
end;
run;
proc print data=want;
run;
@cheema11 wrote:
Hi @Patrick.
In response to your query...."What if a string in table A matches multiple substrings in table B?" My answer was "Return the one with longest string length in Table B. If string length is same, then return any one of them."
What do i need to change in the script if I want to return the string with shortest length in Table B?
@PeterClemmensen did all the work getting the full requirements from you.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.