- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I was wondering if anyone had a better approach for a process I am currently running. What i'm trying to do is join 2 tables together based on Zip, and then do some matching techniques to find the "best" matches from the combined tables.
I have the process working OK now using sql joins, but the run time is very long. I was wondering if any of you had any ideas to optimize this process. I think its important to point out also that i don't want to remove the non-matches (match_ind = 0), I need to keep all records even if it doesnt result in a "match".
Please let me know if you have any questions or need further explanation of what i'm trying to do. I would be really interested to see if someone has a way to do this with either key indexing or hash tables as I haven't had much experiance with either.
Thanks in advance!
Tom
data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
;
run;
data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
;
run;
proc sql;
create table want as
select a.*
, b.key_b
, b.name_b
, (a.name_a=b.name_b) as match_ind
from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
order by key_a, key_b
;quit;
WANT:
key_a | name_a | zip_a | zip2_a | key_b | name_b | match_ind |
1 | aaa | 12345 | 5 | aaa | 1 | |
1 | aaa | 12345 | 6 | ggg | 0 | |
1 | aaa | 12345 | 7 | ccc | 0 | |
2 | bbb | 12345 | 5 | aaa | 0 | |
2 | bbb | 12345 | 6 | ggg | 0 | |
2 | bbb | 12345 | 7 | ccc | 0 | |
3 | ccc | 55555 | 12345 | 5 | aaa | 0 |
3 | ccc | 55555 | 12345 | 6 | ggg | 0 |
3 | ccc | 55555 | 12345 | 7 | ccc | 1 |
4 | ddd | 99999 | 8 | ddd | 1 | |
4 | ddd | 99999 | 9 | hhh | 0 |
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A hash table is twice as fast on my machine.
data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
do i=1 to 1e3; output; end;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
run;
data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
do i=1 to 1e3; output; end;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
run;
data WANT2;
SET HAVE1;
if _n_=1 then do;
dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
HAVE2.definekey('ZIP_B');
HAVE2.definedata('KEY_B','NAME_B');
HAVE2.definedone();
if 0 then set HAVE2;
end;
RC = HAVE2.find(key:ZIP_A);
do while (RC = 0) ;
output;
RC=HAVE2.find_next(key:ZIP_A) ;
end ;
RC = HAVE2.find(key:ZIP2_A) ;
do while (RC = 0) ;
output;
RC=HAVE2.find_next(key:ZIP2_A);
end ;
drop I RC;
run;
13250 proc sql;
13251 create table want as
13252 select a.*
13253 , b.key_b
13254 , b.name_b
13255 , (a.name_a=b.name_b) as match_ind
13256 from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
13257 order by key_a, key_b
13258 ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT created, with 11000000 rows and 8 columns.
13258! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 6.36 seconds
user cpu time 9.15 seconds
system cpu time 1.87 seconds
memory 265375.66k
OS Memory 338160.00k
Timestamp 7/07/2017 11:24:42 PM
13270 data WANT2;
13271 SET HAVE1;
13272 if _n_=1 then do;
13273 dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
13274 HAVE2.definekey('ZIP_B');
13275 HAVE2.definedata('KEY_B','NAME_B');
13276 HAVE2.definedone();
13277 if 0 then set HAVE2;
13278 end;
13279 RC = HAVE2.find(key:ZIP_A) ;
13280 do while (RC = 0) ;
13281 output;
13282 RC=HAVE2.find_next(key:ZIP_A) ;
13283 end ;
13284 RC = HAVE2.find(key:ZIP2_A) ;
13285 do while (RC = 0) ;
13286 output;
13287 RC=HAVE2.find_next(key:ZIP2_A) ;
13288 end ;
13289 drop I RC;
13290 run;
NOTE: There were 5000 observations read from the data set WORK.HAVE2.
NOTE: There were 4000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.WANT2 has 11000000 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 3.27 seconds
user cpu time 1.51 seconds
system cpu time 1.17 seconds
memory 697.37k
OS Memory 74404.00k
Timestamp 7/07/2017 11:24:51 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not sure if it'd make a huge difference but I've always had performance problems with OR's in JOIN conditions and try to avoid them, although I generally do all this in SQL Passthrough, I'm not sure if it makes a difference in straight PROC SQL. You could try something like this:
proc sql;
create table want as
select a.*
,b.key_b
,b.name_b
,(a.name_a=b.name_b) as match_ind
from have1 a
inner join have2 b
on a.zip_a = b.zip_b
UNION ALL
select a.*
,b.key_b
,b.name_b
,(a.name_a=b.name_b) as match_ind
from have1 a
inner join have2 b
on a.zip2_a = b.zip_b
order by key_a, key_b
;quit;
I'd also be interested to see alternate approaches to this though, I generally go for a SQL centric solution since that's what I'm most used to, but I love seeing other ways to do things.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Allow as much memory as possible for the join to minimize swapping (MEMSIZE SORTSIZE)
- Use the PROC SQL option _method to anlyze how the query plan look like
- Moving the data to SPDE allows for better I/O and in some cases better WHERE clause evaluation. In this scenario you could try indexing - it <might> work...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A hash table is twice as fast on my machine.
data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
do i=1 to 1e3; output; end;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
run;
data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
do i=1 to 1e3; output; end;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
run;
data WANT2;
SET HAVE1;
if _n_=1 then do;
dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
HAVE2.definekey('ZIP_B');
HAVE2.definedata('KEY_B','NAME_B');
HAVE2.definedone();
if 0 then set HAVE2;
end;
RC = HAVE2.find(key:ZIP_A);
do while (RC = 0) ;
output;
RC=HAVE2.find_next(key:ZIP_A) ;
end ;
RC = HAVE2.find(key:ZIP2_A) ;
do while (RC = 0) ;
output;
RC=HAVE2.find_next(key:ZIP2_A);
end ;
drop I RC;
run;
13250 proc sql;
13251 create table want as
13252 select a.*
13253 , b.key_b
13254 , b.name_b
13255 , (a.name_a=b.name_b) as match_ind
13256 from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
13257 order by key_a, key_b
13258 ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT created, with 11000000 rows and 8 columns.
13258! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 6.36 seconds
user cpu time 9.15 seconds
system cpu time 1.87 seconds
memory 265375.66k
OS Memory 338160.00k
Timestamp 7/07/2017 11:24:42 PM
13270 data WANT2;
13271 SET HAVE1;
13272 if _n_=1 then do;
13273 dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
13274 HAVE2.definekey('ZIP_B');
13275 HAVE2.definedata('KEY_B','NAME_B');
13276 HAVE2.definedone();
13277 if 0 then set HAVE2;
13278 end;
13279 RC = HAVE2.find(key:ZIP_A) ;
13280 do while (RC = 0) ;
13281 output;
13282 RC=HAVE2.find_next(key:ZIP_A) ;
13283 end ;
13284 RC = HAVE2.find(key:ZIP2_A) ;
13285 do while (RC = 0) ;
13286 output;
13287 RC=HAVE2.find_next(key:ZIP2_A) ;
13288 end ;
13289 drop I RC;
13290 run;
NOTE: There were 5000 observations read from the data set WORK.HAVE2.
NOTE: There were 4000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.WANT2 has 11000000 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 3.27 seconds
user cpu time 1.51 seconds
system cpu time 1.17 seconds
memory 697.37k
OS Memory 74404.00k
Timestamp 7/07/2017 11:24:51 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The Hash table took my query from around 20 minutes to around 20 seconds! Looks like I need to start learning how to use Hash tables more now...Thank you very much for providing this info!
One question though with hash tables, can i still use this to do a comparison of a variable that isn't brought in? So lets say in my example there is also an address for each record, for the matching i need to do i want to compare if the address is the same, do i need to declare that in the "define data" part, or i guess how could i put something equivelant to the sql: a.address=b.address?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just add this line at the bottom of the datastep:
MATCH__IND=(NAME_A=NAME_B);
Many more great performance tips in the book linked from my signature, but not on joins,
The chapter on table joins will be in the third edition, but it is such a vast topic, with so many variations, and join examples are so space-consuming, that I need a lot more time to organise this chapter into something useful.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
HAVE2 (rename=(NAME=NAMEB))
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks again...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This works.
data WANT2;
SET HAVE1;
if _n_=1 then do;
dcl hash HAVE2(dataset:'HAVE2 (rename=(NAME=NAME_B))', multidata: 'Y');
HAVE2.definekey('ZIP_B');
HAVE2.definedata('KEY_B','NAME_B');
HAVE2.definedone();
if 0 then set HAVE2(rename=(NAME=NAME_B));
end;
RC = HAVE2.find(key:ZIP_A);
do while (RC = 0) ;
MATCH_IND=(NAME_A=NAME_B);
output;
RC=HAVE2.find_next(key:ZIP_A) ;
end ;
RC = HAVE2.find(key:ZIP2_A) ;
do while (RC = 0) ;
MATCH_IND=(NAME_A=NAME_B);
output;
RC=HAVE2.find_next(key:ZIP2_A);
end ;
drop I RC;
run;
Apologies for confusing you. Since there are output statements, the value should be derived before the data is output.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
where a.zip_a=b.zip_b
or a.zip_a=b.zip2_b
or a.zip2_a=b.zip_b
or a.zip2_a=b.zip2_b
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
(HAVE2.definekey('ZIP_B','ZIP2_B');
means you match key 1 AND key 2.
Since you want OR, you need to load 2 hash tables and scan each of them twice, like already do for the first one, so you essentially repeat the existing code twice in that data step.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content