BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
triley
Obsidian | Level 7

I was wondering if anyone had a better approach for a process I am currently running. What i'm trying to do is join 2 tables together based on Zip, and then do some matching techniques to find the "best" matches from the combined tables.

 

I have the process working OK now using sql joins, but the run time is very long. I was wondering if any of you had any ideas to optimize this process. I think its important to point out also that i don't want to remove the non-matches (match_ind = 0), I need to keep all records even if it doesnt result in a "match".

 

Please let me know if you have any questions or need further explanation of what i'm trying to do. I would be really interested to see if someone has a way to do this with either key indexing or hash tables as I haven't had much experiance with either.

 

Thanks in advance!

Tom

 

data have1;

input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;

datalines ;

1 aaa 12345

2 bbb 12345

3 ccc 55555 12345

4 ddd 99999

;

run;

 

 

data have2;

input key_b 2. name_b $ 4. zip_b $ 6. ;

datalines ;

5 aaa 12345

6 ggg 12345

7 ccc 12345

8 ddd 99999

9 hhh 99999

;

run;

 

proc sql;

create table want as

select a.*

, b.key_b

, b.name_b

, (a.name_a=b.name_b) as match_ind

from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b

order by key_a, key_b

;quit;

 

 

WANT:

key_a name_a zip_a zip2_a key_b name_b match_ind
1 aaa 12345   5 aaa 1
1 aaa 12345   6 ggg 0
1 aaa 12345   7 ccc 0
2 bbb 12345   5 aaa 0
2 bbb 12345   6 ggg 0
2 bbb 12345   7 ccc 0
3 ccc 55555 12345 5 aaa 0
3 ccc 55555 12345 6 ggg 0
3 ccc 55555 12345 7 ccc 1
4 ddd 99999   8 ddd 1
4 ddd 99999   9 hhh 0
1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

A hash table is twice as fast on my machine.

 

data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
do i=1 to 1e3; output; end;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
run; 

data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
do i=1 to 1e3; output; end;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
run;

data WANT2;
  SET HAVE1;
  if _n_=1 then do;
    dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
    HAVE2.definekey('ZIP_B');
    HAVE2.definedata('KEY_B','NAME_B');
    HAVE2.definedone();
    if 0 then set HAVE2;
  end;
  RC = HAVE2.find(key:ZIP_A);   
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP_A) ;
  end ;
  RC = HAVE2.find(key:ZIP2_A) ;    
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP2_A);
  end ;
  drop I RC;
run;

13250  proc sql;
13251  create table want as
13252  select a.*
13253  , b.key_b
13254  , b.name_b
13255  , (a.name_a=b.name_b) as match_ind
13256  from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
13257  order by key_a, key_b
13258  ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that
      can not be optimized.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT created, with 11000000 rows and 8 columns.

13258!  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.36 seconds
      user cpu time       9.15 seconds
      system cpu time     1.87 seconds
      memory              265375.66k
      OS Memory           338160.00k
      Timestamp            7/07/2017 11:24:42 PM

13270  data WANT2;
13271    SET HAVE1;
13272    if _n_=1 then do;
13273      dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
13274      HAVE2.definekey('ZIP_B');
13275      HAVE2.definedata('KEY_B','NAME_B');
13276      HAVE2.definedone();
13277      if 0 then set HAVE2;
13278    end;
13279    RC = HAVE2.find(key:ZIP_A) ;
13280    do while (RC = 0) ;
13281      output;
13282      RC=HAVE2.find_next(key:ZIP_A) ;
13283    end ;
13284    RC = HAVE2.find(key:ZIP2_A) ;
13285    do while (RC = 0) ;
13286      output;
13287      RC=HAVE2.find_next(key:ZIP2_A) ;
13288    end ;
13289    drop I RC;
13290  run;

NOTE: There were 5000 observations read from the data set WORK.HAVE2.
NOTE: There were 4000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.WANT2 has 11000000 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           3.27 seconds
      user cpu time       1.51 seconds
      system cpu time     1.17 seconds
      memory              697.37k
      OS Memory           74404.00k
      Timestamp            7/07/2017 11:24:51 PM

View solution in original post

12 REPLIES 12
Sven111
Pyrite | Level 9

I'm not sure if it'd make a huge difference but I've always had performance problems with OR's in JOIN conditions and try to avoid them, although I generally do all this in SQL Passthrough, I'm not sure if it makes a difference in straight PROC SQL.  You could try something like this:

 

proc sql;
    create table want as
        select     a.*
                  ,b.key_b
                  ,b.name_b
                  ,(a.name_a=b.name_b) as match_ind
        from       have1 a 
        inner join have2 b 
                on a.zip_a = b.zip_b     
        UNION ALL
        select     a.*
                  ,b.key_b
                  ,b.name_b
                  ,(a.name_a=b.name_b) as match_ind
        from       have1 a 
        inner join have2 b 
                on a.zip2_a = b.zip_b
        order by    key_a, key_b 
;quit;

I'd also be interested to see alternate approaches to this though, I generally go for a SQL centric solution since that's what I'm most used to, but I love seeing other ways to do things.

LinusH
Tourmaline | Level 20
I don't have any code suggestions at this point, but some general guidelines.
- Allow as much memory as possible for the join to minimize swapping (MEMSIZE SORTSIZE)
- Use the PROC SQL option _method to anlyze how the query plan look like
- Moving the data to SPDE allows for better I/O and in some cases better WHERE clause evaluation. In this scenario you could try indexing - it <might> work...
Data never sleeps
ChrisNZ
Tourmaline | Level 20

A hash table is twice as fast on my machine.

 

data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
do i=1 to 1e3; output; end;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
run; 

data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
do i=1 to 1e3; output; end;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
run;

data WANT2;
  SET HAVE1;
  if _n_=1 then do;
    dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
    HAVE2.definekey('ZIP_B');
    HAVE2.definedata('KEY_B','NAME_B');
    HAVE2.definedone();
    if 0 then set HAVE2;
  end;
  RC = HAVE2.find(key:ZIP_A);   
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP_A) ;
  end ;
  RC = HAVE2.find(key:ZIP2_A) ;    
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP2_A);
  end ;
  drop I RC;
run;

13250  proc sql;
13251  create table want as
13252  select a.*
13253  , b.key_b
13254  , b.name_b
13255  , (a.name_a=b.name_b) as match_ind
13256  from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
13257  order by key_a, key_b
13258  ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that
      can not be optimized.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT created, with 11000000 rows and 8 columns.

13258!  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.36 seconds
      user cpu time       9.15 seconds
      system cpu time     1.87 seconds
      memory              265375.66k
      OS Memory           338160.00k
      Timestamp            7/07/2017 11:24:42 PM

13270  data WANT2;
13271    SET HAVE1;
13272    if _n_=1 then do;
13273      dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
13274      HAVE2.definekey('ZIP_B');
13275      HAVE2.definedata('KEY_B','NAME_B');
13276      HAVE2.definedone();
13277      if 0 then set HAVE2;
13278    end;
13279    RC = HAVE2.find(key:ZIP_A) ;
13280    do while (RC = 0) ;
13281      output;
13282      RC=HAVE2.find_next(key:ZIP_A) ;
13283    end ;
13284    RC = HAVE2.find(key:ZIP2_A) ;
13285    do while (RC = 0) ;
13286      output;
13287      RC=HAVE2.find_next(key:ZIP2_A) ;
13288    end ;
13289    drop I RC;
13290  run;

NOTE: There were 5000 observations read from the data set WORK.HAVE2.
NOTE: There were 4000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.WANT2 has 11000000 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           3.27 seconds
      user cpu time       1.51 seconds
      system cpu time     1.17 seconds
      memory              697.37k
      OS Memory           74404.00k
      Timestamp            7/07/2017 11:24:51 PM

triley
Obsidian | Level 7

The Hash table took my query from around 20 minutes to around 20 seconds! Looks like I need to start learning how to use Hash tables more now...Thank you very much for providing this info!

 

One question though with hash tables, can i still use this to do a comparison of a variable that isn't brought in? So lets say in my example there is also an address for each record, for the matching i need to do i want to compare if the address is the same, do i need to declare that in the "define data" part, or i guess how could i put something equivelant to the sql:  a.address=b.address?

ChrisNZ
Tourmaline | Level 20

Just add this line at the bottom of the datastep:

MATCH__IND=(NAME_A=NAME_B);

Many more great performance tips in the book linked from my signature, but not on joins,

The chapter on table joins will be in the third edition, but it is such a vast topic, with so many variations, and join examples are so space-consuming, that I need a lot more time to organise this chapter into something useful.

triley
Obsidian | Level 7
What if in the tables the variables are named the same?...my "real" data has both the "names" called name (so instead of name_a=name_b it is name=name)


ChrisNZ
Tourmaline | Level 20
Just add a rename dataset option wherever needed. Even in the hash table's data set.
HAVE2 (rename=(NAME=NAMEB))
triley
Obsidian | Level 7
Chris I really appreciate all the help. I wasn't able to get the MATCH_IND to work though...I tried putting it everywhere in between data; and run; and it didn't get me values for every record. Could you maybe run it out and verify it is working on your end and then provide me with the code you got to work?



Thanks again...


ChrisNZ
Tourmaline | Level 20

This works.


data WANT2;
  SET HAVE1;
  if _n_=1 then do;
    dcl hash HAVE2(dataset:'HAVE2 (rename=(NAME=NAME_B))', multidata: 'Y');
    HAVE2.definekey('ZIP_B');
    HAVE2.definedata('KEY_B','NAME_B');
    HAVE2.definedone();
    if 0 then set HAVE2(rename=(NAME=NAME_B));
  end;
  RC = HAVE2.find(key:ZIP_A);   
  do while (RC = 0) ;
    MATCH_IND=(NAME_A=NAME_B);  
    output;
    RC=HAVE2.find_next(key:ZIP_A) ;
  end ;
  RC = HAVE2.find(key:ZIP2_A) ;    
  do while (RC = 0) ;
    MATCH_IND=(NAME_A=NAME_B);  
    output;
    RC=HAVE2.find_next(key:ZIP2_A);
  end ;
 drop I RC;
run;

Apologies for confusing you. Since there are output statements, the value should be derived before the data is output.

 

triley
Obsidian | Level 7
Great, thank you...One more thing and I promise I won't ask anything additional of you...In my real data I have a "zip2_b" as well. I tried just adding it here (HAVE2.definekey('ZIP_B','ZIP2_B'); but it gave me an error. Essentially I want to do the same thing as a join like this:



where a.zip_a=b.zip_b

or a.zip_a=b.zip2_b

or a.zip2_a=b.zip_b

or a.zip2_a=b.zip2_b






ChrisNZ
Tourmaline | Level 20

(HAVE2.definekey('ZIP_B','ZIP2_B');

means you match key 1 AND key 2.

 

Since you want OR, you need to load 2 hash tables and scan each of them twice, like already do for the first one, so you essentially repeat the existing code twice in that data step.

 

 

 

triley
Obsidian | Level 7
Makes sense...Thank you!


sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1425 views
  • 3 likes
  • 4 in conversation