Solved: Table Joining

triley · Posted 07-06-2017 03:37 PM

I was wondering if anyone had a better approach for a process I am currently running. What i'm trying to do is join 2 tables together based on Zip, and then do some matching techniques to find the "best" matches from the combined tables.

I have the process working OK now using sql joins, but the run time is very long. I was wondering if any of you had any ideas to optimize this process. I think its important to point out also that i don't want to remove the non-matches (match_ind = 0), I need to keep all records even if it doesnt result in a "match".

Please let me know if you have any questions or need further explanation of what i'm trying to do. I would be really interested to see if someone has a way to do this with either key indexing or hash tables as I haven't had much experiance with either.

Thanks in advance!

Tom

data have1;

input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;

datalines ;

1 aaa 12345

2 bbb 12345

3 ccc 55555 12345

4 ddd 99999

;

run;

data have2;

input key_b 2. name_b $ 4. zip_b $ 6. ;

datalines ;

5 aaa 12345

6 ggg 12345

7 ccc 12345

8 ddd 99999

9 hhh 99999

;

run;

proc sql;

create table want as

select a.*

, b.key_b

, b.name_b

, (a.name_a=b.name_b) as match_ind

from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b

order by key_a, key_b

;quit;

WANT:

key_a	name_a	zip_a	zip2_a	key_b	name_b	match_ind
1	aaa	12345		5	aaa	1
1	aaa	12345		6	ggg	0
1	aaa	12345		7	ccc	0
2	bbb	12345		5	aaa	0
2	bbb	12345		6	ggg	0
2	bbb	12345		7	ccc	0
3	ccc	55555	12345	5	aaa	0
3	ccc	55555	12345	6	ggg	0
3	ccc	55555	12345	7	ccc	1
4	ddd	99999		8	ddd	1
4	ddd	99999		9	hhh	0

ChrisNZ · Posted 07-07-2017 07:32 AM

A hash table is twice as fast on my machine.

data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
do i=1 to 1e3; output; end;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
run; 

data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
do i=1 to 1e3; output; end;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
run;

data WANT2;
  SET HAVE1;
  if _n_=1 then do;
    dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
    HAVE2.definekey('ZIP_B');
    HAVE2.definedata('KEY_B','NAME_B');
    HAVE2.definedone();
    if 0 then set HAVE2;
  end;
  RC = HAVE2.find(key:ZIP_A);   
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP_A) ;
  end ;
  RC = HAVE2.find(key:ZIP2_A) ;    
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP2_A);
  end ;
  drop I RC;
run;

13250 proc sql;
13251 create table want as
13252 select a.*
13253 , b.key_b
13254 , b.name_b
13255 , (a.name_a=b.name_b) as match_ind
13256 from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
13257 order by key_a, key_b
13258 ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that
      can not be optimized.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT created, with 11000000 rows and 8 columns.

13258! quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.36 seconds
      user cpu time       9.15 seconds
      system cpu time     1.87 seconds
      memory              265375.66k
      OS Memory           338160.00k
      Timestamp            7/07/2017 11:24:42 PM

13270 data WANT2;
13271    SET HAVE1;
13272    if _n_=1 then do;
13273      dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
13274      HAVE2.definekey('ZIP_B');
13275      HAVE2.definedata('KEY_B','NAME_B');
13276      HAVE2.definedone();
13277      if 0 then set HAVE2;
13278    end;
13279    RC = HAVE2.find(key:ZIP_A) ;
13280    do while (RC = 0) ;
13281      output;
13282      RC=HAVE2.find_next(key:ZIP_A) ;
13283    end ;
13284    RC = HAVE2.find(key:ZIP2_A) ;
13285    do while (RC = 0) ;
13286      output;
13287      RC=HAVE2.find_next(key:ZIP2_A) ;
13288    end ;
13289    drop I RC;
13290 run;

NOTE: There were 5000 observations read from the data set WORK.HAVE2.
NOTE: There were 4000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.WANT2 has 11000000 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           3.27 seconds
      user cpu time       1.51 seconds
      system cpu time     1.17 seconds
      memory              697.37k
      OS Memory           74404.00k
      Timestamp            7/07/2017 11:24:51 PM

High-Performance SAS Coding - Third Edition

View solution in original post

Sven111 · Posted 07-06-2017 05:37 PM

I'm not sure if it'd make a huge difference but I've always had performance problems with OR's in JOIN conditions and try to avoid them, although I generally do all this in SQL Passthrough, I'm not sure if it makes a difference in straight PROC SQL. You could try something like this:

proc sql;
    create table want as
        select     a.*
                  ,b.key_b
                  ,b.name_b
                  ,(a.name_a=b.name_b) as match_ind
        from       have1 a 
        inner join have2 b 
                on a.zip_a = b.zip_b     
        UNION ALL
        select     a.*
                  ,b.key_b
                  ,b.name_b
                  ,(a.name_a=b.name_b) as match_ind
        from       have1 a 
        inner join have2 b 
                on a.zip2_a = b.zip_b
        order by    key_a, key_b 
;quit;

I'd also be interested to see alternate approaches to this though, I generally go for a SQL centric solution since that's what I'm most used to, but I love seeing other ways to do things.

LinusH · Posted 07-07-2017 04:10 AM

I don't have any code suggestions at this point, but some general guidelines.
- Allow as much memory as possible for the join to minimize swapping (MEMSIZE SORTSIZE)
- Use the PROC SQL option _method to anlyze how the query plan look like
- Moving the data to SPDE allows for better I/O and in some cases better WHERE clause evaluation. In this scenario you could try indexing - it <might> work...

Data never sleeps

ChrisNZ · Posted 07-07-2017 07:32 AM

A hash table is twice as fast on my machine.

data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
do i=1 to 1e3; output; end;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
run; 

data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
do i=1 to 1e3; output; end;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
run;

data WANT2;
  SET HAVE1;
  if _n_=1 then do;
    dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
    HAVE2.definekey('ZIP_B');
    HAVE2.definedata('KEY_B','NAME_B');
    HAVE2.definedone();
    if 0 then set HAVE2;
  end;
  RC = HAVE2.find(key:ZIP_A);   
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP_A) ;
  end ;
  RC = HAVE2.find(key:ZIP2_A) ;    
  do while (RC = 0) ;
    output;
    RC=HAVE2.find_next(key:ZIP2_A);
  end ;
  drop I RC;
run;

13250 proc sql;
13251 create table want as
13252 select a.*
13253 , b.key_b
13254 , b.name_b
13255 , (a.name_a=b.name_b) as match_ind
13256 from have1 a inner join have2 b on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
13257 order by key_a, key_b
13258 ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that
      can not be optimized.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT created, with 11000000 rows and 8 columns.

13258! quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.36 seconds
      user cpu time       9.15 seconds
      system cpu time     1.87 seconds
      memory              265375.66k
      OS Memory           338160.00k
      Timestamp            7/07/2017 11:24:42 PM

13270 data WANT2;
13271    SET HAVE1;
13272    if _n_=1 then do;
13273      dcl hash HAVE2(dataset:'HAVE2', multidata: 'Y');
13274      HAVE2.definekey('ZIP_B');
13275      HAVE2.definedata('KEY_B','NAME_B');
13276      HAVE2.definedone();
13277      if 0 then set HAVE2;
13278    end;
13279    RC = HAVE2.find(key:ZIP_A) ;
13280    do while (RC = 0) ;
13281      output;
13282      RC=HAVE2.find_next(key:ZIP_A) ;
13283    end ;
13284    RC = HAVE2.find(key:ZIP2_A) ;
13285    do while (RC = 0) ;
13286      output;
13287      RC=HAVE2.find_next(key:ZIP2_A) ;
13288    end ;
13289    drop I RC;
13290 run;

NOTE: There were 5000 observations read from the data set WORK.HAVE2.
NOTE: There were 4000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.WANT2 has 11000000 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           3.27 seconds
      user cpu time       1.51 seconds
      system cpu time     1.17 seconds
      memory              697.37k
      OS Memory           74404.00k
      Timestamp            7/07/2017 11:24:51 PM

High-Performance SAS Coding - Third Edition

triley · Posted 07-07-2017 08:24 AM

The Hash table took my query from around 20 minutes to around 20 seconds! Looks like I need to start learning how to use Hash tables more now...Thank you very much for providing this info!

One question though with hash tables, can i still use this to do a comparison of a variable that isn't brought in? So lets say in my example there is also an address for each record, for the matching i need to do i want to compare if the address is the same, do i need to declare that in the "define data" part, or i guess how could i put something equivelant to the sql: a.address=b.address?

ChrisNZ · Posted 07-07-2017 06:48 PM

Just add this line at the bottom of the datastep:

MATCH__IND=(NAME_A=NAME_B);

Many more great performance tips in the book linked from my signature, but not on joins,

The chapter on table joins will be in the third edition, but it is such a vast topic, with so many variations, and join examples are so space-consuming, that I need a lot more time to organise this chapter into something useful.

High-Performance SAS Coding - Third Edition

triley · Posted 07-09-2017 12:52 PM

What if in the tables the variables are named the same?...my "real" data has both the "names" called name (so instead of name_a=name_b it is name=name)

ChrisNZ · Posted 07-09-2017 04:31 PM

Just add a rename dataset option wherever needed. Even in the hash table's data set.
HAVE2 (rename=(NAME=NAMEB))

High-Performance SAS Coding - Third Edition

triley · Posted 07-10-2017 11:45 AM

Chris I really appreciate all the help. I wasn't able to get the MATCH_IND to work though...I tried putting it everywhere in between data; and run; and it didn't get me values for every record. Could you maybe run it out and verify it is working on your end and then provide me with the code you got to work?

Thanks again...

ChrisNZ · Posted 07-10-2017 05:13 PM

This works.


data WANT2;
  SET HAVE1;
  if _n_=1 then do;
    dcl hash HAVE2(dataset:'HAVE2 (rename=(NAME=NAME_B))', multidata: 'Y');
    HAVE2.definekey('ZIP_B');
    HAVE2.definedata('KEY_B','NAME_B');
    HAVE2.definedone();
    if 0 then set HAVE2(rename=(NAME=NAME_B));
  end;
  RC = HAVE2.find(key:ZIP_A);   
  do while (RC = 0) ;
    MATCH_IND=(NAME_A=NAME_B);  
    output;
    RC=HAVE2.find_next(key:ZIP_A) ;
  end ;
  RC = HAVE2.find(key:ZIP2_A) ;    
  do while (RC = 0) ;
    MATCH_IND=(NAME_A=NAME_B);  
    output;
    RC=HAVE2.find_next(key:ZIP2_A);
  end ;
 drop I RC;
run;

Apologies for confusing you. Since there are output statements, the value should be derived before the data is output.

High-Performance SAS Coding - Third Edition

triley · Posted 07-10-2017 05:31 PM

Great, thank you...One more thing and I promise I won't ask anything additional of you...In my real data I have a "zip2_b" as well. I tried just adding it here (HAVE2.definekey('ZIP_B','ZIP2_B'); but it gave me an error. Essentially I want to do the same thing as a join like this:

where a.zip_a=b.zip_b

or a.zip_a=b.zip2_b

or a.zip2_a=b.zip_b

or a.zip2_a=b.zip2_b

ChrisNZ · Posted 07-10-2017 06:08 PM

(HAVE2.definekey('ZIP_B','ZIP2_B');

means you match key 1 AND key 2.

Since you want OR, you need to load 2 hash tables and scan each of them twice, like already do for the first one, so you essentially repeat the existing code twice in that data step.

High-Performance SAS Coding - Third Edition

triley · Posted 07-11-2017 08:31 AM

Makes sense...Thank you!

Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Re: Table Joining

Registration is open

SAS Training: Just a Click Away