DATA Step, Macro, Functions and more

complex join with large tables

Reply
Contributor
Posts: 52

complex join with large tables

Hi everybody,

I need to perform a join between a main table with about 200 millions records (call it table A) and a table with about 3 millions records (table B).

both have repeated observations (monthly)  per subject within a year.

the merging key is composed by two variables that identify the subject (X, Y) and the month (M).

The resulting table should be composed of all records of table A plus the records not found in table A but only in B. For these last type of records I would like to assign some of the variables from the closest month (if any) according to the following rule:

the closest previous month or, if there are no record from the past, the closest subsequent month.

both tables resides on a Oracle DBMS and are indexed by the variables (X,Y,M), although I can download them to use in sas

I would appreciate any suggestions. If hash objects are involved I would need some more explanations as I am quite new to this world. However I'm willing to learn about them.

Thank you very much in advance

 

Respected Advisor
Posts: 4,935

Re: complex join with large tables

Hi.

A small example (table A and B before and table A after) would greatly help understand what you want.

PG

PG
Super Contributor
Posts: 474

Re: complex join with large tables

SAS is really needed? If both tables reside into ORACLE, have you considered running the query there?

If you do need the result in SAS I would br advisable to perform SQL pass-trough (inside ORALCE) and just retrieve the result.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Super User
Posts: 5,441

Re: complex join with large tables

Posted in reply to DanielSantos

From a performance view, writing a pass-thru SQL could be wise.

But by the description of the exception handling (only from table B, then...) could be quite intriguing to write in pure SQL. Language wise, I think the data step is preferable here. maybe hash, or last.-processing after a full join.

Is this an operations that should be done often, or is it an ad-hoc situation?

Data never sleeps
Super Contributor
Posts: 474

Re: complex join with large tables

From a quick look to ORACLE's own documentation this should do the trick for exception handling:

select * from B where not exists (select * from A where A.X=B.X and A.Y=B.Y AND A.M=B.M)

This would return B except A by (X,Y,M).

Then it's just a matter of joining A back to the result, say with the UNION or if you need the whole thing in SAS bring back those two pieces of data and append them together.

With that volume of data and for most case, I guess it's better to let the DBMS handle it's own data and just transfer the results.

If you choose the "Hash/Merge way" be cautious with multiple keys on both sides (N to N relations).

Merge won't handle this on a traditional way, same for Hash bellow SAS v9.2.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Contributor
Posts: 52

Re: complex join with large tables

Posted in reply to DanielSantos

thank you to all for the answers.

following is an example of what i need.

as for the operations it is not to be performed often, so I can afford to bring the data soon in  sas and then perform the join.

another thing that i forgot and is not in the example is that I can have multiple records for the key (X,Y,M) in dataset A, though not frequently. In that case let's suppose I need variables from one record randomly for the moment

data A;

input X $ Y $ M Z $ var1 ;

cards;

A 0001 1 P 10

A 0001 2 P 11

A 0001 3 P 12

A 0001 4 P 13

A 0001 8 P 17

A 0001 9 P 18

A 0001 10 P 20

A 0002 4 P 21

A 0002 5 P 23

A 0002 6 P 25

A 0002 7 P 27

A 0002 8 P 27

A 0002 9 P 20

A 0002 10 P 15

A 0002 11 P 14

A 0002 12 P 12

B 0041 1 F 18

B 0041 2 F 18

B 0041 3 F 19

B 0041 4 F 20

B 0041 5 F 21

B 0041 6 F 17

B 0041 7 F 17

B 0041 8 F 17

B 0041 9 F 20

;

run;

data B;

input X $ Y $ M S $ var2 ;

cards;

A 0001 5 E 4

A 0001 6 E 4

A 0001 7 E 6

A 0001 10 O 2

A 0001 12 E 6

A 0002 3 E 4

B 0041 8 O 5

B 0041 9 O 6

B 0041 10 E 6

B 0041 11 E 6

;

run;

data A_after;

infile datalines delimiter=',';

input X $ Y $ M Z $ var1 S $ var2;

cards;

A,0001,1,P,10, ,

A,0001,2,P,11, ,

A,0001,3,P,12, ,

A,0001,4,P,13, ,

A,0001,5,P,13,E,4

A,0001,6,P,13,E,4

A,0001,7,P,13,E,6

A,0001,8,P,17, ,

A,0001,9,P,18, ,

A,0001,10,P,20,O,2

A,0001,12,P,20,E,6

A,0002,3,P,21,E,4

A,0002,4,P,21, ,

A,0002,5,P,23, ,

A,0002,6,P,25, ,

A,0002,7,P,27, ,

A,0002,8,P,27, ,

A,0002,9,P,20, ,

A,0002,10,P,15, ,

A,0002,11,P,14, ,

A,0002,12,P,12, ,

B,0041,1,F,18, ,

B,0041,2,F,18, ,

B,0041,3,F,19, ,

B,0041,4,F,20, ,

B,0041,5,F,21, ,

B,0041,6,F,17, ,

B,0041,7,F,17, ,

B,0041,8,F,17,O,5

B,0041,9,F,20,O,6

B,0041,10,F,20,E,6

B,0041,11,F,20,E,6

;

run;

Respected Advisor
Posts: 4,935

Re: complex join with large tables

Isn't very elegant, but does the job efficiently (efficiency claim untested Smiley Happy), assumes datasets are sorted by X Y M:

data A_temp(drop=r_Smiley Happy A_fill(keep=X Y M Z var1) ;
merge a (in=inA) b;
by X Y M;
if inA then do;
     r_Z = Z;
     r_var1 = var1;
     end;
retain r_Z r_var1;
Z = coalescec(Z, r_Z);
var1 = coalesce(var1, r_var1);
output A_temp;
if missing(Z) or missing(lag(Z)) and not first.Y then output A_fill;
if last.Y then call missing(r_Z, r_var1);
run;

proc sort data=A_fill; by X Y descending M; run;

data A_upd(drop=r_Smiley Happy;
set A_fill;
by X Y;
if missing(Z) and not first.Y then do;
     Z = r_Z;
     var1 = r_var1;
     output;
     end;
else do;
     r_Z = Z;
     r_var1 = var1;
     end;
retain r_Z r_var1;
if last.Y then call missing(r_Z, r_var1);
run;

proc sort data=A_upd; by X Y M; run;

data A_after;
update A_temp A_upd;
by X Y M;
run;


proc print data=a_after; run;

PG

PG
Super User
Posts: 10,046

Re: complex join with large tables

It is not too fast.

Assuming there is unique combination of x y m for both datasets.

data A;
input X $ Y $ M Z $ var1 ;
cards;
A 0001 1 P 10
A 0001 2 P 11
A 0001 3 P 12
A 0001 4 P 13
A 0001 8 P 17
A 0001 9 P 18
A 0001 10 P 20
A 0002 4 P 21
A 0002 5 P 23
A 0002 6 P 25
A 0002 7 P 27
A 0002 8 P 27
A 0002 9 P 20
A 0002 10 P 15
A 0002 11 P 14
A 0002 12 P 12
B 0041 1 F 18
B 0041 2 F 18
B 0041 3 F 19
B 0041 4 F 20
B 0041 5 F 21
B 0041 6 F 17
B 0041 7 F 17
B 0041 8 F 17
B 0041 9 F 20
;
run;
data B;
input X $ Y $ M S $ var2 ;
cards;
A 0001 5 E 4
A 0001 6 E 4
A 0001 7 E 6
A 0001 10 O 2
A 0001 12 E 6
A 0002 3 E 4
B 0041 8 O 5
B 0041 9 O 6
B 0041 10 E 6
B 0041 11 E 6
;
run;
data temp(drop=_:);
 set a(in=ina) b ;
 by x y m;
 retain _z _var1;
 if ina then do;_z=z;_var1=var1; end;
   else  do;z=_z;var1=_var1; end;
run;
data want;
 set temp;
 by x y m ;
 if last.m;
run;



Ksharp

Message was edited by: xia keshan

Trusted Advisor
Posts: 1,022

Re: complex join with large tables

f you know the largest gap in A, (i.e. the largest number of B records that separate 2 A records), then you can do this (which assumes the largest gap is 4 B records).  It provides  a single step solution, without reversion to hash or DOW loop:

data want;

  set a b;

  by x y m;

  z=coalescec(z,lag(z),lag2(z),lag3(z),lag4(z));

  var1=coalesce(var1,lag(var1),lag2(var1),lag3(var1),lag4(var1));

run;

This assumes (1) A and B are sorted by x y m, and (2) first record for a given X/Y is from A which always has a valid value for Z and VAR1.

Contributor
Posts: 52

Re: complex join with large tables

thank you everybody. I'll try some of the suggestions. no hashing though. isn't it useful in such a situation? As a further information I need the data in sas (not sure if temporary or permanent dataset) they are subject to further processing. 

Super User
Posts: 5,518

Re: complex join with large tables

OK, one more country heard from:

data want;

   merge a (in=ina) b (in=inb);

   by x y m;

   if first.y then a_record_found=0;

   if ina then do;

      a_records_read + 1;

      a_record_found = 1;

      retain a_record_found;

      output;

   end;

   if inb and not ina;

   if a_record_found then do;

      pointer = a_records_read -1;

      set a point=pointer (keep=var1);

      output;

   end;

   else do;

      pointer = a_records_read + 1;

      set a pointer=pointer (keep=x y var1 rename=(x=proposed_match_x y=proposed_match_y var1=proposed_var1));

      if x=proposed_match_x and y=proposed_match_y then var1=proposed_var1;

      output;

   end;

   drop proposed_: a_record:;

run;

This might get a little unwieldy, depending on how many variables are really represented by "var1".  And it does account for a new subject (X Y) appearing in B that doesn't appear in A (as long as that new subject is NOT the last one in the combined file ... tweaks are possible if necessary).

Good luck.

Ask a Question
Discussion stats
  • 10 replies
  • 629 views
  • 0 likes
  • 7 in conversation