BookmarkSubscribeRSS Feed
ciro
Quartz | Level 8

Hi everybody,

I need to perform a join between a main table with about 200 millions records (call it table A) and a table with about 3 millions records (table B).

both have repeated observations (monthly)  per subject within a year.

the merging key is composed by two variables that identify the subject (X, Y) and the month (M).

The resulting table should be composed of all records of table A plus the records not found in table A but only in B. For these last type of records I would like to assign some of the variables from the closest month (if any) according to the following rule:

the closest previous month or, if there are no record from the past, the closest subsequent month.

both tables resides on a Oracle DBMS and are indexed by the variables (X,Y,M), although I can download them to use in sas

I would appreciate any suggestions. If hash objects are involved I would need some more explanations as I am quite new to this world. However I'm willing to learn about them.

Thank you very much in advance

 

10 REPLIES 10
PGStats
Opal | Level 21

Hi.

A small example (table A and B before and table A after) would greatly help understand what you want.

PG

PG
DanielSantos
Barite | Level 11

SAS is really needed? If both tables reside into ORACLE, have you considered running the query there?

If you do need the result in SAS I would br advisable to perform SQL pass-trough (inside ORALCE) and just retrieve the result.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

LinusH
Tourmaline | Level 20

From a performance view, writing a pass-thru SQL could be wise.

But by the description of the exception handling (only from table B, then...) could be quite intriguing to write in pure SQL. Language wise, I think the data step is preferable here. maybe hash, or last.-processing after a full join.

Is this an operations that should be done often, or is it an ad-hoc situation?

Data never sleeps
DanielSantos
Barite | Level 11

From a quick look to ORACLE's own documentation this should do the trick for exception handling:

select * from B where not exists (select * from A where A.X=B.X and A.Y=B.Y AND A.M=B.M)

This would return B except A by (X,Y,M).

Then it's just a matter of joining A back to the result, say with the UNION or if you need the whole thing in SAS bring back those two pieces of data and append them together.

With that volume of data and for most case, I guess it's better to let the DBMS handle it's own data and just transfer the results.

If you choose the "Hash/Merge way" be cautious with multiple keys on both sides (N to N relations).

Merge won't handle this on a traditional way, same for Hash bellow SAS v9.2.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

ciro
Quartz | Level 8

thank you to all for the answers.

following is an example of what i need.

as for the operations it is not to be performed often, so I can afford to bring the data soon in  sas and then perform the join.

another thing that i forgot and is not in the example is that I can have multiple records for the key (X,Y,M) in dataset A, though not frequently. In that case let's suppose I need variables from one record randomly for the moment

data A;

input X $ Y $ M Z $ var1 ;

cards;

A 0001 1 P 10

A 0001 2 P 11

A 0001 3 P 12

A 0001 4 P 13

A 0001 8 P 17

A 0001 9 P 18

A 0001 10 P 20

A 0002 4 P 21

A 0002 5 P 23

A 0002 6 P 25

A 0002 7 P 27

A 0002 8 P 27

A 0002 9 P 20

A 0002 10 P 15

A 0002 11 P 14

A 0002 12 P 12

B 0041 1 F 18

B 0041 2 F 18

B 0041 3 F 19

B 0041 4 F 20

B 0041 5 F 21

B 0041 6 F 17

B 0041 7 F 17

B 0041 8 F 17

B 0041 9 F 20

;

run;

data B;

input X $ Y $ M S $ var2 ;

cards;

A 0001 5 E 4

A 0001 6 E 4

A 0001 7 E 6

A 0001 10 O 2

A 0001 12 E 6

A 0002 3 E 4

B 0041 8 O 5

B 0041 9 O 6

B 0041 10 E 6

B 0041 11 E 6

;

run;

data A_after;

infile datalines delimiter=',';

input X $ Y $ M Z $ var1 S $ var2;

cards;

A,0001,1,P,10, ,

A,0001,2,P,11, ,

A,0001,3,P,12, ,

A,0001,4,P,13, ,

A,0001,5,P,13,E,4

A,0001,6,P,13,E,4

A,0001,7,P,13,E,6

A,0001,8,P,17, ,

A,0001,9,P,18, ,

A,0001,10,P,20,O,2

A,0001,12,P,20,E,6

A,0002,3,P,21,E,4

A,0002,4,P,21, ,

A,0002,5,P,23, ,

A,0002,6,P,25, ,

A,0002,7,P,27, ,

A,0002,8,P,27, ,

A,0002,9,P,20, ,

A,0002,10,P,15, ,

A,0002,11,P,14, ,

A,0002,12,P,12, ,

B,0041,1,F,18, ,

B,0041,2,F,18, ,

B,0041,3,F,19, ,

B,0041,4,F,20, ,

B,0041,5,F,21, ,

B,0041,6,F,17, ,

B,0041,7,F,17, ,

B,0041,8,F,17,O,5

B,0041,9,F,20,O,6

B,0041,10,F,20,E,6

B,0041,11,F,20,E,6

;

run;

PGStats
Opal | Level 21

Isn't very elegant, but does the job efficiently (efficiency claim untested Smiley Happy), assumes datasets are sorted by X Y M:

data A_temp(drop=r_:) A_fill(keep=X Y M Z var1) ;
merge a (in=inA) b;
by X Y M;
if inA then do;
     r_Z = Z;
     r_var1 = var1;
     end;
retain r_Z r_var1;
Z = coalescec(Z, r_Z);
var1 = coalesce(var1, r_var1);
output A_temp;
if missing(Z) or missing(lag(Z)) and not first.Y then output A_fill;
if last.Y then call missing(r_Z, r_var1);
run;

proc sort data=A_fill; by X Y descending M; run;

data A_upd(drop=r_:);
set A_fill;
by X Y;
if missing(Z) and not first.Y then do;
     Z = r_Z;
     var1 = r_var1;
     output;
     end;
else do;
     r_Z = Z;
     r_var1 = var1;
     end;
retain r_Z r_var1;
if last.Y then call missing(r_Z, r_var1);
run;

proc sort data=A_upd; by X Y M; run;

data A_after;
update A_temp A_upd;
by X Y M;
run;


proc print data=a_after; run;

PG

PG
Ksharp
Super User

It is not too fast.

Assuming there is unique combination of x y m for both datasets.

data A;
input X $ Y $ M Z $ var1 ;
cards;
A 0001 1 P 10
A 0001 2 P 11
A 0001 3 P 12
A 0001 4 P 13
A 0001 8 P 17
A 0001 9 P 18
A 0001 10 P 20
A 0002 4 P 21
A 0002 5 P 23
A 0002 6 P 25
A 0002 7 P 27
A 0002 8 P 27
A 0002 9 P 20
A 0002 10 P 15
A 0002 11 P 14
A 0002 12 P 12
B 0041 1 F 18
B 0041 2 F 18
B 0041 3 F 19
B 0041 4 F 20
B 0041 5 F 21
B 0041 6 F 17
B 0041 7 F 17
B 0041 8 F 17
B 0041 9 F 20
;
run;
data B;
input X $ Y $ M S $ var2 ;
cards;
A 0001 5 E 4
A 0001 6 E 4
A 0001 7 E 6
A 0001 10 O 2
A 0001 12 E 6
A 0002 3 E 4
B 0041 8 O 5
B 0041 9 O 6
B 0041 10 E 6
B 0041 11 E 6
;
run;
data temp(drop=_:);
 set a(in=ina) b ;
 by x y m;
 retain _z _var1;
 if ina then do;_z=z;_var1=var1; end;
   else  do;z=_z;var1=_var1; end;
run;
data want;
 set temp;
 by x y m ;
 if last.m;
run;



Ksharp

Message was edited by: xia keshan

mkeintz
PROC Star

f you know the largest gap in A, (i.e. the largest number of B records that separate 2 A records), then you can do this (which assumes the largest gap is 4 B records).  It provides  a single step solution, without reversion to hash or DOW loop:

data want;

  set a b;

  by x y m;

  z=coalescec(z,lag(z),lag2(z),lag3(z),lag4(z));

  var1=coalesce(var1,lag(var1),lag2(var1),lag3(var1),lag4(var1));

run;

This assumes (1) A and B are sorted by x y m, and (2) first record for a given X/Y is from A which always has a valid value for Z and VAR1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ciro
Quartz | Level 8

thank you everybody. I'll try some of the suggestions. no hashing though. isn't it useful in such a situation? As a further information I need the data in sas (not sure if temporary or permanent dataset) they are subject to further processing. 

Astounding
PROC Star

OK, one more country heard from:

data want;

   merge a (in=ina) b (in=inb);

   by x y m;

   if first.y then a_record_found=0;

   if ina then do;

      a_records_read + 1;

      a_record_found = 1;

      retain a_record_found;

      output;

   end;

   if inb and not ina;

   if a_record_found then do;

      pointer = a_records_read -1;

      set a point=pointer (keep=var1);

      output;

   end;

   else do;

      pointer = a_records_read + 1;

      set a pointer=pointer (keep=x y var1 rename=(x=proposed_match_x y=proposed_match_y var1=proposed_var1));

      if x=proposed_match_x and y=proposed_match_y then var1=proposed_var1;

      output;

   end;

   drop proposed_: a_record:;

run;

This might get a little unwieldy, depending on how many variables are really represented by "var1".  And it does account for a new subject (X Y) appearing in B that doesn't appear in A (as long as that new subject is NOT the last one in the combined file ... tweaks are possible if necessary).

Good luck.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1375 views
  • 0 likes
  • 7 in conversation