You have raised some queries in your comment on using hash objects.

[1] Loading both data sets into hash tables: A hash table holding HUGE, with its key and non-key variables, is likely to run out of memory, and the program then aborts in the middle. The first part of the program:

   data want;
      if _n_ = 1 then do;
         if 0 then set SMALL;
         declare hash h(dataset:'SMALL');
         h.definekey('K1', 'K2');
         h.definedone();
      end;

As SMALL is very small compared to HUGE, I loaded only the keys (K1 and K2) into the hash table. Internally, when the data part of the hash table is left out, each hash entry consumes twice the combined length of K1 and K2, because the keys are duplicated as the default data part. We can save some memory by defining a smaller data part with

   h.definedata('K1');

before the h.definedone() statement. The memory consumed per entry is then the combined length of K1, K2, and K1, a nominal reduction. (A small probe that measures the per-item size directly is sketched at the very end of this reply.)

[2] Sophisticated strategies to increase performance: This is a good question when the code is meant for production jobs. SAS has more than one way to solve any problem. I recommend trying out all feasible SAS solutions, noting both time and memory; by time I mean both REAL and CPU time. Then choose the best method.

Use of HASHEXP: With the hash-object solution, the HASHEXP option can be tuned. By default it is 8, and values from 0 to 20 are allowed. The right value for a given data set has to be found by trial, which means any value other than 8 must be verified empirically rather than used blindly.

Keep only K1 and K2 of HUGE: The I/O time to read the entire record of HUGE can be reduced, but further processing is then warranted, as noted below. The statement

   set HUGE(keep = K1 K2);

brings only K1 and K2 into the Program Data Vector (PDV). The record number (_N_) of HUGE can be saved for each matched record, and a following data step can use the POINT= option with MODIFY to mark those records as deleted (a minimal sketch of this two-step route appears after the revised program in [4]). Alternatively,

   set HUGE;
   flag = 0;
   if h.find() = 0 then flag = 1;
   run;

identifies the records to delete (flag = 1).

[3] System does not have enough RAM: This comes down to what goes into the hash table. HUGE does not strain RAM, because only one of its records is brought into the PDV at a time, so consider the SMALL data set. If K1 and K2 are both numeric, each hash entry of the first version of the code takes 32 bytes (8 + 8 for the keys, doubled because the keys also serve as the default data part), plus the extra 16 bytes that SAS requires. Introducing the data part as observed above saves only 8 bytes per entry. If both keys are character, each key's length is rounded up to a multiple of 8 bytes. Suppose K1 is 10 bytes long and K2 is 3: K1 then occupies 16 bytes and K2 occupies 8, so an entry amounts to 24 + 24 bytes plus the extra 16 bytes that SAS requires. When everything else fails, one possibility is to split the SMALL data set into as many parts as will each fit in the hash table; the HUGE data set then has to be processed as many times as there are parts, as sketched just below.
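To make that splitting idea concrete, here is a minimal sketch. It is not from the program above: the macro name FLAG_IN_PARTS, its NPARTS parameter, and the MOD-based slicing rule are all illustrative, and every pass rewrites HUGE, so the I/O cost grows with the number of parts.

   %macro flag_in_parts(nparts=4);
      %do p = 1 %to &nparts;
         data HUGE;
            if _n_ = 1 then do;
               declare hash h();
               h.definekey('K1', 'K2');
               h.definedone();
               /* load only the current slice of SMALL into the hash table */
               do i = 1 by 1 until(last);
                  set SMALL(keep = K1 K2) end = last;
                  if mod(i, &nparts) + 1 = &p then rc = h.add();
               end;
            end;
            set HUGE;
            if &p = 1 then flag = 0;          /* initialize FLAG on the first pass */
            if h.find() = 0 then flag = 1;    /* accumulate matches across passes  */
            drop i rc;
         run;
      %end;
   %mend flag_in_parts;

   %flag_in_parts(nparts=4);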
[4] Revised program that flags the records with 1 for deletion and 0 for inclusion:

data want;
   if _n_ = 1 then do;
      declare hash h(HASHEXP:8);            /* the default bucket count; tune per [2] */
      h.definekey('K1', 'K2');
      h.definedata('flag');                 /* data part; FLAG is missing at load time */
      h.definedone();
      /* load the keys of SMALL into the hash table */
      do until(last);
         set SMALL end = last;
         rc = h.add();                      /* capture RC so duplicate keys in SMALL */
      end;                                  /* do not write errors to the log        */
   end;
   set HUGE;
   flag = 0;
   /* a successful FIND first overwrites FLAG with the stored missing value, */
   /* so it is set to 1 afterwards; unmatched records keep flag = 0          */
   if h.find() = 0 then flag = 1;
   drop rc;
run;

The WANT data set is HUGE with the additional variable FLAG. For further processing, the WHERE= data set option can be used on the SET statement, as in:

data NEED;
   set WANT(where=(flag = 0));
   ... other processing ...
run;
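For the two-step deletion route mentioned in [2], here is a minimal sketch. The data set name MATCHED and the variable RECID are illustrative, not from the original post; CUROBS= captures the current observation number on the SET statement, and MODIFY with POINT= plus REMOVE then marks the matched observations of HUGE as deleted in place:

   /* Step 1: collect observation numbers of HUGE records whose keys occur in SMALL */
   data MATCHED(keep = recid);
      if _n_ = 1 then do;
         declare hash h(dataset:'SMALL(keep = K1 K2)');
         h.definekey('K1', 'K2');
         h.definedone();
      end;
      set HUGE(keep = K1 K2) curobs = recid;
      if h.find() = 0;                 /* subsetting IF: keep matched records only */
   run;

   /* Step 2: mark those observations of HUGE as deleted, without rewriting it */
   data HUGE;
      set MATCHED;
      modify HUGE point = recid;
      remove;
   run;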
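Finally, on the memory arithmetic in [1] and [3]: rather than trusting the byte counts, they can be measured. Below is a small probe, assuming SAS 9.2 or later where the hash object exposes the ITEM_SIZE attribute; the variable names SIZE1 and SIZE2 are illustrative, and the HASHEXP:16 on the second table is only there as an example of the tuning discussed in [2].

   data _null_;
      if 0 then set SMALL;                   /* put K1 and K2 into the PDV */

      /* default: with no DEFINEDATA, the keys are duplicated as the data part */
      declare hash h1(dataset:'SMALL');
      h1.definekey('K1', 'K2');
      h1.definedone();

      /* trimmed data part, plus a larger bucket count to benchmark */
      declare hash h2(dataset:'SMALL', hashexp:16);
      h2.definekey('K1', 'K2');
      h2.definedata('K1');
      h2.definedone();

      size1 = h1.item_size;
      size2 = h2.item_size;
      put 'Bytes per item, default data part: ' size1;
      put 'Bytes per item, trimmed data part: ' size2;
   run;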