Hi guys,
I tried all your suggestions, here are the results. (HUGE is 70GB, SMALL is 10MB)
HASH solution, by @KachiM: amazing, it does all the job in 7 minutes, including CPU time and I/O time (writing 70GB on disk takes time, just doing "cp HUGE HUGE2" takes 3 minutes) Pro: fastest solution Cons: since it creates a new table, it wastes I/O time, needs double disk space, and needs to recreate the index after the operation Comments: I didn't know hashes, so I read about them on documentation; in your opinion, do you think that loading both tables in memory would speed up more? Are there more sophisticated strategies to increase performance with hashes? Moreover, what happens if the system does not have enough ram to hold the tables, it just slows down or sas crashes?
DELETE solution, by @PGStats: thank you for the correct syntax, it will be surely useful! However after 1 hour I aborted it. Pro: Rollback available, no double space, no need to recreate the index Cons: too much time Comments: too much time for production environment
MODIFY solution, by @ChrisNZ: quite good solution, it takes about 20 minutes to complete Pro: no double space, no need to recreate the index, quite fast Cons: not as fast as HASH
MODIFY with HASH solution, by Me:
data HUGE;
modify HUGE;
if _n_ = 1 then do;
declare hash h(dataset:'SMALL');
h.definekey('K1','K2');
h.definedone();
end;
if h.find() ^= 0 then remove;
run;
I taught I had a good idea mixing your solution, however after 1 hour I aborted it
So I think I'll use the first solution, any other comments and suggestions are wery welcome!
Thanks a lot
Regards
... View more