06-16-2016 11:33 AM
I have a huge dataset (let's name it HUGE). HUGE has:
- 500 million rows
- a composite unique primary key, let's name it (K1,K2) (two columns belong to the key)
- some other columns in addition to the key
Next I have a small dataset (let's name it SMALL). SMALL has:
- 200k rows
- exactly the same composite unique primary key (K1,K2)
- no more columns in addition to the key.
- HUGE: K1 | K2 | FIELD1 | ANOTHERFIELD2 | SOMEOTHERFIELD3 | ETC, UPK(K1,K2)
- SMALL: K1 | K2, UPK(K1,K2)
I need to delete from HUGE all the rows whose key appears in SMALL, so that after this process HUGE and SMALL do not share any common key.
I first tried with PROC SQL and DELETE FROM, but I got crazy about how to use the right syntax... on a dbms I'd have done:
DELETE FROM HUGE WHERE (K1,K2) NOT IN (SELECT K1,K2 FROM SMALL)
but SAS does not allow this syntax.
Next I tried with this:
merge HUGE(in=in1) SMALL(in=in2); by K1 K2;
if (in1 and in2) then delete;
This works, but it takes almost a day to complete due to HUGE size.
What would you recommend?
Thanks a lot
06-16-2016 12:08 PM
Try a Hash Solution. The SMALL goes to Hash table. The HUGE can be processed one record after another. The Matchin record will be ignored.
if _n_ = 1 then do;
if 0 then set SMALL;
declare hash h(dataset:'SMALL');
if h.find() ^= 0;
06-16-2016 01:23 PM
It's hard to know what will be optimized, beforehand. Try :
DELETE FROM HUGE as H WHERE exists (select K1, K2 from SMALL where K1=H.K1 and K2=H.K2);
06-17-2016 11:50 AM
06-17-2016 01:17 PM
I do not understand your comment.
To me, Hash, Array or any other SAS way to solve a problem hinges on either minimizing run-time or memory or both. The data type of K1 and K2 is not known, I used Hash Object. If they were NUMBERs, I would have solved the problem by using Array.
I will appreciate the superiority of a method must be empirically verified before offering comments.
06-17-2016 01:59 PM
It's not always about minimizing "run-time or memory or both" but often about minimysing data transfer times. A hash solution will always require the transfer of whole table(s) into memory, and that can take a long time. Which is what @LinusH was talking about. Optimal procedures will be the best compromise within the constraints set by CPU, memory and network resources.
06-18-2016 02:53 AM
@LinusH I am enjoining use Hash Table. It seems it is my last resort when I face a tough question and don't know what to do. I think index should be retired . *^_^*
06-17-2016 12:13 PM
06-19-2016 08:14 PM
As mentioned (sorry if I repeat here some of the valid points made before), the key for speed for this type of operation is not to read the whole table and write it out again.
proc sql ; create ...
data HUGE; set/merge HUGE;
Instead modifying the table in place is the way to go.
The data step's modify statement and proc sql's delete statements are prime candidates for this. They allow "soft deletion" where the observation is still there (hence no rewritting the whole data set) but marked as deleted. proc contents shows the deleted observations.
Sadly, proc sql doesn't use the index for the syntax that @PGStats proposes (I can't find a better way, any other proposal?), but the modify statement does and hence is very fast.
data HUGE(index=(A=(I J))) ; array A ; do I=1 to 50e6; J=I; output; end; run; data SMALL; do I=1 to 50e6; J=I; if ranuni(0)> .9999 then output; end; run; options msglevel=i; %* MODIFY statement: 0.5 seconds; data HUGE; set SMALL ; modify HUGE key=A; if _iorc_=%sysrc(_sok) then remove; else _error_=0; run; proc contents data=HUGE; run; %* MERGE statement: 110 seconds; data HUGE; merge HUGE SMALL(in=BAD); by I J; if not BAD; run; %* DELETE SQL statement: hours?; proc sql; delete from HUGE as H where exists (select I, J from SMALL where I=H.I and J=H.J); quit; proc contents data=HUGE; run;
More benchmarks and discussions about performance in https://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490
06-19-2016 11:31 PM
If table HUGE has a single field UNIQUE index, it seems you can give proc SQL a decent performance by translating your two-fields key into the single key and use IN instead of EXISTS. The only reason I see for going to the trouble of doing this is the rollback ability of SQL which isn't provided by datastep MERGE or MODIFY operations. I modified @ChrisNZ test code as follows
options msglevel=i; data HUGE(index=(A=(I J) ID/unique)) ; array A ; do I=1 to 10e6; J=I; ID + 1; output; end; run; data SMALL; do I=1 to 10e6; J=I; if ranuni(0)> .9999 then output; end; run; proc sql /* undo_policy=none */ ; /* Optimized with index A */ create table IDS as select H.ID from HUGE as H inner join SMALL as S on H.i=S.i and H.j=S.j; /* No mention of index ID, but runs in a couple of minutes */ delete from HUGE as H where ID in (select ID from IDS); quit;
in my tests, adding undo_policy=none didn't improve performance.
06-20-2016 06:10 PM - edited 06-20-2016 06:13 PM
@PGStats SAS did not use the ID index in your example.
Even using the idxwhere= option doesn't trigger the index.
A fixed list does though. I reported this issue long ago and it seems it is still here.
data HUGE1(index=(A=(I J) ID/unique)) HUGE2(index=(A=(I J) ID/unique)) SMALL(keep=I J); array A ; do I=1 to 10e6; J=I; ID + 1; if ranuni(0)> .9999 then output SMALL; output HUGE1 HUGE2; end; run; proc sql noprint; /* Optimized with index A */ create table IDS as select H.ID from HUGE1 as H inner join SMALL as S on H.i=S.i and H.j=S.j; select ID into :ids separated by ',' from IDS; quit; proc sql; /* No index used: 11 seconds */ delete from HUGE1(idxwhere=yes) where ID in (select ID from IDS); quit; proc sql; /* index ID used: 0.2 seconds */ delete from HUGE2 where ID in (&IDS); quit;
The modify statement looks like the easiest way to speed up this job.
@EduxEdux Did this work for you?
06-22-2016 04:32 AM
I tried all your suggestions, here are the results. (HUGE is 70GB, SMALL is 10MB)
I taught I had a good idea mixing your solution, however after 1 hour I aborted it
data HUGE; modify HUGE; if _n_ = 1 then do; declare hash h(dataset:'SMALL'); h.definekey('K1','K2'); h.definedone(); end; if h.find() ^= 0 then remove; run;
So I think I'll use the first solution, any other comments and suggestions are wery welcome!
Thanks a lot
06-22-2016 04:55 AM
It is best position for PROC DS2.
data have; set sashelp.class; run; proc ds2; data Male(overwrite=yes); method run(); set have; if sex eq 'M' then output; end; enddata; run; quit;