Re: Fastest way to delete rows from a dataset by key

Edoedoedo · Posted 06-16-2016 11:33 AM

Hi,

I have a huge dataset (let's name it HUGE). HUGE has:

- 500 million rows

- a composite unique primary key, let's name it (K1,K2) (two columns belong to the key)

- some other columns in addition to the key

Next I have a small dataset (let's name it SMALL). SMALL has:

- 200k rows

- exactly the same composite unique primary key (K1,K2)

- no more columns in addition to the key.

I.e.:

- SMALL: K1 | K2, UPK(K1,K2)

I need to delete from HUGE all the rows whose key appears in SMALL, so that after this process HUGE and SMALL do not share any common key.

I first tried with PROC SQL and DELETE FROM, but I got crazy about how to use the right syntax... on a dbms I'd have done:

DELETE FROM HUGE WHERE (K1,K2) NOT IN (SELECT K1,K2 FROM SMALL)

but SAS does not allow this syntax.

Next I tried with this:

data HUGE;
merge HUGE(in=in1) SMALL(in=in2); by K1 K2;
if (in1 and in2) then delete;
run;

This works, but it takes almost a day to complete due to HUGE size.

What would you recommend?

Thanks a lot

Regards

KachiM · Posted 06-16-2016 12:08 PM

Try a Hash Solution. The SMALL goes to Hash table. The HUGE can be processed one record after another. The Matchin record will be ignored.

data want;
if _n_ = 1 then do;
if 0 then set SMALL;
declare hash h(dataset:'SMALL');
h.definekey('K1', 'K2');
h.definedone();
end;
set HUGE;
if h.find() ^= 0;
run;

PGStats · Posted 06-16-2016 01:23 PM

It's hard to know what will be optimized, beforehand. Try :

DELETE FROM HUGE as H
WHERE exists (select K1, K2 from SMALL where K1=H.K1 and K2=H.K2);

PG

LinusH · Posted 06-17-2016 11:50 AM

Hey @KachiM, because hash methods can be quick in some situations sue to in memory techniques, it doesn't mean that us suitable for any problem.
The trick is not to access the whole table, that's why we invented indexes.
I blame @Ksharp for this overuse of hashing 😆

Data never sleeps

KachiM · Posted 06-17-2016 01:17 PM

Hi LinusH:

I do not understand your comment.

To me, Hash, Array or any other SAS way to solve a problem hinges on either minimizing run-time or memory or both. The data type of K1 and K2 is not known, I used Hash Object. If they were NUMBERs, I would have solved the problem by using Array.

I will appreciate the superiority of a method must be empirically verified before offering comments.

Regards

PGStats · Posted 06-17-2016 01:59 PM

It's not always about minimizing "run-time or memory or both" but often about minimysing data transfer times. A hash solution will always require the transfer of whole table(s) into memory, and that can take a long time. Which is what @LinusH was talking about. Optimal procedures will be the best compromise within the constraints set by CPU, memory and network resources.

PG

KachiM · Posted 06-17-2016 02:09 PM

I must have mentioned I/O time in addtion to Run-time.

Ksharp · Posted 06-18-2016 02:53 AM

@LinusH
I am enjoining use Hash Table. It seems it is my last resort when I face a tough question and don't know what to do.
I think index should be retired .   *^_^*

LinusH · Posted 06-17-2016 12:13 PM

Usually in data warehouse you don't delete any records, just mark certain records as not valid any more, perhaps using end dates.

That said, an alternative to SQL DELETE FROM (if you can't get the speed up in that scenario) you should be able to use MODIFY with REMOVE in a data step. I never used it but should be quite fast since it will use your index.

Data never sleeps

ChrisNZ · Posted 06-19-2016 08:14 PM

As mentioned (sorry if I repeat here some of the valid points made before), the key for speed for this type of operation is not to read the whole table and write it out again.

So no

proc sql ; create ...

or

data HUGE; set/merge HUGE;

Instead modifying the table in place is the way to go.

The data step's modify statement and proc sql's delete statements are prime candidates for this. They allow "soft deletion" where the observation is still there (hence no rewritting the whole data set) but marked as deleted. proc contents shows the deleted observations.

Sadly, proc sql doesn't use the index for the syntax that @PGStats proposes (I can't find a better way, any other proposal?), but the modify statement does and hence is very fast.

data HUGE(index=(A=(I J))) ;
  array A [50];
  do I=1 to 50e6; 
    J=I;
    output;
  end;
run;
data SMALL;
  do I=1 to 50e6; 
    J=I;
    if ranuni(0)> .9999 then output;
  end;
run;

options msglevel=i;

%* MODIFY statement: 0.5 seconds;
data HUGE;
  set SMALL ;
  modify HUGE key=A;
  if _iorc_=%sysrc(_sok) then remove;
  else _error_=0;
run;
proc contents data=HUGE; 
run;

%* MERGE statement: 110 seconds;
data HUGE;
  merge HUGE SMALL(in=BAD);
  by I J;
  if not BAD;
run;
 
%* DELETE SQL statement: hours?;
proc sql;
   delete from HUGE as H
   where exists (select I, J from SMALL where I=H.I and J=H.J);
quit;

proc contents data=HUGE; 
run;

More benchmarks and discussions about performance in https://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490

High-Performance SAS Coding - Third Edition

PGStats · Posted 06-19-2016 11:31 PM

If table HUGE has a single field UNIQUE index, it seems you can give proc SQL a decent performance by translating your two-fields key into the single key and use IN instead of EXISTS. The only reason I see for going to the trouble of doing this is the rollback ability of SQL which isn't provided by datastep MERGE or MODIFY operations. I modified @ChrisNZ test code as follows

options msglevel=i;

data HUGE(index=(A=(I J) ID/unique)) ;
  array A [50];
  do I=1 to 10e6; 
    J=I;
    ID + 1;
    output;
  end;
run;

data SMALL;
  do I=1 to 10e6; 
    J=I;
    if ranuni(0)> .9999 then output;
  end;
run;

proc sql /* undo_policy=none */ ;
/* Optimized with index A */
create table IDS as 
select H.ID 
from HUGE as H inner join SMALL as S on H.i=S.i and H.j=S.j; 
/* No mention of index ID, but runs in a couple of minutes */
delete from HUGE as H
where ID in (select ID from IDS);
quit;

in my tests, adding undo_policy=none didn't improve performance.

PG

ChrisNZ · Posted 06-20-2016 06:10 PM

@PGStats SAS did not use the ID index in your example.

Even using the idxwhere= option doesn't trigger the index.

A fixed list does though. I reported this issue long ago and it seems it is still here.

data HUGE1(index=(A=(I J) ID/unique)) 
     HUGE2(index=(A=(I J) ID/unique)) 
     SMALL(keep=I J);
  array A [50];
  do I=1 to 10e6; 
    J=I;
    ID + 1;
    if ranuni(0)> .9999 then output SMALL;
    output HUGE1 HUGE2;
  end;
run;

proc sql noprint; /* Optimized with index A */
create table IDS as 
  select H.ID 
  from HUGE1 as H inner join SMALL as S 
  on H.i=S.i and H.j=S.j; 
select ID into :ids separated by ',' from IDS;
quit;

proc sql;  /* No index used: 11 seconds */
delete from HUGE1(idxwhere=yes) where ID in (select ID from IDS);
quit;

proc sql; /* index ID used: 0.2 seconds */
delete from HUGE2 where ID in (&IDS);
quit;

The modify statement looks like the easiest way to speed up this job.

@Edoedoedo Did this work for you?

High-Performance SAS Coding - Third Edition

Edoedoedo · Posted 06-22-2016 04:32 AM

Hi guys,

I tried all your suggestions, here are the results. (HUGE is 70GB, SMALL is 10MB)

HASH solution, by @KachiM: amazing, it does all the job in 7 minutes, including CPU time and I/O time (writing 70GB on disk takes time, just doing "cp HUGE HUGE2" takes 3 minutes)
Pro: fastest solution
Cons: since it creates a new table, it wastes I/O time, needs double disk space, and needs to recreate the index after the operation
Comments: I didn't know hashes, so I read about them on documentation; in your opinion, do you think that loading both tables in memory would speed up more? Are there more sophisticated strategies to increase performance with hashes? Moreover, what happens if the system does not have enough ram to hold the tables, it just slows down or sas crashes?
DELETE solution, by @PGStats: thank you for the correct syntax, it will be surely useful! However after 1 hour I aborted it.
Pro: Rollback available, no double space, no need to recreate the index
Cons: too much time
Comments: too much time for production environment
MODIFY solution, by @ChrisNZ: quite good solution, it takes about 20 minutes to complete
Pro: no double space, no need to recreate the index, quite fast
Cons: not as fast as HASH

MODIFY with HASH solution, by Me:

data HUGE;
    modify HUGE;
    if _n_ = 1 then do;
        declare hash h(dataset:'SMALL');
        h.definekey('K1','K2');
        h.definedone();
    end;
    if h.find() ^= 0 then remove; 
run;

I taught I had a good idea mixing your solution, however after 1 hour I aborted it

So I think I'll use the first solution, any other comments and suggestions are wery welcome!

Thanks a lot

Regards

Ksharp · Posted 06-22-2016 04:55 AM

It is best position for PROC DS2.

data have;
 set sashelp.class;
run;


proc ds2;
data Male(overwrite=yes);
 method run();
  set have;
  if sex eq 'M' then output;
 end;
enddata;
run;
quit;

Edoedoedo · Posted 06-22-2016 06:34 AM

I didn't know DS2 either, SAS is full of surprises 🙂
@Ksharp could you show me how would you solve the problem using DS2?

Thanks!

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away