topic Re: Fastest way to delete rows from a dataset by key in SAS Programming

Fastest way to delete rows from a dataset by key

Edoedoedo — Thu, 16 Jun 2016 15:33:03 GMT

Hi,

I have a huge dataset (let's name it HUGE). HUGE has:

- 500 million rows

- a composite unique primary key, let's name it (K1,K2) (two columns belong to the key)

- some other columns in addition to the key

Next I have a small dataset (let's name it SMALL). SMALL has:

- 200k rows

- exactly the same composite unique primary key (K1,K2)

- no more columns in addition to the key.

I.e.:

- SMALL: K1 | K2, UPK(K1,K2)

I need to delete from HUGE all the rows whose key appears in SMALL, so that after this process HUGE and SMALL do not share any common key.

I first tried with PROC SQL and DELETE FROM, but I got crazy about how to use the right syntax... on a dbms I'd have done:

DELETE FROM HUGE WHERE (K1,K2) NOT IN (SELECT K1,K2 FROM SMALL)

but SAS does not allow this syntax.

Next I tried with this:

data HUGE;
merge HUGE(in=in1) SMALL(in=in2); by K1 K2;
if (in1 and in2) then delete;
run;

This works, but it takes almost a day to complete due to HUGE size.

What would you recommend?

Thanks a lot

Regards

Re: Fastest way to delete rows from a dataset by key

KachiM — Thu, 16 Jun 2016 16:08:43 GMT

Try a Hash Solution. The SMALL goes to Hash table. The HUGE can be processed one record after another. The Matchin record will be ignored.

data want;
if _n_ = 1 then do;
if 0 then set SMALL;
declare hash h(dataset:'SMALL');
h.definekey('K1', 'K2');
h.definedone();
end;
set HUGE;
if h.find() ^= 0;
run;

Re: Fastest way to delete rows from a dataset by key

PGStats — Thu, 16 Jun 2016 17:23:39 GMT

It's hard to know what will be optimized, beforehand. Try :

DELETE FROM HUGE as H
WHERE exists (select K1, K2 from SMALL where K1=H.K1 and K2=H.K2);

Re: Fastest way to delete rows from a dataset by key

LinusH — Fri, 17 Jun 2016 15:50:43 GMT

Hey @KachiM, because hash methods can be quick in some situations sue to in memory techniques, it doesn't mean that us suitable for any problem.
The trick is not to access the whole table, that's why we invented indexes.
I blame @Ksharp for this overuse of hashing 😆

Re: Fastest way to delete rows from a dataset by key

LinusH — Fri, 17 Jun 2016 16:13:49 GMT

Usually in data warehouse you don't delete any records, just mark certain records as not valid any more, perhaps using end dates.

That said, an alternative to SQL DELETE FROM (if you can't get the speed up in that scenario) you should be able to use MODIFY with REMOVE in a data step. I never used it but should be quite fast since it will use your index.

Re: Fastest way to delete rows from a dataset by key

KachiM — Fri, 17 Jun 2016 17:17:58 GMT

Hi LinusH:

I do not understand your comment.

To me, Hash, Array or any other SAS way to solve a problem hinges on either minimizing run-time or memory or both. The data type of K1 and K2 is not known, I used Hash Object. If they were NUMBERs, I would have solved the problem by using Array.

I will appreciate the superiority of a method must be empirically verified before offering comments.

Regards

Re: Fastest way to delete rows from a dataset by key

PGStats — Fri, 17 Jun 2016 17:59:13 GMT

It's not always about minimizing "run-time or memory or both" but often about minimysing data transfer times. A hash solution will always require the transfer of whole table(s) into memory, and that can take a long time. Which is what @LinusH was talking about. Optimal procedures will be the best compromise within the constraints set by CPU, memory and network resources.

Re: Fastest way to delete rows from a dataset by key

KachiM — Fri, 17 Jun 2016 18:09:23 GMT

I must have mentioned I/O time in addtion to Run-time.

Re: Fastest way to delete rows from a dataset by key

Ksharp — Sat, 18 Jun 2016 06:53:39 GMT

@LinusH
I am enjoining use Hash Table. It seems it is my last resort when I face a tough question and don't know what to do.
I think index should be retired .   *^_^*

Re: Fastest way to delete rows from a dataset by key

ChrisNZ — Mon, 20 Jun 2016 00:14:34 GMT

As mentioned (sorry if I repeat here some of the valid points made before), the key for speed for this type of operation is not to read the whole table and write it out again.

So no

proc sql ; create ...

data HUGE; set/merge HUGE;

Instead modifying the table in place is the way to go.

The data step's modify statement and proc sql's delete statements are prime candidates for this. They allow "soft deletion" where the observation is still there (hence no rewritting the whole data set) but marked as deleted. proc contents shows the deleted observations.

Sadly, proc sql doesn't use the index for the syntax that @PGStats proposes (I can't find a better way, any other proposal?), but the modify statement does and hence is very fast.

data HUGE(index=(A=(I J))) ;
  array A [50];
  do I=1 to 50e6; 
    J=I;
    output;
  end;
run;
data SMALL;
  do I=1 to 50e6; 
    J=I;
    if ranuni(0)> .9999 then output;
  end;
run;

options msglevel=i;

%* MODIFY statement: 0.5 seconds;
data HUGE;
  set SMALL ;
  modify HUGE key=A;
  if _iorc_=%sysrc(_sok) then remove;
  else _error_=0;
run;
proc contents data=HUGE; 
run;

%* MERGE statement: 110 seconds;
data HUGE;
  merge HUGE SMALL(in=BAD);
  by I J;
  if not BAD;
run;
 
%* DELETE SQL statement: hours?;
proc sql;
   delete from HUGE as H
   where exists (select I, J from SMALL where I=H.I and J=H.J);
quit;

proc contents data=HUGE; 
run;

More benchmarks and discussions about performance in https://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490

Re: Fastest way to delete rows from a dataset by key

PGStats — Mon, 20 Jun 2016 03:31:36 GMT

If table HUGE has a single field UNIQUE index, it seems you can give proc SQL a decent performance by translating your two-fields key into the single key and use IN instead of EXISTS. The only reason I see for going to the trouble of doing this is the rollback ability of SQL which isn't provided by datastep MERGE or MODIFY operations. I modified @ChrisNZ test code as follows

options msglevel=i;

data HUGE(index=(A=(I J) ID/unique)) ;
  array A [50];
  do I=1 to 10e6; 
    J=I;
    ID + 1;
    output;
  end;
run;

data SMALL;
  do I=1 to 10e6; 
    J=I;
    if ranuni(0)> .9999 then output;
  end;
run;

proc sql /* undo_policy=none */ ;
/* Optimized with index A */
create table IDS as 
select H.ID 
from HUGE as H inner join SMALL as S on H.i=S.i and H.j=S.j; 
/* No mention of index ID, but runs in a couple of minutes */
delete from HUGE as H
where ID in (select ID from IDS);
quit;

in my tests, adding undo_policy=none didn't improve performance.

Re: Fastest way to delete rows from a dataset by key

ChrisNZ — Mon, 20 Jun 2016 22:13:41 GMT

@PGStats SAS did not use the ID index in your example.

Even using the idxwhere= option doesn't trigger the index.

A fixed list does though. I reported this issue long ago and it seems it is still here.

data HUGE1(index=(A=(I J) ID/unique)) 
     HUGE2(index=(A=(I J) ID/unique)) 
     SMALL(keep=I J);
  array A [50];
  do I=1 to 10e6; 
    J=I;
    ID + 1;
    if ranuni(0)> .9999 then output SMALL;
    output HUGE1 HUGE2;
  end;
run;

proc sql noprint; /* Optimized with index A */
create table IDS as 
  select H.ID 
  from HUGE1 as H inner join SMALL as S 
  on H.i=S.i and H.j=S.j; 
select ID into :ids separated by ',' from IDS;
quit;

proc sql;  /* No index used: 11 seconds */
delete from HUGE1(idxwhere=yes) where ID in (select ID from IDS);
quit;

proc sql; /* index ID used: 0.2 seconds */
delete from HUGE2 where ID in (&IDS);
quit;

The modify statement looks like the easiest way to speed up this job.

@Edoedoedo Did this work for you?

Re: Fastest way to delete rows from a dataset by key

Edoedoedo — Wed, 22 Jun 2016 08:32:41 GMT

Hi guys,

I tried all your suggestions, here are the results. (HUGE is 70GB, SMALL is 10MB)

HASH solution, by @KachiM: amazing, it does all the job in 7 minutes, including CPU time and I/O time (writing 70GB on disk takes time, just doing "cp HUGE HUGE2" takes 3 minutes)
Pro: fastest solution
Cons: since it creates a new table, it wastes I/O time, needs double disk space, and needs to recreate the index after the operation
Comments: I didn't know hashes, so I read about them on documentation; in your opinion, do you think that loading both tables in memory would speed up more? Are there more sophisticated strategies to increase performance with hashes? Moreover, what happens if the system does not have enough ram to hold the tables, it just slows down or sas crashes?
DELETE solution, by @PGStats: thank you for the correct syntax, it will be surely useful! However after 1 hour I aborted it.
Pro: Rollback available, no double space, no need to recreate the index
Cons: too much time
Comments: too much time for production environment
MODIFY solution, by @ChrisNZ: quite good solution, it takes about 20 minutes to complete
Pro: no double space, no need to recreate the index, quite fast
Cons: not as fast as HASH

MODIFY with HASH solution, by Me:

data HUGE;
    modify HUGE;
    if _n_ = 1 then do;
        declare hash h(dataset:'SMALL');
        h.definekey('K1','K2');
        h.definedone();
    end;
    if h.find() ^= 0 then remove; 
run;

I taught I had a good idea mixing your solution, however after 1 hour I aborted it

So I think I'll use the first solution, any other comments and suggestions are wery welcome!

Thanks a lot

Regards

Re: Fastest way to delete rows from a dataset by key

ChrisNZ — Wed, 22 Jun 2016 08:49:35 GMT

Use check() instead of find(). But I doubt it will make a huge difference.
Sadly or its not rare that sas is much faster copying a whole table than updating it in place. I guess it depends on the proportion of records to update.
Well done!

Re: Fastest way to delete rows from a dataset by key

ChrisNZ — Wed, 22 Jun 2016 08:52:40 GMT

Also did you count the time recreating the index?

Re: Fastest way to delete rows from a dataset by key

Ksharp — Wed, 22 Jun 2016 08:55:46 GMT

It is best position for PROC DS2.

data have;
 set sashelp.class;
run;


proc ds2;
data Male(overwrite=yes);
 method run();
  set have;
  if sex eq 'M' then output;
 end;
enddata;
run;
quit;

Re: Fastest way to delete rows from a dataset by key

Edoedoedo — Wed, 22 Jun 2016 08:56:52 GMT

Yes, approximately 5 minutes

Re: Fastest way to delete rows from a dataset by key

Ksharp — Wed, 22 Jun 2016 08:58:40 GMT

Use RETURN instead of DELETE of data step.

data have;
 set sashelp.class;
run;


proc ds2;
data Non_male(overwrite=yes);
 method run();
  set have;
  if sex eq 'M' then return;
 end;
enddata;
run;
quit;

Re: Fastest way to delete rows from a dataset by key

ChrisNZ — Wed, 22 Jun 2016 09:10:25 GMT

Since you are now doing a sequential copy (at which sas excels) read my comments in https://communities.sas.com/t5/Base-SAS-Programming/How-to-speed-up-performance-of-Sas-9-4-TS1M3-through-enterprise/m-p/276653/

Re: Fastest way to delete rows from a dataset by key

ChrisNZ — Wed, 22 Jun 2016 09:01:56 GMT

So 5 minutes out of 7 are spent recreating the index? really?