Hello,
I have a data set with about 5 million observations. This table contains an index with several columns, including one column for the year and one column for the country. I want to delete all observations of a certain country for a certain year from this table.
if country = 'NZ' & year = 2018 then delete;
Is there an efficient way to do this that makes use of the indexed data? So far I have tried the following two solutions, both of which have an excessively long runtime of about 30 minutes.
proc sql;
delete from mydata
where country = 'NZ' & year = 2018;
quit;

data mydata;
modify mydata;
if country = 'NZ' & year = 2018 then remove;
run;
In case it matters: the table is sorted by the country and year columns.
I want to avoid writing a data set where the index is simply recreated. But perhaps it is the most efficient solution?
Thanks for your help.
Hi,
I would suggest creating a new data set whenever you manipulate data. I always prefer to retain the original data and keep subsets of the data in new temporary tables.
Thanks,
Vishnu
Let's face it, Base SAS isn't that good when it comes to deletes in place with index updates in parallel.
So, often it's more efficient to replace the table.
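A minimal sketch of that replace-the-table approach, assuming a composite index on COUNTRY and YEAR (the index name IDX is a placeholder for whatever the OP's actual index is called): rewriting the table with a subsetting WHERE drops the unwanted rows and rebuilds the index in one sequential pass.

```sas
/* Rewrite the table, keeping everything except the rows to delete.    */
/* The INDEX= data set option rebuilds the composite index as part of  */
/* the same step. IDX is a hypothetical index name.                    */
data mydata (index=(idx=(country year)));
set mydata;
where not (country='NZ' and year=2018);
run;
```

One sequential read plus one sequential write is often cheaper than millions of in-place deletes that each have to maintain the index.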
SAS/SPDS has a fast-path delete, and I assume it performs fairly well compared to the Base/SPDE engines. But I guess you don't have that?
I suggest you compare a DELETE with indexes versus a DELETE without indexes. I don't think IF statements use indexes, so I doubt it will be faster with indexes; probably slower, as the indexes have to be updated for the deletions as well. I suspect an IF with DELETE and no indexes is quicker.
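A rough way to run that comparison, assuming you make two working copies of the table (the data set names here are placeholders):

```sas
/* Copy 1: indexed. The DELETE can use the index to locate rows, but  */
/* must also maintain the index for every row it removes.             */
options msglevel=I fullstimer;  /* log index usage and resource stats */
proc sql;
delete from copy_indexed
where country='NZ' and year=2018;
quit;

/* Copy 2: no index. The same DELETE scans the table sequentially,    */
/* but there is no index to maintain.                                 */
proc sql;
delete from copy_plain
where country='NZ' and year=2018;
quit;
```

Comparing the two logs (real time, CPU, and whether the index was actually used) tells you which cost dominates for your data.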
The MODIFY statement is usually inefficient.
Try this:
data LOOKUP;
retain COUNTRY 'NZ' YEAR 2018 ;
run;
data MAIN;
update MAIN LOOKUP(in=DEL);
by COUNTRY YEAR;
if DEL then delete;
run;
Or this:
data LOOKUP;
set MAIN(obs=0);
run;
data MAIN;
update MAIN LOOKUP;
by COUNTRY YEAR;
if COUNTRY = 'NZ' & YEAR = 2018 then delete;
run;
Is it a composite index, as here? Then just use a MODIFY, followed by a WHERE, followed by a REMOVE. The index will be used, presumably saving a good deal of input operations:
data have (index=(na=(name age)));
set sashelp.class;
do i=1 to 100*_n_;output;end;
run;
options msglevel=I;
data have;
modify have;
where name='John' and age=12;
remove;
run;
If you just have two simple indexes, choose one of them (the more discriminating one) for the WHERE statement, followed by an
IF... THEN REMOVE statement conditioned on the other variable.
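A sketch of that two-simple-index variant, assuming separate simple indexes on COUNTRY and YEAR and that COUNTRY is the more discriminating one here (swap the roles if YEAR selects fewer rows in your data):

```sas
options msglevel=I;  /* check the log to confirm the index is used */
data mydata;
modify mydata;
where country='NZ';            /* driven by the simple index on COUNTRY */
if year=2018 then remove;      /* sequential filter on the other key    */
run;
```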
The MSGLEVEL=I option tells SAS to report (among other things) when it is using a data set index.
UPDATE can be much faster than MODIFY, depending on the proportion of the table that is deleted.
If we delete 10% of the table, the MODIFY step takes 3.8s on my machine:
data HAVE (sortedby=I index=(I));
do I=1 to 5e6; output; end;
run;
data HAVE;
modify HAVE;
where I<5e5;
remove;
run;
while the UPDATE step takes 0.8s:
data HAVE (sortedby=I index=(I));
do I=1 to 5e6; output; end;
run;
data LOOKUP;
set HAVE(obs=0);
run;
data HAVE;
update HAVE LOOKUP;
by I;
if I<5e5 then delete;
run;
If the proportion of the table to delete is small (or if the key is not unique), MODIFY becomes preferable *provided an index is available*.
I agree that proportion to be deleted impacts the performance of the MODIFY technique I suggested.
But when comparing it to the UPDATE technique, the width of the records in dataset HAVE matters just as much, and at some point becomes the dominant factor in the MODIFY vs UPDATE contrast.
For instance, I added a single $200 satellite variable to dataset HAVE and got 20.2 seconds for the UPDATE and 4.9 seconds for the MODIFY.
data HAVE (sortedby=I index=(I));
length x $200; retain x ' ';
do I=1 to 5e6;output;end;
run;
data lookup ; set have (obs=0); run;
data HAVE (sortedby=I index=(i));
update HAVE lookup;
by I;
if I<5e5 then delete;
run;
data HAVE (sortedby=I index=(I));
length x $200; retain x ' ';
do I=1 to 5e6;output;end;
run;
data HAVE;
modify HAVE;
where I<5e5;
remove;
run;
Also, while I agree that when an index is available and the delete proportion is small, MODIFY with WHERE is the way to go, I don't agree that the presence of an index is required to choose MODIFY over UPDATE. The program below, using MODIFY without the benefit of the index against the same dataset as above, took 6 seconds on my machine.
data HAVE (sortedby=I index=(I));
length x $200; retain x ' ';
do I=1 to 5e6;output;end;
run;
data have;
modify have;
if I<5e5 then remove;
run;
Proportion matters, but so does (horizontal) size.
Edited addition: The MODIFY statement preserves the index on HAVE. You could improve the performance of the UPDATE test I ran by dropping the index creation, but I suspect the OP would want to preserve the index.
@mkeintz I can't disagree with your conclusion!
Yes indeed, index performance does not deteriorate with width, only with height, whereas a join BY is very much impacted by width.
Very interesting. Thank you for your input.
The reason I ruled out using MODIFY for deletions is that, in my experience, UPDATE is faster when a lookup table is used, rather than a static value like we have here.
Do you have a different experience?
data MAIN1(index=(I)) MAIN2 UPD;
do I=1 to 1e7;
output MAIN1 MAIN2;
if ranuni(1)>.9 then output UPD;
end;
run;

* hours? ;
data MAIN;
modify MAIN UPD;
by I;
if _IORC_=0 then remove;
else output;
run;

* 5 seconds;
data MAIN1;
set UPD;
modify MAIN1 key=I;
if _IORC_= %sysrc(_sok) then remove;
else _ERROR_=0;
run;

* 2 seconds;
data MAIN2;
update MAIN2 UPD(in=UPD);
by I;
if UPD then delete;
run;
@mkeintz I answered my own question.
If the table widens (add a variable with length $200) or if the number of updates reduces (replace .9 with .99), MODIFY becomes faster.
As usual... it depends... 🙂
MODIFY BY is never even in the race though. Only MODIFY KEY= should be used, if an index is present.
Yes, MODIFY with BY is exceedingly slow, and I don't see why that has to be the case. It eliminates a natural option for table-driven record deletion/update that avoids rewriting the complete original data set.
But you don't have to abandon modify to preserve performance while using table driven deletes. You just have to abandon use of BY. I would suggest a hash table, as in:
data HAVE
deletes (keep=I);
length x $200; retain x ' ';
do I=1 to 5e6;
output have;
if mod(i,100)=0 then output deletes;
end;
run;
data have;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have ;
if h.find()=0 then remove;
run;
The advantage, once again, is that this minimizes disk write activity, which becomes more relevant with wide records.
Caveat: The above removes ALL records that match those in DELETES. If the MODIFY with BY approach were fast enough to use, it would remove only the first record matching each one in DELETES, as does your UPDATE example.
To exactly replicate the remove-only-first property, remove an item from the hash object every time the corresponding record is removed from the dataset:
data have;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have ;
if h.find()=0 then do;
remove;
h.remove();
end;
run;
@mkeintz Hash tables have never proven useful to me for this.
Even if you consider the time to build an index, and if you use the check() method, hash is still slower.
data have1 have2
deletes (keep=I);
length x $200; retain x ' ';
do I=1 to 5e6;
output have1 have2;
if mod(i,100)=0 then output deletes;
end;
run;
data have1;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have1 ;
if h.check()=0 then remove;
run;
proc sql;
create index I on have2;
data have2;
set deletes;
modify have2 key=I;
if _IORC_= %sysrc(_sok) then remove; else _ERROR_=0;
run;
Hash: real time 4.3 seconds
Index+key=: 1.4 + 0.7 = 2.3 seconds
Your example does not demonstrate a hash-table-versus-index contrast. What it does show (when deleting a uniformly distributed 1% of the cases) is that a MODIFY sequentially accessing the entire HAVE is slower than a table-driven direct-access MODIFY.
Having said that, I agree that hashing the deletes table can be slower than sequentially reading it as a driver for modify.
Your example is HAVE2 below, while the proper hash analog is HAVE3, both of which are faster than my non-indexed original example (HAVE1):
data HAVE1 HAVE2 HAVE3
deletes (keep=I);
length x $200; retain x ' ';
do I=1 to 5e6;
output have1 have2 have3;
if mod(i,100)=0 then output deletes;
end;
run;
data have1;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have1 ;
if h.find()=0 then remove;
run;
proc sql;
create index I on have2;
data have2;
set deletes;
modify have2 key=I;
if _IORC_= %sysrc(_sok) then remove; else _ERROR_=0;
run;
proc sql;
create index I on have3;
data have3;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
declare hiter hi ('h');
do rc=hi.first() while (rc=0);
modify have3 key=I;
if _IORC_= %sysrc(_sok) then remove; else _ERROR_=0;
rc=hi.next();
end;
stop;
run;
Edited correction: This code only deletes one record, as noted by @ChrisNZ, because it doesn't iterate.
do rc=hi.first() while (rc=0);
modify have3 key=I;
if _IORC_=%sysrc(_sok) then remove; else _ERROR_=0;
rc=hi.next();
end;
It should have a "by 0" in the DO statement to permit iteration and deletion of all the records:
do rc=hi.first() by 0 while (rc=0);
modify have3 key=I;
if _IORC_=%sysrc(_sok) then remove; else _ERROR_=0;
rc=hi.next();
end;