Hello,
I have a data set with about 5 million observations. This table contains an index with several columns, including one column for the year and one column for the country. I want to delete all observations of a certain country for a certain year from this table.
if country = 'NZ' & year = 2018 then delete;
Is there an efficient way to do this that makes use of the indexed data? So far I have tried the following two solutions, both of which have an excessively long runtime of about 30 minutes.
proc sql;
delete from mydata
where country = 'NZ' & year = 2018;
quit;

data mydata;
modify mydata;
if country = 'NZ' & year = 2018 then remove;
run;
In case it matters: the table is sorted by the country and year columns.
I want to avoid writing a data set where the index is simply recreated. But perhaps it is the most efficient solution?
Thanks for your help.
Hi,
I would suggest creating a new data set whenever you manipulate data. I always prefer to retain the original data and keep subsets of the data in new temporary tables.
Thanks,
Vishnu
Let's face it, Base SAS isn't that good when it comes to deletes in place with index updates in parallel.
So, often it's more efficient to replace the table.
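A minimal sketch of that replace-the-table approach, assuming a composite index on COUNTRY and YEAR (the index name IDX is a placeholder for whatever the OP's actual index is called): rewriting the table with a subsetting WHERE drops the unwanted rows and rebuilds the index in one sequential pass.

```sas
/* Rewrite the table, keeping everything except the rows to delete.    */
/* The INDEX= data set option rebuilds the composite index as part of  */
/* the same step. IDX is a hypothetical index name.                    */
data mydata (index=(idx=(country year)));
set mydata;
where not (country='NZ' and year=2018);
run;
```

One sequential read plus one sequential write is often cheaper than millions of in-place deletes that each have to maintain the index.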
SAS/SPDS has a fast-path delete, and I assume it performs fairly well compared to the Base/SPDE engines. But I guess you don't have that?
I suggest you compare a DELETE with indexes versus a DELETE without indexes. I don't think IF statements use indexes, so I doubt it will be faster with indexes; probably slower, as the indexes have to be updated for the deletions as well. I suspect an IF with DELETE and no indexes is quicker.
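A rough way to run that comparison, assuming you make two working copies of the table (the data set names here are placeholders):

```sas
/* Copy 1: indexed. The DELETE can use the index to locate rows, but  */
/* must also maintain the index for every row it removes.             */
options msglevel=I fullstimer;  /* log index usage and resource stats */
proc sql;
delete from copy_indexed
where country='NZ' and year=2018;
quit;

/* Copy 2: no index. The same DELETE scans the table sequentially,    */
/* but there is no index to maintain.                                 */
proc sql;
delete from copy_plain
where country='NZ' and year=2018;
quit;
```

Comparing the two logs (real time, CPU, and whether the index was actually used) tells you which cost dominates for your data.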
The MODIFY statement is usually inefficient.
Try this:
data LOOKUP;
retain COUNTRY 'NZ' YEAR 2018 ;
run;
data MAIN;
update MAIN LOOKUP(in=DEL);
by COUNTRY YEAR;
if DEL then delete;
run;
Or this:
data LOOKUP;
set MAIN(obs=0);
run;
data MAIN;
update MAIN LOOKUP;
by COUNTRY YEAR;
if COUNTRY = 'NZ' & YEAR = 2018 then delete;
run;
Is it a composite index, as here? Then just use a MODIFY, followed by a WHERE, followed by a REMOVE. The index will be used, presumably saving a good deal of input operations:
data have (index=(na=(name age)));
set sashelp.class;
do i=1 to 100*_n_;output;end;
run;
options msglevel=I;
data have;
modify have;
where name='John' and age=12;
remove;
run;
If you just have two simple indexes, choose one of them (the more discriminating one) for the WHERE statement, followed by an
IF... THEN REMOVE statement conditioned on the other variable.
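A sketch of that two-simple-index variant, assuming separate simple indexes on COUNTRY and YEAR and that COUNTRY is the more discriminating one here (swap the roles if YEAR selects fewer rows in your data):

```sas
options msglevel=I;  /* check the log to confirm the index is used */
data mydata;
modify mydata;
where country='NZ';            /* driven by the simple index on COUNTRY */
if year=2018 then remove;      /* sequential filter on the other key    */
run;
```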
The MSGLEVEL=I option tells SAS to report (among other things) when it is using a data set index.
UPDATE can be much faster than MODIFY, depending on the proportion of the table that is deleted.
If we delete 10% of the table, the MODIFY step takes 3.8s on my machine:
data HAVE (sortedby=I index=(I));
do I=1 to 5e6; output; end;
run;
data HAVE;
modify HAVE;
where I<5e5;
remove;
run;
while the UPDATE step takes 0.8s:
data HAVE (sortedby=I index=(I));
do I=1 to 5e6; output; end;
run;
data LOOKUP;
set HAVE(obs=0);
run;
data HAVE;
update HAVE LOOKUP;
by I;
if I<5e5 then delete;
run;
If the proportion of the table to delete is small (or if the key is not unique), MODIFY becomes preferable *provided an index is available*.
I agree that proportion to be deleted impacts the performance of the MODIFY technique I suggested.
But when comparing it to the UPDATE technique, the width of the records in dataset HAVE matters just as much, and at some point becomes the dominant factor in the MODIFY vs UPDATE contrast.
For instance, I added a single $200 satellite variable to dataset HAVE and got 20.2 seconds for the UPDATE and 4.9 seconds for the MODIFY.
data HAVE (sortedby=I index=(I));
length x $200; retain x ' ';
do I=1 to 5e6;output;end;
run;
data lookup ; set have (obs=0); run;
data HAVE (sortedby=I index=(i));
update HAVE lookup;
by I;
if I<5e5 then delete;
run;
data HAVE (sortedby=I index=(I));
length x $200; retain x ' ';
do I=1 to 5e6;output;end;
run;
data HAVE;
modify HAVE;
where I<5e5;
remove;
run;
Also, while I agree that when an index is available and the delete proportion is small, MODIFY with WHERE is the way to go, I don't agree that the presence of an index is required to choose MODIFY over UPDATE. The program below, using MODIFY without the benefit of the index against the same dataset as above, took 6 seconds on my machine.
data HAVE (sortedby=I index=(I));
length x $200; retain x ' ';
do I=1 to 5e6;output;end;
run;
data have;
modify have;
if I<5e5 then remove;
run;
Proportion matters, but so does (horizontal) size.
Edited addition: The MODIFY statement preserves the index on HAVE. You could improve the performance of the UPDATE test I ran by dropping the index creation, but I suspect the OP would want to preserve the index.
@mkeintz I can't disagree with your conclusion!
Yes indeed, index performance does not deteriorate with width, only with height, whereas a join BY is very much impacted by width.
Very interesting. Thank you for your input.
The reason I ruled out using MODIFY for deletions is that, in my experience, UPDATE is faster when a lookup table is used, rather than a static value like we have here.
Do you have a different experience?
data MAIN1(index=(I)) MAIN2 UPD;
do I=1 to 1e7;
output MAIN1 MAIN2;
if ranuni(1)>.9 then output UPD;
end;
run;

* hours? ;
data MAIN;
modify MAIN UPD;
by I;
if _IORC_=0 then remove;
else output;
run;

* 5 seconds;
data MAIN1;
set UPD;
modify MAIN1 key=I;
if _IORC_= %sysrc(_sok) then remove;
else _ERROR_=0;
run;

* 2 seconds;
data MAIN2;
update MAIN2 UPD(in=UPD);
by I;
if UPD then delete;
run;
@mkeintz I answered my own question.
If the table widens (add a variable with length $200) or if the number of updates reduces (replace .9 with .99), MODIFY becomes faster.
As usual... it depends... 🙂
MODIFY BY is never even in the race though. Only MODIFY KEY= should be used, if an index is present.
Yes, MODIFY with BY is exceedingly slow, and I don't see why that has to be the case. It eliminates a natural option for table-driven record deletion/update that avoids rewriting the complete original data set.
But you don't have to abandon modify to preserve performance while using table driven deletes. You just have to abandon use of BY. I would suggest a hash table, as in:
data HAVE
deletes (keep=I);
length x $200; retain x ' ';
do I=1 to 5e6;
output have;
if mod(i,100)=0 then output deletes;
end;
run;
data have;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have ;
if h.find()=0 then remove;
run;
The advantage, once again, is that this minimizes disk write activity, which becomes more relevant with wide records.
Caveat: The above removes ALL records that match those in DELETES. If the MODIFY with BY approach were fast enough to use, it would remove only the first record matching each one in DELETES, as does your UPDATE example.
To exactly replicate the remove-only-first property, remove an item from the hash object every time the corresponding record is removed from the dataset:
data have;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have ;
if h.find()=0 then do;
remove;
h.remove();
end;
run;
@mkeintz Hash tables have never proven useful to me for this.
Even if you consider the time to build an index, and if you use the check() method, hash is still slower.
data have1 have2
deletes (keep=I);
length x $200; retain x ' ';
do I=1 to 5e6;
output have1 have2;
if mod(i,100)=0 then output deletes;
end;
run;
data have1;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have1 ;
if h.check()=0 then remove;
run;
proc sql;
create index I on have2;
data have2;
set deletes;
modify have2 key=I;
if _IORC_= %sysrc(_sok) then remove; else _ERROR_=0;
run;
Hash: real time 4.3 seconds
Index+key=: 1.4 + 0.7 = 2.3 seconds
Your example does not demonstrate a hash-table-versus-index contrast. What it does show (when deleting a uniformly distributed 1% of the cases) is that a MODIFY sequentially accessing the entire HAVE is slower than a table-driven direct-access MODIFY.
Having said that, I agree that hashing the deletes table can be slower than sequentially reading it as a driver for modify.
Your example is HAVE2 below, while the proper hash analog is HAVE3, both of which are faster than my non-indexed original example (HAVE1):
data HAVE1 HAVE2 HAVE3
deletes (keep=I);
length x $200; retain x ' ';
do I=1 to 5e6;
output have1 have2 have3;
if mod(i,100)=0 then output deletes;
end;
run;
data have1;
if _n_=1 then do;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
end;
modify have1 ;
if h.find()=0 then remove;
run;
proc sql;
create index I on have2;
data have2;
set deletes;
modify have2 key=I;
if _IORC_= %sysrc(_sok) then remove; else _ERROR_=0;
run;
proc sql;
create index I on have3;
data have3;
declare hash h (dataset:'deletes');
h.definekey('i');
h.definedone();
declare hiter hi ('h');
do rc=hi.first() while (rc=0);
modify have3 key=I;
if _IORC_= %sysrc(_sok) then remove; else _ERROR_=0;
rc=hi.next();
end;
stop;
run;
Edited correction: This code only deletes one record, as noted by @ChrisNZ, because it doesn't iterate.
do rc=hi.first() while (rc=0);
modify have3 key=I;
if _IORC_=%sysrc(_sok) then remove; else _ERROR_=0;
rc=hi.next();
end;
It should have a "by 0" in the DO statement to permit iteration and deletion of all the records:
do rc=hi.first() by 0 while (rc=0);
modify have3 key=I;
if _IORC_=%sysrc(_sok) then remove; else _ERROR_=0;
rc=hi.next();
end;