10-01-2017 05:32 AM
I would like to know which is the quickest way to delete rows in a large SAS dataset.
I have tested the ones of the above code. The quickest is the first one, but I want
to know if that is a general rule or it depends on the delete you are doing.
Any other and better way to delete will be apprecciated.
data ventas; set ventas; if cod_centro in ('C1', 'C2') then delete; run; data ventas; set ventas; where cod_centro not in ('C1', 'C2') ; run; data ventas; modify ventas; if cod_centro in ('C1', 'C2') then remove; run; proc sql; delete from ventas where cod_centro in ('C1', 'C2'); quit;
Any advice will be greatly appreciated
10-01-2017 06:11 AM - edited 10-01-2017 06:12 AM
There are a lot of factors to be taken into account here - not least of which is whether or not the column you're deleting on is indexed or is the sort column. Therefore I don't think there's a general rule of thumb you can apply in all cases.
There is another option you can try as well
data ventas; set ventas(where=(cod_centro not in('C1', 'C2')); run;
10-01-2017 06:14 AM
It looks like the WHERE approach (your second example) is best, according to section B in this document:
10-01-2017 09:12 AM
10-01-2017 12:00 PM
If you have a large dataset that is expensive to rewrite, you may want to find ways to rewrite (and re-index) infrequently, say every 10th update, or every month for a dataset that is updated daily. If so, then I'd consider the proc sql ... delete option or the data ...modify...remove option, because they don't rewrite the dataset, they just mark observations (not variables) to be deleted. Then just occasionally use the data ... where approach to physically delete the removed observations.
However, when using proc sql ... delete or data ...modify...remove, remember that the dataset attribute NOBS is not updated. In the example below, NOBS=699 in both the DATA _NULL_ step preceding the modify and the one following modify, but NLOBS (number of logical obs) goes from 699 to 487. This is confirmed by the trailing PROC MEANS reporting only 487 observations (and nmiss=0):
data have; set sashelp.stocks; run; data _null_; set have nobs=nrecs ; dsid=open('have'); nlobs=attrn(dsid,'NLOBS'); rc=close(dsid); put nrecs= nlobs= rc=; stop; run; data have; modify have; where high<60; remove; run; data _null_; set have nobs=nrecs ; dsid=open('have'); nlobs=attrn(dsid,'NLOBS'); rc=close(dsid); put nrecs= nlobs=; stop; run; proc means data=have n nmiss; var high; run;
Unfortunately the SET statement does not have an NLOBS= option, just the NOBS option. That's why you see the open, attrn, and close functions in the DATA _NULL_ steps.