Re: Deleting rows in a SAS large datasets

juanvg1972 · Posted 10-01-2017 05:32 AM

Hi,

I would like to know which is the quickest way to delete rows in a large SAS dataset.
I have tested the ones of the above code. The quickest is the first one, but I want
to know if that is a general rule or it depends on the delete you are doing.

Any other and better way to delete will be apprecciated.

data ventas;
set ventas;
if cod_centro in ('C1', 'C2') then delete;
run;

data ventas;
set ventas;
where cod_centro not in ('C1', 'C2') ;
run;

data ventas;
 modify ventas;
 if cod_centro in ('C1', 'C2') then remove;
run;

proc sql;
delete from ventas where cod_centro in ('C1', 'C2');
quit;

Any advice will be greatly appreciated

ChrisBrooks · Posted 10-01-2017 06:11 AM

There are a lot of factors to be taken into account here - not least of which is whether or not the column you're deleting on is indexed or is the sort column. Therefore I don't think there's a general rule of thumb you can apply in all cases.

There is another option you can try as well

data ventas;
	set ventas(where=(cod_centro not in('C1', 'C2'));
run;

Norman21 · Posted 10-01-2017 06:14 AM

It looks like the WHERE approach (your second example) is best, according to section B in this document:

http://www.phusewiki.org/docs/2010/2010%20PAPERS/CC03%20Paper.pdf

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

LinusH · Posted 10-01-2017 09:12 AM

First rule out IF, it works only on data in the PDV.
WHERE on the other hand filter out in the input buffer.
Also, I had some bad experience with DELETE FROM, especially with SPDE data.

Data never sleeps

mkeintz · Posted 10-01-2017 12:00 PM

If you have a large dataset that is expensive to rewrite, you may want to find ways to rewrite (and re-index) infrequently, say every 10th update, or every month for a dataset that is updated daily. If so, then I'd consider the proc sql ... delete option or the data ...modify...remove option, because they don't rewrite the dataset, they just mark observations (not variables) to be deleted. Then just occasionally use the data ... where approach to physically delete the removed observations.

However, when using proc sql ... delete or data ...modify...remove, remember that the dataset attribute NOBS is not updated. In the example below, NOBS=699 in both the DATA _NULL_ step preceding the modify and the one following modify, but NLOBS (number of logical obs) goes from 699 to 487. This is confirmed by the trailing PROC MEANS reporting only 487 observations (and nmiss=0):

data have;
  set sashelp.stocks;
run;

data _null_;
  set have nobs=nrecs ;
  dsid=open('have');
  nlobs=attrn(dsid,'NLOBS');
  rc=close(dsid);
  put nrecs= nlobs= rc=;
  stop;
run;

data have;
  modify have;
  where high<60;
  remove;
run;

data _null_;
  set have nobs=nrecs ;
  dsid=open('have');
  nlobs=attrn(dsid,'NLOBS');
  rc=close(dsid);
  put nrecs= nlobs=;
  stop;
run;

proc means data=have n nmiss;
  var high;
run;

Unfortunately the SET statement does not have an NLOBS= option, just the NOBS option. That's why you see the open, attrn, and close functions in the DATA _NULL_ steps.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Deleting rows in a SAS large datasets