Hi,
I would like to know which is the quickest way to delete rows in a large SAS dataset.
I have tested the ones of the above code. The quickest is the first one, but I want
to know if that is a general rule or it depends on the delete you are doing.
Any other and better way to delete will be apprecciated.
data ventas;
set ventas;
if cod_centro in ('C1', 'C2') then delete;
run;
data ventas;
set ventas;
where cod_centro not in ('C1', 'C2') ;
run;
data ventas;
modify ventas;
if cod_centro in ('C1', 'C2') then remove;
run;
proc sql;
delete from ventas where cod_centro in ('C1', 'C2');
quit;
Any advice will be greatly appreciated
There are a lot of factors to be taken into account here - not least of which is whether or not the column you're deleting on is indexed or is the sort column. Therefore I don't think there's a general rule of thumb you can apply in all cases.
There is another option you can try as well
data ventas;
set ventas(where=(cod_centro not in('C1', 'C2'));
run;
It looks like the WHERE approach (your second example) is best, according to section B in this document:
http://www.phusewiki.org/docs/2010/2010%20PAPERS/CC03%20Paper.pdf
If you have a large dataset that is expensive to rewrite, you may want to find ways to rewrite (and re-index) infrequently, say every 10th update, or every month for a dataset that is updated daily. If so, then I'd consider the proc sql ... delete option or the data ...modify...remove option, because they don't rewrite the dataset, they just mark observations (not variables) to be deleted. Then just occasionally use the data ... where approach to physically delete the removed observations.
However, when using proc sql ... delete or data ...modify...remove, remember that the dataset attribute NOBS is not updated. In the example below, NOBS=699 in both the DATA _NULL_ step preceding the modify and the one following modify, but NLOBS (number of logical obs) goes from 699 to 487. This is confirmed by the trailing PROC MEANS reporting only 487 observations (and nmiss=0):
data have;
set sashelp.stocks;
run;
data _null_;
set have nobs=nrecs ;
dsid=open('have');
nlobs=attrn(dsid,'NLOBS');
rc=close(dsid);
put nrecs= nlobs= rc=;
stop;
run;
data have;
modify have;
where high<60;
remove;
run;
data _null_;
set have nobs=nrecs ;
dsid=open('have');
nlobs=attrn(dsid,'NLOBS');
rc=close(dsid);
put nrecs= nlobs=;
stop;
run;
proc means data=have n nmiss;
var high;
run;
Unfortunately the SET statement does not have an NLOBS= option, just the NOBS option. That's why you see the open, attrn, and close functions in the DATA _NULL_ steps.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.