DATA Step, Macro, Functions and more

Deleting rows in a SAS large datasets

Reply
Frequent Contributor
Posts: 122

Deleting rows in a SAS large datasets

Hi,

 

I would like to know which is the quickest way to delete rows in a large SAS dataset.
I have tested the ones of the above code. The quickest is the first one, but I want
to know if that is a general rule or it depends on the delete you are doing.

Any other and better way to delete will be apprecciated.

 

 

data ventas;
set ventas;
if cod_centro in ('C1', 'C2') then delete;
run;

data ventas;
set ventas;
where cod_centro not in ('C1', 'C2') ;
run;

data ventas;
 modify ventas;
 if cod_centro in ('C1', 'C2') then remove;
run;

proc sql;
delete from ventas where cod_centro in ('C1', 'C2');
quit;

Any advice will be greatly appreciated

Super Contributor
Posts: 440

Re: Deleting rows in a SAS large datasets

[ Edited ]
Posted in reply to juanvg1972

There are a lot of factors to be taken into account here - not least of which is whether or not the column you're deleting on is indexed or is the sort column. Therefore I don't think there's a general rule of thumb you can apply in all cases.

 

There is another option you can try as well

 

data ventas;
	set ventas(where=(cod_centro not in('C1', 'C2'));
run;

 

Frequent Contributor
Posts: 95

Re: Deleting rows in a SAS large datasets

Posted in reply to juanvg1972

It looks like the WHERE approach (your second example) is best, according to section B in this document:

 

http://www.phusewiki.org/docs/2010/2010%20PAPERS/CC03%20Paper.pdf

 

 

Norman.
SAS 9.4 (TS1M4) X64_7PRO WIN 6.1.7601 Service Pack 1 Workstation

Super User
Posts: 5,429

Re: Deleting rows in a SAS large datasets

Posted in reply to juanvg1972
First rule out IF, it works only on data in the PDV.
WHERE on the other hand filter out in the input buffer.
Also, I had some bad experience with DELETE FROM, especially with SPDE data.
Data never sleeps
Trusted Advisor
Posts: 1,022

Re: Deleting rows in a SAS large datasets

Posted in reply to juanvg1972

If you have a large dataset that is expensive to rewrite, you may want to find ways to rewrite (and re-index) infrequently, say every 10th update, or every month for a dataset that is updated daily.  If so, then I'd consider the  proc sql ... delete option or the data ...modify...remove option, because they don't rewrite the dataset, they just mark observations (not variables)  to be deleted.  Then just occasionally use the data ... where approach to physically delete the removed observations.

 

However, when using proc sql ...  delete or data ...modify...remove, remember that the dataset attribute NOBS is not updated.  In the example below,  NOBS=699 in both the DATA _NULL_ step preceding the modify and the one following modify, but NLOBS (number of logical obs) goes from 699 to 487.  This is confirmed by the trailing PROC MEANS reporting only 487 observations (and nmiss=0):

 

data have;
  set sashelp.stocks;
run;

data _null_;
  set have nobs=nrecs ;
  dsid=open('have');
  nlobs=attrn(dsid,'NLOBS');
  rc=close(dsid);
  put nrecs= nlobs= rc=;
  stop;
run;

data have;
  modify have;
  where high<60;
  remove;
run;

data _null_;
  set have nobs=nrecs ;
  dsid=open('have');
  nlobs=attrn(dsid,'NLOBS');
  rc=close(dsid);
  put nrecs= nlobs=;
  stop;
run;

proc means data=have n nmiss;
  var high;
run;

 

 

Unfortunately the SET statement does not have an NLOBS= option, just the NOBS option.  That's why you see the open, attrn, and close functions in the DATA _NULL_ steps.

Ask a Question
Discussion stats
  • 4 replies
  • 187 views
  • 5 likes
  • 5 in conversation