Deleting matching records from a huge SAS Dataset

Frequent Contributor
Posts: 107

Deleting matching records from a huge SAS Dataset

Hi

 

I am trying to figure out a quicker way of deleting matching records from a huge dataset. The dataset I am deleting from has at least 1 billion records, and the deletion with the code below seems to take very long. I would appreciate it if anyone could suggest a quicker process.

 

Thanks

 

This is the code I am using:

 

proc sql;
  delete from lib1.master
  where name in (select name from lib2.transaction);
quit;

 

 

Super User
Posts: 7,955

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to KiranMaddi

Well, one way you could try is to use the small dataset to generate the code:

data _null_;
  set lib2.transaction end=last;
  if _n_=1 then call execute('data lib1.master_updated; set lib1.master (where=(name not in ('||quote(strip(name)));
  else call execute(','||quote(strip(name)));   /* append each further name to the in-list */
  if last then call execute('))); run;');
run;

Not tested, as no test data was provided (in the form of a data step), but what this should do is generate one big data step whose WHERE clause lists all your categories; in resource terms that single data step should be quicker than PROC SQL.

Super User
Posts: 5,429

Re: Deleting matching records from a huge SAS Dataset

@RW9 - you are rewriting the whole table, and that would be the resource-consuming part. And when you are going with a table-scan approach, there is no evidence that a data step is faster than SQL...?

 

1 billion records, and you are using the Base engine...? If you have regular requirements like this, you need to look at a different approach. When it comes to delete-from operations, SAS SPD Server has a fast-pass delete functionality. Not sure how it works, but hopefully much better than Base SAS.

 

You <might> get lucky if your data is indexed by name, but I don't have SAS at my fingertips right now, and I can't find any relevant documentation that discusses delete-from-where and index use. Try it on a smaller subset, with _method and options msglevel=i;, and see how it responds.
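A minimal sketch of that diagnostic run (untested; lib1.master_sample is a hypothetical small copy of the master table):

options msglevel=i;               /* log notes about index usage */
proc sql _method;                 /* print the planner's chosen methods in the log */
  delete from lib1.master_sample  /* hypothetical small copy for testing */
  where name in (select name from lib2.transaction);
quit;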

Data never sleeps
Super User
Posts: 7,955

Re: Deleting matching records from a huge SAS Dataset

Hi LinusH. Yes, I totally agree, using other software to process this would be ideal, but that can cost. I suggested the code generation approach as, whilst it does indeed re-write the dataset (I don't think there is an in-place alter-table method in Base), it doesn't load as much into memory as SQL. So it should just be read/write to disk. I suppose it depends on what "resources" you need to save.

PROC Star
Posts: 1,760

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to KiranMaddi

 

 

Agree regarding the Base engine, and about not rewriting the table if possible.

SPDE should be much more efficient for such a large table (but see my test below).

As for not rewriting: ideally we'd just set the delete flag rather than rewrite the table, if the number of records affected is low.
I don't know how SAS can do this outside of SAS/FSP, however.

The main question is how many records are deleted. The data step above is no good for a million values.

 

An index will speed up retrieval of the values to be deleted, but will cost the initial build plus the rebuild after deletion.

 

Here is a quick benchmark of some of the options above. SPDE disappoints. Indexes are no good. A CREATE statement is faster than a DELETE statement. A simple data step is the fastest in this case.

 

You should run a similar benchmark on your data and your machine in order to decide.

 

More performance benchmarks in http://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490

 

 


libname SPEEDY spde "%sysfunc(pathname(work))";

data SPEEDY.HAVE1 SPEEDY.HAVE2 SPEEDY.HAVE3 SPEEDY.HAVE4
     SPEEDY.HAVE5(index=(I)) SPEEDY.HAVE6(index=(I))
     WORK.HAVE1 WORK.HAVE2 WORK.HAVE3 WORK.HAVE4
     WORK.HAVE5(index=(I)) WORK.HAVE6(index=(I));
  do I=1 to 50e6;
    output;
  end;
run;

data FILTER;
  do I=1 to 5000;
    V=int(ranuni(0)*1e7);
    output;
  end;
  drop I;
run;

proc sql;
  select V into :filter separated by ',' from FILTER;
quit;

 

 

data HAVE1;
  set HAVE1;
  where I not in (&filter);
run;
%* Base Engine, data step;
real time        6.41 seconds
user cpu time    5.61 seconds
system cpu time  0.79 seconds

proc sql;
  delete from HAVE2
  where I in (&filter);
quit;
%* Base Engine, sql delete + list;
real time       10.30 seconds
user cpu time    9.71 seconds
system cpu time  0.59 seconds

proc sql;
  delete from HAVE3
  where I in (select V from FILTER);
quit;
%* Base Engine, sql delete + subquery;
real time       24.57 seconds
user cpu time   23.83 seconds
system cpu time  0.73 seconds

 

proc sql nowarnrecurs;
  create table HAVE4 as
  select * from HAVE4 where I not in (select V from FILTER);
quit;
%* Base Engine, sql create + subquery;
real time       11.80 seconds
user cpu time   11.23 seconds
system cpu time  0.57 seconds

proc sql;
  delete from HAVE5
  where I in (select V from FILTER);
quit;
%* Base Engine, sql delete + subquery + index;
real time       26.33 seconds
user cpu time   23.68 seconds
system cpu time  1.26 seconds

proc sql nowarnrecurs;
  create table HAVE6 as
  select * from HAVE6 where I not in (select V from FILTER);
quit;
%* Base Engine, sql create + subquery + index;
real time       11.81 seconds
user cpu time   11.21 seconds
system cpu time  0.57 seconds

 

data SPEEDY.HAVE1;
  set SPEEDY.HAVE1;
  where I not in (&filter);
run;
%* SPDE Engine, data step;
real time       10:16.12
user cpu time   27:33.64
system cpu time     0.29 seconds

proc sql;
  delete from SPEEDY.HAVE2
  where I in (&filter);
quit;
%* SPDE Engine, sql delete + list;
real time       10:08.50
user cpu time   27:17.57
system cpu time     0.62 seconds

proc sql;
  delete from SPEEDY.HAVE3
  where I in (select V from FILTER);
quit;
%* SPDE Engine, sql delete + subquery;
real time       19.46 seconds
user cpu time   20.07 seconds
system cpu time  0.34 seconds

 

proc sql nowarnrecurs;
  create table SPEEDY.HAVE4 as
  select * from SPEEDY.HAVE4 where I not in (select V from FILTER);
quit;
%* SPDE Engine, sql create + subquery;
real time       12.16 seconds
user cpu time   12.97 seconds
system cpu time  0.20 seconds

proc sql;
  delete from SPEEDY.HAVE5
  where I in (select V from FILTER);
quit;
%* SPDE Engine, sql delete + subquery + index;
real time       19.61 seconds
user cpu time   19.82 seconds
system cpu time  0.68 seconds

proc sql nowarnrecurs;
  create table SPEEDY.HAVE6 as
  select * from SPEEDY.HAVE6 where I not in (select V from FILTER);
quit;
%* SPDE Engine, sql create + subquery + index;
real time       12.38 seconds
user cpu time   12.76 seconds
system cpu time  0.51 seconds

Super User
Posts: 5,429

Re: Deleting matching records from a huge SAS Dataset

Great summary @ChrisNZ!

And I have the same experience using DELETE FROM on an SPDE table: it really is no good. Recreating with a WHERE clause was much faster :(

Data never sleeps
Super User
Posts: 10,028

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to KiranMaddi

It sounds like a hash table thing or a PROC FORMAT thing. I don't know how big the transaction table is; if it is not too big, I suggest using a hash table.
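An untested sketch of that hash lookup (note it still rewrites the table, as LinusH points out below; lib1.master_new is a hypothetical output name):

data lib1.master_new;
  if _n_=1 then do;
    declare hash t(dataset:'lib2.transaction');  /* load transaction keys into memory once */
    t.defineKey('name');
    t.defineDone();
  end;
  set lib1.master;
  if t.find() ne 0;   /* keep the row only when NAME is absent from the hash */
run;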

Super User
Posts: 5,429

Re: Deleting matching records from a huge SAS Dataset

@Ksharp - A hash table? How would that speed up this operation at all? The aim is to find a way not to rewrite the whole table; hash tables are for look-up, not for indexed updates....?

Data never sleeps
Frequent Contributor
Posts: 107

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to KiranMaddi

Hi all

 

Thanks for all your suggestions, and sorry for the delayed response. I had overestimated the record count. Currently the dataset has only 35 million records, but it gets updated with around 35k records every day. So we want to delete any matching records in the big table and replace them with the records from the transaction table.

 

I use SAS 9.3 with EG 5.1. We currently don't have SPD Server.

Super User
Posts: 5,429

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to KiranMaddi

Oh, that's a totally different story!

So "name" is your key?

Create a unique index on name.

Then use a data step with MODIFY and BY; that will overwrite any matching rows.

It doesn't sound like you need to do separate deletes.

With this kind of operation, using SPDE could be beneficial - and it should be fairly easy to test given the data volumes now mentioned.
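An untested sketch of that pattern, assuming name is the unique key (the _IORC_ handling also appends transaction rows whose key is not found in the master, which matters for the follow-up question below):

proc sql;
  create unique index name on lib1.master(name);   /* one-time index build */
quit;

data lib1.master;
  modify lib1.master lib2.transaction;
  by name;                              /* key lookup, served by the unique index */
  select (_iorc_);
    when (%sysrc(_sok))    replace;     /* key found: overwrite the master row in place */
    when (%sysrc(_dsenmr)) do;          /* key not found in master: append the new row */
      output;
      _error_ = 0;                      /* clear the error flag set by the failed match */
    end;
    otherwise do;
      put 'Unexpected _IORC_= ' _iorc_;
      _error_ = 0;
    end;
  end;
run;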

Data never sleeps
Frequent Contributor
Posts: 107

Re: Deleting matching records from a huge SAS Dataset

Hi LinusH. Thanks for your reply. I also forgot to mention that if there are any new records in the transaction table they should be copied over to the master table. In this case I don't think modify works?
Super User
Posts: 5,429

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to KiranMaddi
Of course they will! 😊
Check the doc for syntax and examples.
Data never sleeps
Trusted Advisor
Posts: 1,117

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to KiranMaddi

Yesterday (i.e. based on the initial post), I tried to delete 10,000 out of 10 million obs. (with only one variable) using a data step of the form

data master;
  modify master transact;
  by i;
  remove;   /* delete the matched observation from MASTER */
run;

without an index, though.

 

Result: after several minutes I cancelled the submitted statements because the step seemed to run forever. When I stopped it, fewer than 2,000 obs. had been deleted.

 

In contrast, an ordinary MERGE statement with a subsetting IF took only 1-1.5 seconds. A SET statement with a WHERE condition using a format (based on TRANSACT) took about 3 seconds. Both are sketched below.
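Untested sketches of those two approaches (names follow the MODIFY example above; the format version assumes the keys in TRANSACT are unique):

%* 1) MERGE with a subsetting IF - both datasets sorted by I;
data master;
  merge master(in=inmast) transact(in=intran keep=i);
  by i;
  if inmast and not intran;   /* drop master rows whose key appears in TRANSACT */
run;

%* 2) SET with a WHERE clause using a format built from TRANSACT;
data delfmt / view=delfmt;
  set transact(keep=i rename=(i=start));
  retain fmtname 'delf' label 'Y';   /* numeric format DELF. maps each key to 'Y' */
run;
proc format cntlin=delfmt;
run;

data master2;
  set master;
  where put(i, delf.) ne 'Y';   /* values outside the format's ranges format as the number itself, never 'Y' */
run;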

PROC Star
Posts: 1,760

Re: Deleting matching records from a huge SAS Dataset

Posted in reply to FreelanceReinhard

Yes merging on a sorted table is quite efficient. Modify has always disappointed me.

 

Have you tried the update statement rather than the modify statement?

Trusted Advisor
Posts: 1,117

Re: Deleting matching records from a huge SAS Dataset


With both observation numbers multiplied by 10 (i.e. 100M and 100k) and one "data" variable ($1) added, the UPDATE statement was almost as fast as the MERGE statement (approx. 11 s vs. 10 s).
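For reference, an untested sketch of the UPDATE variant (both datasets sorted by I; UPDATE requires unique BY values in the master):

data master;
  update master transact;   /* transaction rows overwrite matching keys; unmatched ones are appended */
  by i;
run;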

 

Interestingly, it seemed to make no difference whether the datasets (1.5 GB, 1.6 MB) were stored on a hard disk or on a RAM disk/SSD.
