
Delete rows quickly from table containing 200 million observations

Occasional Contributor
Posts: 7



Hi,

 

I am trying to delete rows from one table based on another table. The table from which I want to delete rows has about 200 million records and is growing every day. It is a customer dimension mart which tracks changes on customer attributes.

 

As part of the ETL process, I am deleting the changed (current) records and appending the changed & new records. The whole process is quick except for the deletion step; I believe my query may not be optimized.

 

I do this currently :

 

proc sql;
delete from customer_history
  where customerno in (select distinct customerno from x)
    and current_flag='Y';
quit;

 

The delete process takes more than 2 hours, which defeats the purpose.

 

 

X generally has about 800,000 to 1 million customerno values. X is indexed on customerno. The history table is indexed on customerno, on current_flag, and on a composite of both. The history table has about 200m records.

 

I understand that the process I currently run has to traverse all 200m records and apply the filter as part of the delete.

 

I can probably modify the ETL a bit to do this differently, but before I do that I want to check whether I am doing anything fundamentally wrong here.


Any help is much appreciated. 

Thanks.

PROC Star
Posts: 1,215

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn

What do customer_history and x look like? How many columns are there and are they structured the same way?

Occasional Contributor
Posts: 7

Re: Delete rows quickly from table containing 200 million observations

Thanks for your reply.


No, they are currently not in the same structure. The history table has all the customer attributes (name, dob, address, etc.), about 100 columns. The second table just tells me the customer numbers that have changed data in any of those 100 columns compared to the history.

 

This is then processed to mark the old record as expired and set the new record as current for each customerno.

 

Short answer: the second one just has customerno and is indexed. I can get it into the same structure if that helps.

Ideally I don't want to create a new dataset unless it's unavoidable. I was trying to avoid I/O, but the delete process has introduced heavy I/O. :(

Super User
Posts: 9,922

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn

I'd keep both tables sorted by customerno. X should be sorted with nodupkey.
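A minimal sketch of the two sort steps, assuming one-level dataset names as used in the rest of this thread (point them at the proper library in your ETL):

proc sort data=x nodupkey; /* remove duplicate customer numbers from x */
  by customerno;
run;

proc sort data=customer_history; /* keep the big history table stored in customerno order */
  by customerno;
run;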

Then merge:

data customer_history_new;
merge
  customer_history (in=a)
  x (in=b keep=customerno)
;
by customerno;
if a;
if not b or current_flag ne 'Y';
run; 

Once that step is through, remove the old customer_history and rename the new one.
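For the swap itself, a PROC DATASETS sketch (assuming both datasets sit in the same library; WORK is used here only for illustration):

proc datasets library=work nolist;
  delete customer_history;                      /* drop the old table */
  change customer_history_new=customer_history; /* rename the rebuilt one */
quit;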

Indexes are not helpful here in terms of performance because you deal with at least a large subset of both tables; in such cases the indexes just cause extra I/O and reduce performance overall. Indexes are only good when you need to select less than 10% (rule of thumb) of a dataset in a step.

 

Occasional Contributor
Posts: 7

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to KurtBremser

Thank you for your reply.

 

I had forgotten the basic idea about using indexes with a subset of data. Thanks for reminding me.


I know you say the index is an overhead in this case. But because both datasets are indexed on customerno, can I get away without sorting these two tables and do the merge right away? I thought an index is (logically) a variable sorted with no duplicate entries, so I would imagine I don't have to sort. Please confirm if this is wrong for what I am trying to do.

Thanks.

Super User
Posts: 9,922

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn

The index will give you all matching observations of a dataset, it does not do a nodupkey on its own. So you need to take care of that and forcibly remove all duplicates where they might cause harm.

When dealing with large datasets, it's always an advantage to store them sorted by the column(s) primarily used for access.

In your case that means that keeping everything ordered by customerno will be a real booster.

This is the technique I use for all my datasets.

Occasional Contributor
Posts: 7

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to KurtBremser

Thanks. I tried this: sorted x by customerno with nodupkey.
The process ran for over an hour and I had to kill it. :(

Super User
Posts: 9,922

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn

srihari_v_tcs_com wrote:

Thanks. I tried this: sorted x by customerno with nodupkey.
The process ran for over an hour and I had to kill it. :(


Sorting a dataset with 1 million records and one variable should be done in less than a minute on any reasonably current computer. There's something seriously amiss here.

Do you work on networked drives?

Super User
Posts: 10,691

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn
data class;
 set sashelp.class;
run;
data class;
modify class;
if sex='F' then remove;
run;

Be careful, the table class might get corrupted.

Occasional Contributor
Posts: 7

Re: Delete rows quickly from table containing 200 million observations

Thanks. 

Yes, we have had corruption issues with the SCD load transformation in DI, so I'm trying to move away from that to a simpler solution.

Super User
Posts: 10,691

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn
data class;
 set sashelp.class;
run;


proc iml;
edit class;
delete all where(sex='F');
purge;
quit;

Or, if you have the SAS/IML module.

PROC Star
Posts: 253

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn

The problem may be that the SQL does the following, exactly because you have an index on the history table:

  1. Unique customer numbers are extracted from X. OK, takes very little time as X is small.
  2. For each customer number in X, the corresponding record is looked up using the composite index. This would be the fastest method with just a few (or a few hundred) records, but not with a million: if the history table is not sorted by the index variables, the lookup keeps jumping from one disk segment to another, meaning the whole table may be read from disk many times over.

If you do things with a data step, you can turn the process on its head, so that customer numbers are looked up in the small table instead, e.g.:

data customer_history;
  modify customer_history;
  where current_flag='Y';
  set x(keep=customerno) key=customerno/unique;
  if _iorc_ then
    _error_=0; /* not found, reset _error_ to avoid error message */
  else 
    remove; /* delete the record from history */
run;

The idea is that the X table index hopefully is small enough to fit completely in memory. An alternative is to load the X table to a hash table, which executes slightly faster than the SET with KEY=, but on the other hand takes time to load. As I assume that you are only deleting one record for each value in X, I think the solution shown above is the fastest when X is already indexed.
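For completeness, a sketch of the hash variant mentioned above (assuming customerno has the same type and length in both tables); it loads X into memory once and checks each current history record against it:

data customer_history;
  modify customer_history;
  where current_flag='Y';
  if _n_=1 then do;
    declare hash h(dataset: 'x (keep=customerno)'); /* load the small table into memory */
    h.defineKey('customerno');
    h.defineDone();
  end;
  if h.check()=0 then remove; /* customerno found in x: delete this history record */
run;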

Occasional Contributor
Posts: 7

Re: Delete rows quickly from table containing 200 million observations

Thanks for your help.

 

I tried this. The first time it ran for about 45 min and did the job. Re-ran it and it went into a state where nothing happened and I had to kill the session.

 

After this, I thought I would sort the big (200m) dataset and try your method.

Sorting took about 25 minutes (off-peak hours). The delete then took just 6 minutes!!!

 

I have checked this for multiple days and the behaviour is consistent.

I didn't know sorting a dataset would make that much of a difference!

 

Now I'm trying to figure out if I can sort this dataset more quickly! I guess I'm being greedy now? ;)

 

Thanks to all of you for your help. 

Super User
Posts: 9,922

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn

25 minutes is not a big thing for that kind of data. Balance it with the performance improvement you get.

 

If you want to speed up the operation, first diagnose the process. Are your CPUs loaded to the brim while the sort is running, or do you have lots of I/O waits? I suspect the latter, in which case you need to start optimizing your storage.
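A cheap way to diagnose this (my suggestion, not something covered above) is to turn on FULLSTIMER and rerun the sort; if real time in the log is far larger than the combined CPU times, the step is waiting on disk:

options fullstimer; /* log now reports real time, user CPU time, system CPU time, memory */

proc sort data=customer_history;
  by customerno;
run;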

Get a clear picture of your storage infrastructure.

Consider switching to SSD drives, and putting your logical volumes on stripesets of physical "disks", thereby spreading the physical load across several devices.

Keep permanent storage, WORK and UTILLOC on separate volumes, and also make sure that they are physically separate (unless they are on a SAN, in which case you need to work with your SAN admins). Avoid working on a network share at all costs.

 

If you're CPU-bound, get faster or more processors.

Super User
Posts: 9,922

Re: Delete rows quickly from table containing 200 million observations

Posted in reply to sriharivn

Depending on your available memory, you could avoid the join altogether by creating a lookup format from dataset X:

proc sort
  data=x (keep=customerno)
  out=cntlin (rename=(customerno=start))
  nodupkey
;
by customerno;
run;

data cntlin;
set cntlin end=done;
fmtname = 'checkfmt';
type = 'C';
label = 'yes';
hlo = ' ';
output;
if done
then do;
  hlo = 'O';
  start = '**OTHER**';
  label = 'no';
  output;
end;
run;

proc format library=work cntlin=cntlin;
run;

data customer_history_new;
set customer_history;
if current_flag ne 'Y' or put(customerno, $checkfmt.) = 'no';
run;