sriharivn
Fluorite | Level 6

Hi,

 

I am trying to delete rows from one table based on another table. The table I want to delete rows from has about 200 million records and is growing every day. It is a customer dimension mart that tracks changes to customer attributes.

 

As part of the ETL process, I delete the changed (current) records and append the changed & new records. The whole process is quick except for the deletion, as my query, I believe, may not be optimized.
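The append side is just a PROC APPEND (changed_and_new here is a stand-in name for my staging table):

proc append base=customer_history data=changed_and_new;
run;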

 

For the delete, I currently do this:

 

proc sql;
delete from customer_history
where customerno in (select distinct customerno from x)
  and current_flag='Y';
quit;

 

The delete alone takes more than two hours, which defeats the purpose.

 

 

X generally has about 800,000 to 1 million customerno values and is indexed on customerno. The history table is indexed on customerno, on current_flag, and on a composite of both; it has about 200 million records.

 

I understand the process I run currently has to traverse all 200 million records and apply the filter as part of the delete.
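(From what I read, MSGLEVEL=I plus the _METHOD option should show in the log whether the index is actually used:)

options msglevel=i;   /* log notes on index usage */
proc sql _method;     /* print the access plan SQL chooses */
  delete from customer_history
  where customerno in (select distinct customerno from x)
    and current_flag='Y';
quit;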

 

I can probably modify the ETL a bit to do this differently, but before I do that I want to check whether I am doing anything fundamentally wrong here.


Any help is much appreciated. 

Thanks.

17 REPLIES
PeterClemmensen
Tourmaline | Level 20

What do customer_history and x look like? How many columns are there and are they structured the same way?

sriharivn
Fluorite | Level 6

Thanks for your reply.


No, they currently do not have the same structure. The history table has all the customer attributes (name, dob, address, etc.), about 100 columns. The second table just tells me the customer numbers that have changed data in any of those 100 columns compared to the history.

 

This is then processed to mark the old record as expired and set the new record as current for each customerno.

 

Short answer: the second one just has customerno and is indexed. I can get it into the same structure if that helps.

Ideally I don't want to create a new dataset unless it's unavoidable. I am trying to avoid I/O, but the delete process introduced heavy I/O 😞

Kurt_Bremser
Super User

I'd keep both tables sorted by customerno. X should be sorted with nodupkey.
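Something like this (dataset names as posted; adjust to your libnames):

proc sort data=x nodupkey;        /* deduplicate the key table */
  by customerno;
run;

proc sort data=customer_history;  /* one-time cost that pays off on every load */
  by customerno;
run;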

Then merge:

data customer_history_new;
merge
  customer_history (in=a)
  x (in=b keep=customerno)
;
by customerno;
if a;                             /* keep only records coming from the history */
if not b or current_flag ne 'Y';  /* drop current records whose key is in x */
run;

Once that step is through, remove the old customer_history and rename the new one.

Indexes are not helpful here in terms of performance because you deal with at least a large subset of both tables; in such cases the indexes just cause extra I/O and reduce performance overall. Indexes are only good when you need to select less than 10% (rule of thumb) of a dataset in a step.

 

sriharivn
Fluorite | Level 6

Thank you for your reply.

 

I forgot the basic idea about using indexes only on a small subset of the data. Thanks for reminding me about it.


I know you say the index is an overhead in this case. But because both datasets are indexed on customerno, can I get away without sorting these two tables and do the merge right away? I thought an index is (logically) a variable sorted with no duplicate entries, so I would imagine I don't have to sort. Please confirm if this is wrong for what I am trying to do.

Thanks.

Kurt_Bremser
Super User

The index will give you all matching observations of a dataset; it does not do a nodupkey on its own. So you need to take care of that and forcibly remove all duplicates where they might cause harm.

When dealing with large datasets, it's always an advantage to store them sorted by the column(s) primarily used for access.

In your case that means that keeping everything ordered by customerno will be a real booster.

This is the technique I use for all my datasets.

sriharivn
Fluorite | Level 6

Thanks. I tried this and sorted x by customerno with nodupkey.
The process ran for over an hour and I had to kill it 😞

Kurt_Bremser
Super User

@Srihari_v_tcs_com wrote:

Thanks. I tried this and sorted x by customerno with nodupkey.
The process ran for over an hour and I had to kill it 😞


Sorting a dataset with 1 million records and one variable should be done in less than a minute on any reasonably current computer. There's something seriously amiss here.

Do you work on networked drives?

Ksharp
Super User
data class;
  set sashelp.class;
run;

data class;
  modify class;              /* update the dataset in place */
  if sex='F' then remove;    /* delete matching rows without rewriting the table */
run;

Be careful: with an in-place MODIFY, the table class might get corrupted (e.g., if the step is interrupted).
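A cheap safeguard is a snapshot first (class_backup is just an example name):

data class_backup;   /* keep a copy before the in-place MODIFY */
  set class;
run;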

sriharivn
Fluorite | Level 6

Thanks. 

Yes, we have had corruption issues with the SCD load transformation in DI, so I am trying to move away from that to a simpler solution.

Ksharp
Super User
data class;
 set sashelp.class;
run;


proc iml;
edit class;                  /* open the dataset for update */
delete all where(sex='F');   /* mark matching rows for deletion */
purge;                       /* physically remove the marked rows */
close class;
quit;

Or, if you have the IML module.

s_lassen
Meteorite | Level 14

The problem may be that the SQL does the following, exactly because you have an index on the history table:

  1. Unique customer numbers are extracted from X. OK, takes very little time as X is small.
  2. For each customer number in X, the corresponding record is looked up using the composite index. Would be the fastest method with just a few (or a few hundred) records, but not with a million (if the history table is not sorted by the index variables, the lookup would be jumping from one disk segment to another all the time, meaning that the whole table may be read from disk many times over).

If you do things with a datastep, you can turn the process on its head, so that customer numbers are looked up in the small table instead, e.g.:

data customer_history;
  modify customer_history;
  where current_flag='Y';
  set x(keep=customerno) key=customerno/unique;
  if _iorc_ then
    _error_=0; /* not found, reset _error_ to avoid error message */
  else 
    remove; /* delete the record from history */
run;

The idea is that the X table index hopefully is small enough to fit completely in memory. An alternative is to load the X table to a hash table, which executes slightly faster than the SET with KEY=, but on the other hand takes time to load. As I assume that you are only deleting one record for each value in X, I think the solution shown above is the fastest when X is already indexed.
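A rough sketch of that hash variant (untested; assumes customerno has the same type and length in both tables):

data customer_history;
  modify customer_history;
  where current_flag='Y';
  if _n_=1 then do;
    declare hash h(dataset:'x(keep=customerno)');  /* load the keys once */
    h.defineKey('customerno');
    h.defineDone();
  end;
  if h.check()=0 then remove;  /* customerno found in x: delete in place */
run;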

sriharivn
Fluorite | Level 6

Thanks for your help.

 

I tried this. The first time, it ran for about 45 minutes and did the job. When I re-ran it, it went into a state where nothing happened and I had to kill the session.

 

After this, I thought I would sort the big (200m) dataset and try your method.

Sorting took about 25 minutes (off-peak hours). The delete then took just 6 minutes!!!

 

I have checked this for multiple days and the behaviour is consistent.

I didn't know sorting a dataset would make that much of a difference !

 

Now I'm trying to figure out if I can sort this dataset quickly! I guess I'm being greedy now? 😉

 

Thanks to all of you for your help. 

Kurt_Bremser
Super User

25 minutes is not a big thing for that kind of data. Balance it with the performance improvement you get.

 

If you want to speed up the operation, first diagnose the process. Are your CPUs loaded to the brim while the sort is running, or do you have lots of I/O waits? I suspect the latter, in which case you need to start optimizing your storage.
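In SAS itself, FULLSTIMER is a quick first check (real time far above CPU time usually points to I/O waits):

options fullstimer;            /* log detailed timing: real vs. CPU time */
proc sort data=customer_history;
  by customerno;
run;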

Get a clear picture of your storage infrastructure.

Consider switching to SSD drives, and putting your logical volumes on stripesets of physical "disks", thereby spreading the physical load on several devices.

Keep permanent storage, WORK and UTILLOC on separate volumes, and also make sure that they are physically separate (if they are on a SAN, you need to work with your SAN admins on that). Avoid working on a network share at all costs.
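As a sketch, these are the invocation options involved (in sasv9.cfg or on the command line; the paths are placeholders):

/* sasv9.cfg (sketch, placeholder paths) */
-WORK    /fastdisk1/saswork
-UTILLOC /fastdisk2/sasutil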

 

If you're CPU-bound, get faster or more processors.

Kurt_Bremser
Super User

Depending on your available memory, you could avoid the join altogether by creating a lookup format from dataset X:

proc sort
  data=x (keep=customerno)
  out=cntlin (rename=(customerno=start))
  nodupkey
;
by customerno;
run;

data cntlin;
set cntlin end=done;
fmtname = 'checkfmt';
type = 'C';
label = 'yes';
hlo = ' ';
output;
if done
then do;
  hlo = 'O';            /* the OTHER entry catches all non-matching values */
  start = '**OTHER**';
  label = 'no';
  output;
end;
run;

proc format library=work cntlin=cntlin;
run;

data customer_history_new;
set customer_history;
if current_flag ne 'Y' or put(customerno,$checkfmt.) = 'no';
run;

