<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delete rows quickly from table containing 200 million observations in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426377#M105040</link>
    <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;No, they are currently not in the same structure. The history table has all the customer attributes (name, dob, address, etc.; about 100 columns). The second table just tells me which customer numbers have changed data in any of the 100 columns compared to the history.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is then processed to mark the old records as expired and set the new records as current for each customerno.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Short answer: the second one just has customerno and is indexed. I can get it into the same structure if it helps.&lt;/P&gt;&lt;P&gt;Ideally I don't want to create a new dataset unless it's unavoidable. I am trying to avoid I/O, but I introduced heavy I/O with the delete process &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 10 Jan 2018 11:25:17 GMT</pubDate>
    <dc:creator>sriharivn</dc:creator>
    <dc:date>2018-01-10T11:25:17Z</dc:date>
    <item>
      <title>Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426370#M105035</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to delete rows from one table based on another table. The table I want to delete rows from has about 200 million records and is growing every day. It is a customer dimension mart which tracks changes to customer attributes.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As part of the ETL process, I am deleting changed (current) records and appending changed &amp;amp; new records. The whole process is quick except for the deletion step; I believe my query may not be optimized.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I currently do this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;proc sql;&lt;BR /&gt;delete from customer_history where customerno in (select distinct customerno from x) and current_flag='Y';&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The delete takes more than 2 hours, which defeats the purpose.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;X generally has about 800,000 to 1 million customerno values. X is indexed on customerno. The history table is indexed on customerno, on current_flag, and on a composite of both. History has about 200m records.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I understand the process I currently run has to traverse all 200m records and apply the filter as part of the delete.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can probably modify the ETL a bit to do this differently, but before I do that I want to check whether I am doing anything fundamentally wrong here.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Any help is much appreciated.&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 11:26:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426370#M105035</guid>
      <dc:creator>sriharivn</dc:creator>
      <dc:date>2018-01-10T11:26:41Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426373#M105038</link>
      <description>&lt;P&gt;What do&amp;nbsp;&lt;SPAN&gt;customer_history&amp;nbsp;and&amp;nbsp;x look like? How many columns are there and are they structured the same way?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 11:18:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426373#M105038</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2018-01-10T11:18:42Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426377#M105040</link>
      <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;No, they are currently not in the same structure. The history table has all the customer attributes (name, dob, address, etc.; about 100 columns). The second table just tells me which customer numbers have changed data in any of the 100 columns compared to the history.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is then processed to mark the old records as expired and set the new records as current for each customerno.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Short answer: the second one just has customerno and is indexed. I can get it into the same structure if it helps.&lt;/P&gt;&lt;P&gt;Ideally I don't want to create a new dataset unless it's unavoidable. I am trying to avoid I/O, but I introduced heavy I/O with the delete process &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 11:25:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426377#M105040</guid>
      <dc:creator>sriharivn</dc:creator>
      <dc:date>2018-01-10T11:25:17Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426391#M105045</link>
      <description>&lt;P&gt;I'd keep both tables sorted by customerno. X should be sorted with nodupkey.&lt;/P&gt;
&lt;P&gt;Then merge:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data customer_history_new;
merge
  customer_history (in=a)
  x (in=b keep=customerno)
;
by customerno;
if a;
if not b or current_flag ne 'Y';
run; &lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Once that step is through, remove the old customer_history and rename the new one.&lt;/P&gt;
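&lt;P&gt;The remove-and-rename could be one PROC DATASETS step; a minimal sketch, assuming both datasets sit in the same library (WORK here; substitute your permanent library). Note that any indexes on the old table are dropped along with it:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc datasets library=work nolist;
  delete customer_history;                      /* drop the old table */
  change customer_history_new=customer_history; /* rename the new one */
quit;&lt;/CODE&gt;&lt;/PRE&gt;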
&lt;P&gt;Indexes are not helpful here in terms of performance because you deal with at least a large subset of both tables; in such cases the indexes just cause extra I/O and reduce performance overall. Indexes are only good when you need to select less than 10% (rule of thumb) of a dataset in a step.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 12:05:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426391#M105045</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-01-10T12:05:27Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426410#M105049</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data class;
 set sashelp.class;
run;
data class;
modify class;
if sex='F' then remove;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Be careful: the table class might get corrupted if the step is interrupted.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 12:38:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426410#M105049</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2018-01-10T12:38:42Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426420#M105050</link>
      <description>&lt;P&gt;Thank you for your reply.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I had forgotten the basic idea about using indexes with a subset of data; thanks for reminding me.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I know you say the index is an overhead in this case. But because both datasets are indexed on customerno, can I get away without sorting these 2 tables and do the merge right away? I thought an index is (logically) a variable sorted with no duplicate entries, so I would imagine I don't have to sort. Please confirm if this is wrong for what I am trying to do.&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 12:59:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426420#M105050</guid>
      <dc:creator>sriharivn</dc:creator>
      <dc:date>2018-01-10T12:59:57Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426422#M105052</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;Yes, we have had corruption issues with the SCD load transformation in DI, so we are trying to move away from that to a simpler solution.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 13:01:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426422#M105052</guid>
      <dc:creator>sriharivn</dc:creator>
      <dc:date>2018-01-10T13:01:11Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426427#M105053</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data class;
 set sashelp.class;
run;


proc iml;
edit class;
delete all where(sex='F');
purge;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Or, if you have the SAS/IML module.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 13:08:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426427#M105053</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2018-01-10T13:08:09Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426428#M105054</link>
      <description>&lt;P&gt;The index will give you &lt;EM&gt;all&lt;/EM&gt; matching observations of a dataset, it does not do a nodupkey on its own. So you need to take care of that and forcibly remove all duplicates where they might cause harm.&lt;/P&gt;
&lt;P&gt;When dealing with large datasets, it's always an advantage to store them sorted by the column(s) primarily used for access.&lt;/P&gt;
&lt;P&gt;In your case that means that keeping everything ordered by customerno will be a real booster.&lt;/P&gt;
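&lt;P&gt;The forced dedup could look like this (x_unique is just an illustrative output name):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=x out=x_unique nodupkey;
  by customerno; /* keep exactly one record per customerno */
run;&lt;/CODE&gt;&lt;/PRE&gt;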
&lt;P&gt;This is the technique I use for all my datasets.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 13:10:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426428#M105054</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-01-10T13:10:30Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426442#M105062</link>
      <description>&lt;P&gt;The problem may be that the SQL does the following, exactly because you have an index on the history table:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Unique customer numbers are extracted from X. OK, takes very little time as X is small.&lt;/LI&gt;&lt;LI&gt;For each customer number in X, the corresponding record is looked up using the composite index. Would be the fastest method with just a few (or a few hundred) records, but not with a million (if the history table is not sorted by the index variables, the lookup would be jumping from one disk segment to another all the time, meaning that the whole table may be read from disk many times over).&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;If you do things with a datastep, you can turn the process on its head, so that customer numbers are looked up in the small table instead, e.g.:&lt;/P&gt;&lt;PRE&gt;data customer_history;
  modify customer_history;
  where current_flag='Y';
  set x(keep=customerno) key=customerno/unique;
  if _iorc_ then
    _error_=0; /* not found, reset _error_ to avoid error message */
  else 
    remove; /* delete the record from history */
run;&lt;/PRE&gt;&lt;P&gt;The idea is that the X table index&amp;nbsp;hopefully is small enough to fit completely in memory. An alternative is to load the X table to a hash table, which executes slightly faster than the SET with KEY=, but on the other hand takes time to load. As I assume that you are only deleting one record for each value in X,&amp;nbsp;I think the&amp;nbsp;solution shown above&amp;nbsp;is the fastest when X is already indexed.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 13:47:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426442#M105062</guid>
      <dc:creator>s_lassen</dc:creator>
      <dc:date>2018-01-10T13:47:08Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426447#M105065</link>
      <description>&lt;P&gt;Depending on your available memory, you could avoid the join altogether by creating a lookup format from dataset X:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort
  data=x (keep=customerno)
  out=cntlin (rename=(customerno=start))
  nodupkey
;
by customerno;
run;

data cntlin;
set cntlin end=done;
fmtname = 'checkfmt';
type = 'C';
label = 'yes';
hlo = ' ';
output;
if done
then do;
  hlo = 'O';
  start = '**OTHER**';
  label = 'no';
  output;
end;
run;

proc format library=work cntlin=cntlin;
run;

data customer_history_new;
set customer_history;
if current ne 'Y' or put(customer_no,$checkfmt.) = 'no';
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 10 Jan 2018 14:10:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426447#M105065</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-01-10T14:10:39Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426498#M105081</link>
      <description>&lt;P&gt;Thanks. I tried this: sorted x by customerno with nodupkey.&lt;BR /&gt;The process ran for over an hour and I had to kill it &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 15:39:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426498#M105081</guid>
      <dc:creator>sriharivn</dc:creator>
      <dc:date>2018-01-10T15:39:27Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426502#M105083</link>
      <description>&lt;P&gt;Thanks Kurt. One clarification on the last step.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I want to keep all the old records for the customer (flag='N') and delete only the Ys in the history for the customers in X. I want to keep the Ys for the other records which are not present in X.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE class=" language-sas"&gt;&lt;CODE class="  language-sas"&gt;&lt;SPAN class="token procnames"&gt;data&lt;/SPAN&gt; customer_history_new&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token keyword"&gt;set&lt;/SPAN&gt; customer_history&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token keyword"&gt;if&lt;/SPAN&gt; current &lt;SPAN class="token operator"&gt;ne&lt;/SPAN&gt; &lt;SPAN class="token string"&gt;'Y'&lt;/SPAN&gt; or &lt;SPAN class="token keyword"&gt;put&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;customer_no&lt;SPAN class="token punctuation"&gt;,&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;$&lt;/SPAN&gt;checkfmt&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token string"&gt;'no'&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token procnames"&gt;run&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I guess this will remove all the Ys in the history and give me only Ns and also remove the X customers.&lt;/P&gt;&lt;P&gt;Is my understanding right?&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jan 2018 15:47:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426502#M105083</guid>
      <dc:creator>sriharivn</dc:creator>
      <dc:date>2018-01-10T15:47:58Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426787#M105192</link>
      <description>&lt;P&gt;Yes. The condition is a DeMorgan transformation of your delete condition.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jan 2018 06:28:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426787#M105192</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-01-11T06:28:47Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426788#M105193</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/65896"&gt;@Srihari_v_tcs_com&lt;/a&gt; wrote:&lt;BR /&gt;
&lt;P&gt;Thanks. I tried this , sorted x by customerno with nodupkey.&amp;nbsp;&lt;BR /&gt;The process ran for over an hour and I&amp;nbsp;had to kill it &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Sorting a dataset with 1 million records and one variable should be done in less than a minute on any reasonably current computer. There's something seriously amiss here.&lt;/P&gt;
&lt;P&gt;Do you work on networked drives?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jan 2018 06:31:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/426788#M105193</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-01-11T06:31:23Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/427623#M105513</link>
      <description>&lt;P&gt;Thanks for your help.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried this. The first time it ran for about 45 min and did the job. I re-ran it and it went into a state where nothing happened, and I had to kill the session.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After this, I thought I would sort the big (200m) dataset and try your method.&lt;/P&gt;&lt;P&gt;Sorting took about 25 min (off-peak hours). The delete then took just 6 mins!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have checked this over multiple days and the behaviour is consistent.&lt;/P&gt;&lt;P&gt;I didn't know sorting a dataset would make that much of a difference!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now I'm trying to figure out if I can sort this dataset quickly! I guess I'm being greedy now? &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks to all of you for your help.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jan 2018 09:14:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/427623#M105513</guid>
      <dc:creator>sriharivn</dc:creator>
      <dc:date>2018-01-15T09:14:21Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/427626#M105515</link>
      <description>&lt;P&gt;25 minutes is not a big thing for that kind of data. Balance it with the performance improvement you get.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you want to speed up the operation, first diagnose the process. Are your CPU's loaded to the brim while the sort is running, or do you have lots of I/O waits? I suspect the latter, in which case you need to start optimizing your storage.&lt;/P&gt;
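&lt;P&gt;A first diagnostic step could be SAS's own timing option; a sketch, where big is a stand-in for your dataset:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;options fullstimer; /* log CPU time and memory alongside real time for each step */
proc sort data=big;
  by customerno;
run;
/* real time much larger than CPU time in the log points to I/O waits */&lt;/CODE&gt;&lt;/PRE&gt;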
&lt;P&gt;Get a clear picture of your storage infrastructure.&lt;/P&gt;
&lt;P&gt;Consider switching to SSD drives, and putting your logical volumes on stripesets of physical "disks", thereby spreading the physical load on several devices.&lt;/P&gt;
&lt;P&gt;Keep permanent storage, WORK and UTILLOC on separate volumes, and also make sure that they are physically separate (if they are not on a SAN, in which case you need to work with your SAN admins). Avoid working on a network share by all means.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you're CPU-bound, get faster or more processors.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jan 2018 09:43:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/427626#M105515</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-01-15T09:43:23Z</dc:date>
    </item>
    <item>
      <title>Re: Delete rows quickly from table containing 200 million observations</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/427916#M105592</link>
      <description>&lt;P&gt;1- Formats can be slow for large numbers of values in my experience.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2- Keep in mind that a sorted table will be fast not only for maintenance for you as an admin, but also for all users.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;3- Rewriting sorted tables is just a matter of copying the table, so if you have the space it should not take long&amp;nbsp;considering&amp;nbsp;the volume you have.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;4- If you are I/O bound, using SPDE with binary compression instead of the V9 engine is a must, as it will typically decrease table size by 90%. This engine also processes indexes much faster.&lt;/P&gt;
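&lt;P&gt;Switching to SPDE is only a LIBNAME change; a sketch, with a placeholder path and libref:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname mart spde '/sasdata/mart' compress=binary;
/* datasets written through this libref use the SPDE engine with binary compression */&lt;/CODE&gt;&lt;/PRE&gt;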
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;5- When doing such updates, I like to keep tabs of what happens in the log and usually use something like:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data NEWHISTORY (drop=_:); 
  if LASTOBS then put _REPLACED= _ADDED= _KEPT=; 
  merge HISTORY (in=OLD)
        NEW     (in=NEW)
        end=LASTOBS;
  by CUSTOMERNO CURRENT_FLAG;
  if OLD and NEW then _REPLACED+1;
  else if OLD then    _KEPT    +1;
  else                _ADDED   +1;
run;

proc sort data=NEWHISTORY 
          out =HISTORY(index=( ... )) 
          presorted;
  by CUSTOMERNO CURRENT_FLAG ;
run;

&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Add the deletion logic as you see fit for your data (like:&amp;nbsp; &lt;FONT face="courier new,courier"&gt;if DEL then&amp;nbsp;do; _DELETE+1; delete; end;&lt;/FONT&gt; )&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The sort step is there just because SAS has been sitting on their hands and there is still no way to &lt;A href="https://communities.sas.com/t5/SASware-Ballot-Ideas/create-a-VALIDATESORT-data-set-option-to-validate-the-sort-order/idi-p/288038" target="_self"&gt;set the SORT VALIDATED flag&lt;/A&gt; in the data step. PROC SORT just copies the table here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Index creation is fast on sorted tables.&lt;/P&gt;
</description>
      <pubDate>Tue, 16 Jan 2018 05:09:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Delete-rows-quickly-from-table-containing-200-million/m-p/427916#M105592</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2018-01-16T05:09:34Z</dc:date>
    </item>
  </channel>
</rss>

