01-10-2018 06:11 AM - edited 01-10-2018 06:26 AM
I am trying to delete rows from one table based on another table. The table from where I want to delete rows has about 200 million records and is growing everyday. This is a customer dimension mart which tracks changes on customer attributes.
As part of the ETL process, I am deleting changed (current) records and appending changed & new records. The whole process is quick except for the deletion step, which I believe is slow because my query may not be optimized.
I do this currently :
proc sql;
  delete from customer_history
  where customerno in (select distinct customerno from x)
    and current_flag='Y';
quit;
Delete process takes more than 2 hours which defeats the purpose.
X generally has about 800,000 to 1 million customerno values. X is indexed on customerno. The history table is indexed on customerno, on current_flag, and has a composite index on both. History has about 200m records.
I understand the process I run currently has to traverse through all 200m records and apply the filter as part of the delete process.
I can probably modify the ETL a bit to do this differently but before I do that I want to see if I am doing anything fundamentally wrong here?
Any help is much appreciated.
01-10-2018 06:25 AM
Thanks for your reply.
No, they currently do not have the same structure. The history table has all customer attributes (name, dob, address etc.), about 100 columns. The second table just tells me the customer numbers whose data has changed in any of the 100 columns compared to the history.
This is then processed to mark old record as expired and set the new records as current for each customerno.
Short answer the second one just has customerno and is indexed. I can get it in the same structure if it helps.
I don't want to create a new dataset ideally, unless it's unavoidable. I am trying to avoid I/O, but I have introduced heavy I/O with the delete process.
01-10-2018 07:05 AM
I'd keep both tables sorted by customerno. X should be sorted with nodupkey.
data customer_history_new;
  merge customer_history (in=a) x (in=b keep=customerno);
  by customerno;
  if a;
  if not b or current_flag ne 'Y';
run;
Once that step is through, remove the old customer_history and rename the new one.
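The remove-and-rename swap can be done with PROC DATASETS so no data is copied; this is a sketch, where mylib is a placeholder for your permanent library:

```sas
proc datasets library=mylib nolist;
  delete customer_history;                        /* drop the old table */
  change customer_history_new=customer_history;   /* rename the new one */
quit;
```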
Indexes are not helpful here in terms of performance because you deal with at least a large subset of both tables; in such cases the indexes just cause extra I/O and reduce performance overall. Indexes are only good when you need to select less than 10% (rule of thumb) of a dataset in a step.
01-10-2018 07:59 AM
Thank you for your reply.
I did forget the basic idea about using indexes with subset of data. thanks for reminding me about it.
I know you say the index is an overhead in this case. But because both datasets are indexed on customerno, can I get away without sorting these 2 tables and do this merge right away? I thought an index is (logically) a variable sorted with no duplicate entries, so I would imagine I don't have to sort. Please confirm if this is wrong for what I am trying to do.
01-10-2018 08:10 AM
The index will give you all matching observations of a dataset, it does not do a nodupkey on its own. So you need to take care of that and forcibly remove all duplicates where they might cause harm.
When dealing with large datasets, it's always an advantage to store them sorted by the column(s) primarily used for access.
In your case that means that keeping everything ordered by customerno will be a real booster.
This is the technique I use for all my datasets.
01-11-2018 01:31 AM
Thanks. I tried this: sorted x by customerno with nodupkey.
The process ran for over an hour and I had to kill it.
Sorting a dataset with 1 million records and one variable should be done in less than a minute on any reasonably current computer. There's something seriously amiss here.
Do you work on networked drives?
01-10-2018 07:38 AM
data class;
  set sashelp.class;
run;

data class;
  modify class;
  if sex='F' then remove;
run;
Be careful: with MODIFY, the table CLASS might get corrupted if the step is interrupted.
01-10-2018 08:01 AM
Yes we have had corruption issues with the SCD load transformation in DI, so trying to move away from that to a simpler solution.
01-10-2018 08:08 AM
data class;
  set sashelp.class;
run;

proc iml;
  edit class;
  delete all where(sex='F');
  purge;
quit;
Or this, if you have the SAS/IML module.
01-10-2018 08:47 AM
The problem may be exactly that you have an index on the history table: the SQL then probes the big table's index for every customer number, causing a lot of random I/O.
If you do things with a datastep, you can turn the process on its head, so that customer numbers are looked up in the small table instead, e.g.:
data customer_history;
  modify customer_history;
  where current_flag='Y';
  set x(keep=customerno) key=customerno / unique;
  if _iorc_ then _error_=0;  /* not found: reset _error_ to avoid error message */
  else remove;               /* delete the record from history */
run;
The idea is that the X table index hopefully is small enough to fit completely in memory. An alternative is to load the X table to a hash table, which executes slightly faster than the SET with KEY=, but on the other hand takes time to load. As I assume that you are only deleting one record for each value in X, I think the solution shown above is the fastest when X is already indexed.
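For reference, the hash-object variant mentioned above could look roughly like this. It is an untested sketch that assumes X (customerno only) fits comfortably in memory:

```sas
data customer_history;
  /* load X into a hash table once, on the first iteration */
  if _n_=1 then do;
    declare hash h(dataset:'x(keep=customerno)');
    h.defineKey('customerno');
    h.defineDone();
  end;
  modify customer_history;
  where current_flag='Y';
  /* check() returns 0 when customerno is found in the hash */
  if h.check()=0 then remove;
run;
```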
01-15-2018 04:14 AM
Thanks for your help.
I tried this. The first time it ran for about 45 min and did the job. Re-ran it and it went into a state where nothing happened and I had to kill the session.
After this, I thought I will sort the big (200m) dataset and try your method.
Sorting took about 25 min (off-peak hours). The delete then took just 6 minutes!
I have checked this for multiple days and the behaviour is consistent.
I didn't know sorting a dataset would make that much of a difference!
Now trying to figure out if I can sort this dataset quickly! I guess I'm being greedy now?
Thanks to all of you for your help.
01-15-2018 04:43 AM
25 minutes is not a big thing for that kind of data. Balance it with the performance improvement you get.
If you want to speed up the operation, first diagnose the process. Are your CPUs loaded to the brim while the sort is running, or do you have lots of I/O waits? I suspect the latter, in which case you need to start optimizing your storage.
Get a clear picture of your storage infrastructure.
Consider switching to SSD drives, and putting your logical volumes on stripesets of physical "disks", thereby spreading the physical load on several devices.
Keep permanent storage, WORK and UTILLOC on separate volumes, and also make sure that they are physically separate (if they are not on a SAN, in which case you need to work with your SAN admins). Avoid working on a network share by all means.
If you're CPU-bound, get faster or more processors.
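On the software side, two PROC SORT knobs are worth testing before touching hardware. This is only a sketch; the 4G value is a guess you would tune to your server, and TAGSORT (which sorts only the keys plus record tags, then retrieves the observations) may or may not win on a wide 100-column table:

```sas
options sortsize=4G;   /* let the sort use more memory before spilling to disk */

proc sort data=customer_history tagsort;  /* tagsort reduces utility-file space */
  by customerno;
run;
```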
01-10-2018 09:10 AM
Depending on your available memory, you could avoid the join altogether by creating a lookup format from dataset X:
proc sort data=x (keep=customerno) out=cntlin (rename=(customerno=start)) nodupkey;
  by customerno;
run;

data cntlin;
  set cntlin end=done;
  fmtname = 'checkfmt';
  type = 'C';
  label = 'yes';
  hlo = ' ';
  output;
  if done then do;
    hlo = 'O';
    start = '**OTHER**';
    label = 'no';
    output;
  end;
run;

proc format library=work cntlin=cntlin;
run;

data customer_history_new;
  set customer_history;
  if current_flag ne 'Y' or put(customerno, $checkfmt.) = 'no';
run;