sriharivn
Fluorite | Level 6

Hi,

 

I am trying to delete rows from one table based on another table. The table I want to delete rows from has about 200 million records and is growing every day. It is a customer dimension mart that tracks changes to customer attributes.

 

As part of the ETL process, I delete the changed (current) records and append the changed & new records. The whole process is quick except for the deletion, as my query, I believe, may not be optimized.
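The append side is just a PROC APPEND (changed_and_new here is a stand-in name for my staging table):

proc append base=customer_history data=changed_and_new;
run;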

 

For the delete, I currently do this:

 

proc sql;
delete from customer_history
where customerno in (select distinct customerno from x)
  and current_flag='Y';
quit;

 

The delete alone takes more than two hours, which defeats the purpose.

 

 

X generally has about 800,000 to 1 million customerno values and is indexed on customerno. The history table is indexed on customerno, on current_flag, and on a composite of both; it has about 200 million records.

 

I understand the process I run currently has to traverse all 200 million records and apply the filter as part of the delete.
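(From what I read, MSGLEVEL=I plus the _METHOD option should show in the log whether the index is actually used:)

options msglevel=i;   /* log notes on index usage */
proc sql _method;     /* print the access plan SQL chooses */
  delete from customer_history
  where customerno in (select distinct customerno from x)
    and current_flag='Y';
quit;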

 

I can probably modify the ETL a bit to do this differently, but before I do that I want to check whether I am doing anything fundamentally wrong here.


Any help is much appreciated. 

Thanks.

17 REPLIES
PeterClemmensen
Tourmaline | Level 20

What do customer_history and x look like? How many columns are there and are they structured the same way?

sriharivn
Fluorite | Level 6

Thanks for your reply.


No, they currently do not have the same structure. The history table has all the customer attributes (name, dob, address, etc.), about 100 columns. The second table just tells me the customer numbers that have changed data in any of those 100 columns compared to the history.

 

This is then processed to mark the old record as expired and set the new record as current for each customerno.

 

Short answer: the second one just has customerno and is indexed. I can get it into the same structure if that helps.

Ideally I don't want to create a new dataset unless it's unavoidable. I am trying to avoid I/O, but the delete process introduced heavy I/O 😞

Kurt_Bremser
Super User

I'd keep both tables sorted by customerno. X should be sorted with nodupkey.
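Something like this (dataset names as posted; adjust to your libnames):

proc sort data=x nodupkey;        /* deduplicate the key table */
  by customerno;
run;

proc sort data=customer_history;  /* one-time cost that pays off on every load */
  by customerno;
run;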

Then merge:

data customer_history_new;
merge
  customer_history (in=a)
  x (in=b keep=customerno)
;
by customerno;
if a;                             /* keep only records coming from the history */
if not b or current_flag ne 'Y';  /* drop current records whose key is in x */
run;

Once that step is through, remove the old customer_history and rename the new one.

Indexes are not helpful here in terms of performance because you deal with at least a large subset of both tables; in such cases the indexes just cause extra I/O and reduce performance overall. Indexes are only good when you need to select less than 10% (rule of thumb) of a dataset in a step.

 

sriharivn
Fluorite | Level 6

Thank you for your reply.

 

I forgot the basic idea about using indexes only on a small subset of the data. Thanks for reminding me about it.


I know you say the index is an overhead in this case. But because both datasets are indexed on customerno, can I get away without sorting these two tables and do the merge right away? I thought an index is (logically) a variable sorted with no duplicate entries, so I would imagine I don't have to sort. Please confirm if this is wrong for what I am trying to do.

Thanks.

Kurt_Bremser
Super User

The index will give you all matching observations of a dataset; it does not do a nodupkey on its own. So you need to take care of that and forcibly remove all duplicates where they might cause harm.

When dealing with large datasets, it's always an advantage to store them sorted by the column(s) primarily used for access.

In your case that means that keeping everything ordered by customerno will be a real booster.

This is the technique I use for all my datasets.

sriharivn
Fluorite | Level 6

Thanks. I tried this and sorted x by customerno with nodupkey.
The process ran for over an hour and I had to kill it 😞

Kurt_Bremser
Super User

@Srihari_v_tcs_com wrote:

Thanks. I tried this and sorted x by customerno with nodupkey.
The process ran for over an hour and I had to kill it 😞


Sorting a dataset with 1 million records and one variable should be done in less than a minute on any reasonably current computer. There's something seriously amiss here.

Do you work on networked drives?

Ksharp
Super User
data class;
  set sashelp.class;
run;

data class;
  modify class;              /* update the dataset in place */
  if sex='F' then remove;    /* delete matching rows without rewriting the table */
run;

Be careful: with an in-place MODIFY, the table class might get corrupted (e.g., if the step is interrupted).
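A cheap safeguard is a snapshot first (class_backup is just an example name):

data class_backup;   /* keep a copy before the in-place MODIFY */
  set class;
run;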

sriharivn
Fluorite | Level 6

Thanks. 

Yes, we have had corruption issues with the SCD load transformation in DI, so I am trying to move away from that to a simpler solution.

Ksharp
Super User
data class;
 set sashelp.class;
run;


proc iml;
edit class;                  /* open the dataset for update */
delete all where(sex='F');   /* mark matching rows for deletion */
purge;                       /* physically remove the marked rows */
close class;
quit;

Or, if you have the IML module.

s_lassen
Meteorite | Level 14

The problem may be that the SQL does the following, exactly because you have an index on the history table:

  1. Unique customer numbers are extracted from X. OK, takes very little time as X is small.
  2. For each customer number in X, the corresponding record is looked up using the composite index. Would be the fastest method with just a few (or a few hundred) records, but not with a million (if the history table is not sorted by the index variables, the lookup would be jumping from one disk segment to another all the time, meaning that the whole table may be read from disk many times over).

If you do things with a datastep, you can turn the process on its head, so that customer numbers are looked up in the small table instead, e.g.:

data customer_history;
  modify customer_history;
  where current_flag='Y';
  set x(keep=customerno) key=customerno/unique;
  if _iorc_ then
    _error_=0; /* not found, reset _error_ to avoid error message */
  else 
    remove; /* delete the record from history */
run;

The idea is that the X table index hopefully is small enough to fit completely in memory. An alternative is to load the X table to a hash table, which executes slightly faster than the SET with KEY=, but on the other hand takes time to load. As I assume that you are only deleting one record for each value in X, I think the solution shown above is the fastest when X is already indexed.
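A rough sketch of that hash variant (untested; assumes customerno has the same type and length in both tables):

data customer_history;
  modify customer_history;
  where current_flag='Y';
  if _n_=1 then do;
    declare hash h(dataset:'x(keep=customerno)');  /* load the keys once */
    h.defineKey('customerno');
    h.defineDone();
  end;
  if h.check()=0 then remove;  /* customerno found in x: delete in place */
run;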

sriharivn
Fluorite | Level 6

Thanks for your help.

 

I tried this. The first time, it ran for about 45 minutes and did the job. When I re-ran it, it went into a state where nothing happened and I had to kill the session.

 

After this, I thought I would sort the big (200m) dataset and try your method.

Sorting took about 25 minutes (off-peak hours). The delete then took just 6 minutes!!!

 

I have checked this for multiple days and the behaviour is consistent.

I didn't know sorting a dataset would make that much of a difference !

 

Now I'm trying to figure out if I can sort this dataset quickly! I guess I'm being greedy now? 😉

 

Thanks to all of you for your help. 

Kurt_Bremser
Super User

25 minutes is not a big thing for that kind of data. Balance it with the performance improvement you get.

 

If you want to speed up the operation, first diagnose the process. Are your CPUs loaded to the brim while the sort is running, or do you have lots of I/O waits? I suspect the latter, in which case you need to start optimizing your storage.
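In SAS itself, FULLSTIMER is a quick first check (real time far above CPU time usually points to I/O waits):

options fullstimer;            /* log detailed timing: real vs. CPU time */
proc sort data=customer_history;
  by customerno;
run;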

Get a clear picture of your storage infrastructure.

Consider switching to SSD drives, and putting your logical volumes on stripesets of physical "disks", thereby spreading the physical load on several devices.

Keep permanent storage, WORK and UTILLOC on separate volumes, and also make sure that they are physically separate (if they are on a SAN, you need to work with your SAN admins on that). Avoid working on a network share at all costs.
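As a sketch, these are the invocation options involved (in sasv9.cfg or on the command line; the paths are placeholders):

/* sasv9.cfg (sketch, placeholder paths) */
-WORK    /fastdisk1/saswork
-UTILLOC /fastdisk2/sasutil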

 

If you're CPU-bound, get faster or more processors.

Kurt_Bremser
Super User

Depending on your available memory, you could avoid the join altogether by creating a lookup format from dataset X:

proc sort
  data=x (keep=customerno)
  out=cntlin (rename=(customerno=start))
  nodupkey
;
by customerno;
run;

data cntlin;
set cntlin end=done;
fmtname = 'checkfmt';
type = 'C';
label = 'yes';
hlo = ' ';
output;
if done
then do;
  hlo = 'O';            /* the OTHER entry catches all non-matching values */
  start = '**OTHER**';
  label = 'no';
  output;
end;
run;

proc format library=work cntlin=cntlin;
run;

data customer_history_new;
set customer_history;
if current_flag ne 'Y' or put(customerno,$checkfmt.) = 'no';
run;

