About sriharivn

sriharivn · 01-15-2018

Thanks for your help. I tried this. The first time it ran for about 45 minutes and did the job. When I re-ran it, it hung with nothing happening and I had to kill the session. After that, I decided to sort the big (200m) dataset and try your method. Sorting took about 25 minutes (off-peak hours). The delete then took just 6 minutes!!! I have checked this over multiple days and the behaviour is consistent. I didn't know sorting a dataset would make that much of a difference! Now I'm trying to figure out whether I can sort this dataset quickly. I guess I'm being greedy now? 😉 Thanks to all of you for your help.

sriharivn · 01-10-2018

Thanks Kurt. One clarification on the last step. I want to keep all the old records for the customer (flag='N') and delete only the Ys in the history for the customers in X. I want to keep the Ys for the other customers, which are not present in X.

data customer_history_new;
  set customer_history;
  if current ne 'Y' or put(customer_no,$checkfmt.) = 'no';
run;

I guess this will remove all the Ys in the history and give me only Ns, and also remove the X customers. Is my understanding right?
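For context, a lookup format like $checkfmt. is usually built from the key table with PROC FORMAT's CNTLIN= option. The sketch below is only an illustration under some assumptions: X holds a character variable customerno with no duplicate values, customers in X map to 'yes', and everything else maps to 'no'. The control data set name checkfmt_cntl is hypothetical.

```sas
/* Assumption: x.customerno is character and already de-duplicated. */
data checkfmt_cntl;
  set x(keep=customerno rename=(customerno=start)) end=last;
  retain fmtname 'checkfmt' type 'C';   /* type 'C' = character format */
  length label $3;
  label = 'yes';                        /* customer is present in X */
  output;
  if last then do;
    hlo = 'O';                          /* OTHER: any value not listed above */
    label = 'no';
    output;
  end;
run;

proc format cntlin=checkfmt_cntl;
run;

/* Subsetting IF from the post: a row is kept when it is not current
   OR when its customer is not in X. */
data customer_history_new;
  set customer_history;
  if current ne 'Y' or put(customer_no, $checkfmt.) = 'no';
run;
```

Read as boolean logic, with a format built this way the subsetting IF drops only the rows that are both current='Y' and belong to a customer in X. The N rows for every customer, and the Y rows for customers not in X, are kept.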

sriharivn · 01-10-2018

Thanks. I tried this, and sorted X by customerno with nodupkey. The process ran for over an hour and I had to kill it 😞

sriharivn · 01-10-2018

Thanks. Yes, we have had corruption issues with the SCD load transformation in DI, so we are trying to move away from that to a simpler solution.

sriharivn · 01-10-2018

Thank you for your reply. I had forgotten the basic idea that indexes help when selecting a subset of the data; thanks for reminding me. I know you say the index is an overhead in this case. But since both datasets are indexed on customerno, can I get away without sorting these 2 tables and do this merge right away? I thought an index is (logically) a variable sorted with no duplicate entries, so I would imagine I don't have to sort. Please confirm whether this is wrong for what I am trying to do. Thanks.
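The merge in question could look roughly like the sketch below. It assumes the variable names customerno and current_flag from the original question, and that X holds one row per customerno. As far as I understand, SAS can use an index to satisfy a BY statement when a dataset is not physically sorted, but reading a large table in index order may be much slower than reading it sorted, so treat this as something to time rather than a recommendation.

```sas
/* Assumption: both tables are sorted by, or indexed on, customerno,
   and X contains each customerno at most once. */
data customer_history_new;
  merge customer_history(in=in_hist)
        x(in=in_x keep=customerno);
  by customerno;
  /* Keep every history row except current rows of customers in X. */
  if in_hist and not (in_x and current_flag = 'Y');
run;
```

Because X carries only the key, nothing in the history rows is overwritten, and in_x stays set for every history row in the same BY group.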

sriharivn · 01-10-2018

Thanks for your reply. No, they are currently not in the same structure. The history table has all the customer attributes (name, dob, address etc.), about 100 columns. The second table just tells me which customer numbers have changed data in any of those 100 columns compared to the history. This is then processed to mark the old record as expired and set the new record as current for each customerno. Short answer: the second one just has customerno and is indexed. I can get it into the same structure if it helps. Ideally I don't want to create a new dataset unless it is unavoidable. I am trying to avoid I/O, but I have introduced heavy I/O with the delete process 😞

sriharivn · 01-10-2018

Hi, I am trying to delete rows from one table based on another table. The table I want to delete rows from has about 200 million records and is growing every day. This is a customer dimension mart which tracks changes to customer attributes. As part of the ETL process, I delete the changed (current) records and append the changed and new records. The whole process is quick except for the deletion, and I believe it's my query that may not be optimized. I do this currently:

proc sql;
  delete from customer_history
  where customerno in (select distinct customerno from x)
    and current_flag='Y';
quit;

The delete takes more than 2 hours, which defeats the purpose. X generally has about 800,000 to 1 million customerno values and is indexed on customerno. The history table is indexed on customerno, on current_flag, and has a composite index on both. I understand that the process I run currently has to traverse all 200m records and apply the filter as part of the delete. I can probably modify the ETL a bit to do this differently, but before I do that I want to see if I am doing anything fundamentally wrong here. Any help is much appreciated. Thanks.
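One common alternative to a subquery delete, shown here only as a sketch and not as something proposed in the thread, is a single DATA step pass with a hash object lookup. It assumes the roughly 1 million keys in X fit in memory, and it writes a new table (customer_history_new, a hypothetical name) instead of deleting in place, which trades extra disk space for one sequential read of the history.

```sas
data customer_history_new;
  if _n_ = 1 then do;
    /* Load the changed customer numbers from X into memory once. */
    declare hash changed(dataset: 'x(keep=customerno)');
    changed.defineKey('customerno');
    changed.defineDone();
  end;
  set customer_history;
  /* check() returns 0 when the current customerno is found in X. */
  if current_flag = 'Y' and changed.check() = 0 then delete;
run;
```

Every row that is not both current and present in X is written to the new table, which matches the intent of the delete.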

Online Status	Offline
Date Last Visited	‎04-30-2019 10:21 AM

Re: Delete rows quickly from table containing 200 million observations
