DATA Step, Macro, Functions and more

The most efficient way of deleting observations from huge table.

Regular Contributor
Posts: 161

Hi all,

I have a general question.

I need to delete ~70% of the observations from some huge tables (millions of observations each). The biggest one is ~100 GB; the rest are around 30 GB.

The table contains snapshots of data by separate dates. The "Date" column has the needed index, and the table also contains a few different indexes on different columns.

So the most obvious variant is to delete by the indexed column, something like:

proc sql;
      delete from tabname where dato < &cutDate;
quit;

Also, it looks like an efficient variant would be:

  1. Select the needed records (~30%) and write them to a separate new table in the same library.
  2. Drop the old huge table.
  3. Rename the table created in step 1 to the needed name.
  4. Recreate the indexes.
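The four steps above can be sketched in SAS roughly like this (a sketch only: the libref `mylib`, the table name `tabname`, and the index on `dato` are placeholders from this thread, and the real index definitions depend on your table):

```
/* 1. Copy the ~30% of records to keep into a new table.
   >= so that records exactly on the cut date survive,
   matching the delete condition dato < &cutDate */
data mylib.tabname_new;
   set mylib.tabname(where=(dato >= &cutDate));
run;

/* 2 + 3. Drop the old table and rename the new one */
proc datasets lib=mylib nolist;
   delete tabname;
   change tabname_new=tabname;
quit;

/* 4. Recreate the indexes (names/columns are placeholders) */
proc datasets lib=mylib nolist;
   modify tabname;
   index create dato;
quit;
```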

So maybe someone has experience with similar tasks: is there some efficient procedure that doesn't change the table's bufsize and deletes records more efficiently than a simple delete?

Thanks!


Accepted Solutions
Solution
01-21-2015 10:24 AM
Trusted Advisor
Posts: 3,211

Re: The most efficient way of deleting observations from huge table.

It all depends on the physical location of that data. With 100 GB I would not expect a classic SAS dataset sitting on a single spindle.

Suppose the data is located in a dedicated external RDBMS and you first need to validate &cutDate. Then a view defined in that external DBMS could be a good fit.

When you are going to explore it heavily, you can materialize that data in a new table inside that external RDBMS.

If you want that 30% locally in your SAS environment, you can copy it there (create a new table).

With this approach, as your work proceeds the data gets smaller and analytic processing gets faster.

Dropping a table deletes all of its data; it leaves you unable to go back a few steps.

Deleting rows can be time consuming because of locking and random I/O access.

Creating a new table (SAS dataset) using PROC SQL can take advantage of threading. The disadvantage can be random I/O (usually the most time-consuming part).

Creating a new table (SAS dataset) using a DATA step can minimize I/O time, but it works sequentially in a single thread.

With spinning hard disks (no SSD available), the trade-off point is normally around 20% of the data being kept: above that the SAS DATA step is faster, below it PROC SQL.
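The two creation variants being compared can be sketched as follows (a sketch only; the libref `mylib` and the table/column names `tabname`/`dato` are placeholders taken from this thread):

```
/* Variant A: PROC SQL - can use threading, but tends toward random I/O */
proc sql;
   create table mylib.tabname_new as
   select *
   from mylib.tabname
   where dato >= &cutDate;
quit;

/* Variant B: DATA step - sequential, single-threaded, minimal I/O */
data mylib.tabname_new;
   set mylib.tabname(where=(dato >= &cutDate));
run;
```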

---->-- ja karman --<-----



All Replies
Regular Contributor
Posts: 161

Re: The most efficient way of deleting observations from huge table.

Hi Jaap,

Actually, that 100 GB table isn't an external RDBMS table; it's a really old, lost-and-forgotten SAS table that I need to cut down as quickly as possible :-)

Please correct me if I understood your idea incorrectly:

So you advise creating the new, smaller (30%) table using a DATA step, instead of PROC SQL as I wrote in my post above?

So the final code will look like:

1. Select the records to keep:

data table_cutted;
     set table(where=(dato >= &cutDate));
run;

2. Rename:

proc datasets lib=...;
change table_cutted=table;
quit;

3. Recreate indexes and constraints.
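Step 3 can be done in the same PROC DATASETS run; a sketch (the libref `mylib` and the index names/columns below are hypothetical placeholders, since the thread does not list the real ones):

```
proc datasets lib=mylib nolist;
   modify table;
   index create dato;                /* simple index on the date column */
   index create idx_ab=(colA colB);  /* composite index; names are made up */
quit;
```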

Thanks!

Trusted Advisor
Posts: 3,211

Re: The most efficient way of deleting observations from huge table.

Yep, it should be converted that way into the latest SAS version you are now using. I'm guessing that really old, lost-and-forgotten one was made by an old SAS version.

---->-- ja karman --<-----
Regular Contributor
Posts: 161

Re: The most efficient way of deleting observations from huge table.

Hi Jaap,

Just found one more improvement that sped up the deletion by a factor of 4 (!) in my case.

In general I used the same flow we discussed above, but before the first step I set the buffer size to the maximum for the SAS session:

options bufno=1 bufsize=max;

Those options sped up querying those 30% of data by a factor of 4, but obviously the resulting table becomes bigger due to the bufsize=max option.

So after that I rolled the created dataset back to the previous buffer size (from max to the usual 4096). That operation took only about 2 minutes, and the table size decreased to exactly what it would have been had I not used the bufsize=max option at all.

For now I don't see downsides to this step, so I will probably use the bufsize=max option in similar cases.
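A sketch of the rollback step described above: since the page size is fixed when a dataset is created, resetting it means rewriting the dataset with the BUFSIZE= dataset option (the libref `mylib` and table name `table` are placeholders from this thread):

```
/* Rewrite the table with the usual 4 KB page size */
data mylib.table(bufsize=4096);
   set mylib.table;
run;

/* Verify the new page size in the output */
proc contents data=mylib.table;
run;
```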

Thanks!

Trusted Advisor
Posts: 3,211

Re: The most efficient way of deleting observations from huge table.

Makes sense, as increasing the bufsize will minimize I/O. The limits of bufsize depend on the OS and the SAS version.

The OS often already does some caching, as does the spinning hard disk itself. (An SSD does not use a spinning device.)

As of SAS 9.3 (latest release) you may be able to use SASAlignIofiles to improve the interaction between SAS and the I/O system.

See the SAS(R) 9.4 Companion for UNIX Environments, Fourth Edition: the maximum bufsize there is 2 GB.

A bufsize of 4 KB is really small and only advisable for small datasets; that setting is for minimizing wasted space in the last buffer/block.

Even in the old mainframe days I preferred the 27 KB setting (half track) for this speed reason.

There is a disadvantage to that big bufsize, though: it shows up the moment you use random access (indexed).

Randomly hitting a buffer somewhere means loading/reloading those 2 GB pages into memory, even if your record is only 1 MB (and that is already a big record).

---->-- ja karman --<-----
Regular Contributor
Posts: 161

Re: The most efficient way of deleting observations from huge table.

Hi again Jaap,

Regarding your last description of the disadvantages: I actually rolled back to the previous bufsize afterwards, and to my surprise, on a 6 GB table that operation took just a few minutes.

But you are right about the downsides of a big bufsize; thanks for the good explanations.


Discussion stats
  • 6 replies
  • 5688 views
  • 0 likes
  • 2 in conversation