Hi all,
I have a general question.
I need to delete ~70% of the observations from some huge tables (millions of observations); the biggest one is ~100 GB and the rest are around 30 GB.
The table contains snapshots of data by separate dates. The "date" column has the needed index. The table also contains a few different indexes on different columns.
So the most obvious variant is to delete by the indexed column, something like:
proc sql;
   delete from tabname where dato < &cutDate;
quit;
Another variant that looks like it should be efficient is to create a new, smaller table with only the rows to keep and then rename it.
Does anyone have experience with similar tasks? Is there some efficient procedure that deletes records faster than a simple DELETE without changing the table's BUFSIZE?
Thanks!
It all depends on the physical location of that data. At 100 GB I would not expect a classic SAS dataset on a single-spindle location.
Suppose the data is located in a dedicated external RDBMS and you first need to validate &cutDate. Then a view defined in that external DBMS could be a good fit.
If you are going to explore it heavily, you can materialize that data in a new table inside that external RDBMS.
If you need that 30% locally in your SAS environment, you can copy it (create a new table).
With this approach, as your work proceeds you are working with smaller data and can speed up analytic processing.
Dropping tables deletes all of the data; it takes away the option of going back a few steps.
Deleting rows can be time-consuming because of locking and random IO access.
Creating a new table (SAS dataset) using PROC SQL can take advantage of threading. The disadvantage can be random IO (the most time-consuming part).
Creating a new table (SAS dataset) using a DATA step can minimize IO time, but it works sequentially in one thread.
With spinning hard disks (no SSD available) the trade-off point is normally at about 20% of the data being needed: above that the DATA step is faster, below it PROC SQL. Both approaches are sketched below.
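A minimal sketch of the two approaches, assuming the table lives in a library called mylib, the source table is mylib.tabname and dato is the snapshot date column (all names are placeholders, adjust to your data):
/* PROC SQL: can use threading, at the cost of possibly more random IO */
proc sql;
   create table mylib.table_keep as
   select * from mylib.tabname
   where dato >= &cutDate;
quit;

/* DATA step: single-threaded, but reads the source sequentially */
data mylib.table_keep;
   set mylib.tabname(where=(dato >= &cutDate));
run;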
Hi Jaap,
Actually that 100 GB table isn't an external RDBMS table; it's a really old, lost-and-forgotten SAS table that I need to cut down as quickly as possible :).
Please correct me if I understood your idea incorrectly:
So you advise creating a new, smaller (30%) table using a DATA step instead of the PROC SQL delete I wrote in my post above?
So the final code would look like:
1.
data table_cutted;
   set table(where=(dato >= &cutDate));
run;
2. Rename:
proc datasets lib=...;
change table_cutted=table;
quit;
3. Recreate the indexes and constraints, e.g. with something like the sketch below.
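(A sketch only, assuming the library is mylib and the index to restore is a simple index on dato; use your actual index and constraint definitions:)
proc datasets lib=mylib nolist;
   modify table;
   /* recreate the simple index on the date column */
   index create dato;
quit;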
Thanks!
Yep, it should be converted that way into the latest SAS version you are using now. I'm guessing that really old, lost-and-forgotten one was made by an old SAS version.
Hi Jaap,
I just found one more improvement that speeds up the deletion by a factor of 4 (!!!) in my case.
Generally I used the same flow as we discussed above, but before the first step I set the buffer size to the maximum for the SAS session:
options bufno=1 bufsize=max;
Those options speed up querying those 30% of the data by a factor of 4, but obviously the resulting table becomes bigger in size due to the BUFSIZE=MAX option.
So after that I roll back to the previous buffer size on the created data set (from max back to the usual 4096); that operation takes only 2 minutes, and the table size decreases and ends up exactly the same as if I hadn't used BUFSIZE=MAX at all (one way to do this is sketched below).
For now I don't see any downsides to such a step, so I will probably use the BUFSIZE=MAX option in similar cases.
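(The exact roll-back step isn't shown above; one way to do it, as a sketch, is to rewrite the data set with the BUFSIZE= data set option, assuming the table is mylib.table:)
data mylib.table(bufsize=4096);
   set mylib.table;
run;
/* note: a DATA step rewrite does not carry indexes over, so recreate them afterwards as in step 3 */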
Thanks!
Makes sense, as increasing the BUFSIZE will minimize IO. The limits of the BUFSIZE depend on the OS and SAS version.
The OS often already has some caching, as does the spinning hard disk. (An SSD does not use a spinning device.)
As of SAS 9.3 (latest version) you may be able to use ALIGNSASIOFILES to improve the interplay between SAS and the IO system.
See the SAS(R) 9.4 Companion for UNIX Environments, Fourth Edition: the maximum BUFSIZE is 2 GB.
A BUFSIZE of 4K is really small and only advisable for small datasets. That setting is for minimizing wasted space in the last buffer/block.
Even in the old mainframe days I preferred the 27K setting (half track) for this speed reason.
There is a disadvantage to that big BUFSIZE, and it shows up the moment you use random access (indexed).
Randomly hitting a buffer somewhere means loading/reloading those 2 GB pages into memory even if your record is only 1 MB (and that is a big record).
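(A sketch of choosing a moderate page size per data set instead of a session-wide MAX; the 64K value is only an illustrative middle ground, not a recommendation from this thread:)
/* set the page size for this output data set only */
data mylib.table_keep(bufsize=64k);
   set mylib.tabname(where=(dato >= &cutDate));
run;

/* PROC CONTENTS reports the resulting data set page size */
proc contents data=mylib.table_keep;
run;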
Hi again Jaap,
Regarding your last description of the disadvantages: I actually roll back to the previous BUFSIZE afterwards, and to my surprise, on a 6 GB table that operation takes just a few minutes.
But you are right about the downsides of a big BUFSIZE, thanks for the good explanations.