Yura2301
Quartz | Level 8

Hi all,

I have a general question.

I need to delete ~70% of the observations from some huge tables (millions of observations); the biggest is ~100 GB, the rest around 30 GB.

The table contains snapshots of data by separate dates. The "Date" column has the needed index, and the table also contains a few other indexes on different columns.

So the most obvious variant is to delete by the indexed column, something like:

proc sql;
   delete from tabname where dato < &cutDate;
quit;

It also looks like an efficient variant should be (sketched in code below):

  1. Select the needed records (~30%) and put them in a separate new table in the same library.
  2. Drop the old huge table.
  3. Rename the table created in step 1 to the needed name.
  4. Recreate the indexes.
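For illustration, a minimal sketch of those steps in PROC SQL form (the library name mylib and the new table name tabname_new are placeholders of mine, not from the original post):

proc sql;
   /* 1. keep only the rows on or after the cut-off date (~30%) */
   create table mylib.tabname_new as
   select *
   from mylib.tabname
   where dato >= &cutDate;

   /* 2. drop the old huge table */
   drop table mylib.tabname;
quit;

/* 3. rename the new table to the original name */
proc datasets lib=mylib nolist;
   change tabname_new=tabname;
quit;

/* 4. recreate the indexes afterwards */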

So maybe someone has experience with similar tasks. Is there an efficient procedure that deletes records faster than a simple DELETE, without changing the table's BUFSIZE?

Thanks!


6 REPLIES
jakarman
Barite | Level 11

It all depends on the physical location of that data. At 100 GB I would not expect a classic SAS dataset on a single-spindle location.

Suppose the data is located in a dedicated external RDBMS and you first need to validate &cutdate. Then a view defined in that external DBMS could be a good fit.

If you are going to query it heavily, you can materialize that data as a new table inside that external RDBMS.

If you need that 30% locally in your SAS environment, you can copy it there (create a new table).

With this approach the data gets smaller as you proceed, and you can speed up your analytic processing.

Dropping a table deletes all of its data; you lose the option of going back a few steps.

Deleting rows can be time-consuming because of locking and random IO access.

Creating a new table (SAS dataset) with PROC SQL can take advantage of threading. The disadvantage can be random IO, which is the most time-consuming part.

Creating a new table (SAS dataset) with a DATA step can minimize IO time, but it works sequentially in one thread.

With spinning hard disks (no SSD available), the trade-off point is normally around 20% of the data being needed: above that the DATA step is faster, below it PROC SQL.
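As a rough illustration of that trade-off, a sketch of both variants under the assumptions of the original question (library, table, and column names are placeholders; the THREADS option mainly affects PROC SQL's sorting, so actual gains depend on the query):

/* PROC SQL: can use threading, but random IO can dominate */
proc sql threads;
   create table work.table_kept as
   select *
   from mylib.tabname
   where dato >= &cutDate;
quit;

/* DATA step: one thread, but mostly sequential IO */
data work.table_kept;
   set mylib.tabname(where=(dato >= &cutDate));
run;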

---->-- ja karman --<-----
Yura2301
Quartz | Level 8

Hi Jaap,

Actually that 100 GB table isn't an external RDBMS table; it's a really old, lost and forgotten SAS table that I need to cut down as quickly as possible :).

Please correct me if I understood your idea incorrectly:

So you advise creating the new, smaller (30%) table with a DATA step instead of PROC SQL, as I wrote in my post above?

So the final code will look like:

1. Select the needed records:

data table_cutted;
   set table(where=(dato > &cutDate));
run;

2. Rename:

proc datasets lib=...;
   change table_cutted=table;
quit;

3. Recreate the indexes and constraints (a sketch is below).
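For step 3, a minimal sketch of recreating a simple index on the date column with PROC DATASETS (the library name is a placeholder, and any composite indexes are assumptions; repeat INDEX CREATE for each index the original table had):

proc datasets lib=mylib nolist;
   modify table;
   index create dato;   /* simple index on the date column */
quit;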

Thanks!

jakarman
Barite | Level 11

Yep, it should be converted that way into the latest SAS version you are using now. I'm guessing that real old, lost and forgotten table was made by an old SAS version.

---->-- ja karman --<-----
Yura2301
Quartz | Level 8

Hi Jaap,

I just found one more improvement that sped up the deletion by a factor of 4 (!) in my case.

Generally I used the same flow we discussed above, but before the first step I set the buffer size to its maximum for the SAS session:

options bufno=1 bufsize=max;

That option sped up querying those 30% of the data by a factor of 4, but obviously the resulting table becomes bigger because of the maximum buffer size.

After that I rolled the buffer size on the created data set back from max to the usual 4096; that operation took only about 2 minutes, and the table size decreased to exactly what it would have been without the max buffer size.

For now I don't see any downsides to this step, so I'll probably use the BUFSIZE=MAX option in similar cases (a sketch of the roll-back is below).
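Since BUFSIZE is fixed when a data set is created, rolling it back means rewriting the data set with an explicit BUFSIZE= data set option. A minimal sketch of what that roll-back could look like (the data set name follows the earlier example; 4096 matches the usual buffer size mentioned above):

data table_cutted(bufsize=4096);   /* rewrite with the original 4 KB page size */
   set table_cutted;
run;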

Thanks!

jakarman
Barite | Level 11

Makes sense, as increasing the BUFSIZE will minimize IO. The limits of the BUFSIZE depend on the OS and the SAS version.

The OS often already has some caching, and the same goes for the spinning hard disk. (An SSD is not using a spinning device.)

As of the latest SAS 9.3 releases you may be able to use ALIGNSASIOFILES to improve the interaction between SAS and the IO system.

See the SAS(R) 9.4 Companion for UNIX Environments, Fourth Edition: the maximum BUFSIZE is 2 GB.

A BUFSIZE of 4 KB is really small and only advisable for small data sets. That setting is for minimizing spilled space in the last buffer/block.

Even in the old mainframe days I preferred the 27 KB setting (half track) for this speed reason.

There is a disadvantage to that big BUFSIZE: it shows up the moment you use random (indexed) access.

Randomly hitting a buffer somewhere means loading/reloading those 2 GB pages into memory even if your record is only 1 MB (and that is a big record).
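One way to limit that risk (my own suggestion, not something stated in the thread) is to apply the large page size only to the sequential copy via the BUFSIZE= data set option instead of the global system option, and to verify the resulting page size with PROC CONTENTS:

data work.table_cutted(bufsize=128k);   /* large pages only for this sequential copy */
   set mylib.table(where=(dato >= &cutDate));
run;

proc contents data=work.table_cutted;   /* 'Data Set Page Size' reports the BUFSIZE in effect */
run;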

---->-- ja karman --<-----
Yura2301
Quartz | Level 8

Hi again Jaap,

Regarding your last description of the disadvantages: I actually rolled back to the previous BUFSIZE afterwards, and to my surprise, on a 6 GB table that operation took just a few minutes.

But you are right about the downsides of a big BUFSIZE; thanks for the good explanations.

