Solved: Delete data from a huge dataset using hash

spastranas · Posted 08-11-2020 10:02 AM

I have to delete row from a huge table. Currently I am doing it with a merge, but it is extremely slow. Is there a way to delete it using has tables?

my code does not fail, but it does not delete the rows.

Here is what I am traying to do, using as an example a dataset from sashelp.

* My Production dataset will have about a billion rows;

data allCarsDataset;

set sashelp.cars;

data acuraDataset;

set sashelp.cars (where=(make='Acura'));

run;

*I need to delete all records from the AllCarsDataset, if they exist on the second dataset;

data _null_;

set WORK.allCarsDataset;

length Make $8;

if _n_=1 then do;

declare hash B(dataset:"acuraDataset");

B.definekey("Make");

B.definedata("Make");

B.definedone();

end;

rc2=B.find();

if rc2=0 then delete;

run;

s_lassen · Posted 08-12-2020 02:43 AM

If I understand the hash code you presented correctly, you want to delete records on the large table where the variable MAKE is on the small table. Hashing the smaller table may speed things up a bit, if it saves you a sort of the large table. A bigger saving may be had by not rewriting the large table, but just delete the records you do not want.

The hash solution you are looking for can be done with a MODIFY statement, meaning that the large table is not completely rewritten:

data WORK.allCarsDataset;
  modify WORK.allCarsDataset;
  if _n_=1 then do;
    declare hash B(dataset:"acuraDataset");
    B.definekey("Make");
    B.definedata("Make");
    B.definedone();
    end;
  rc2=B.find();
  if rc2=0 then remove;
run;

But I do not think that would be faster than a similar SQL solution:

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset where make=allCarsDataset.make);
quit;

I used an EXISTS clause rather than specifying "where make in(select make from acuraDataset)", because the EXISTS condition is easily modified to use more than one key variable, e.g.

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset 
                 where make=allCarsDataset.make and model=allCarsDataset.model
                    );
quit;

If you are not rewriting the large table often, but appending/deleting smaller amounts of data, it may pay off to put an index on the relevant set of keys, e.g.

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

Using an index, the SQL solution is sure to be faster than the hash solution, as the hash solution still reads all of the large table.

Once in a while, you should then rewrite the large table and recreate the index. You can just use a simple datastep, but it may be more efficient to sort the table by the index variables (you can schedule the rebuild to run when there is time for it):

proc sort data=allCarsDataset force;
  by make model;
run;

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

View solution in original post

Kurt_Bremser · Posted 08-11-2020 10:09 AM

You run a data _null_ step; such steps do not write a dataset anywhere.

To use a hash, you need to create a new dataset:

data allcarsdataset_new;
set allCarsDataset;
/* no length needed, MAKE is already on input dataset */
if _n_ = 1 then do;
  declare hash B(dataset:"acuraDataset");
  B.definekey("Make");
  B.definedone();
end;
if b.check() ne 0; /* key not found */
run;

Since you only have a key, no DEFINEDATA method is necessary, and you use the CHECK method instead of FIND.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

spastranas · Posted 08-14-2020 11:09 AM

This is helpful. I am evaluating how to actually modify the dataset, but re-write is an option. Thank you for your reply!

Tom · Posted 08-11-2020 10:13 AM

So you want to modify your source dataset? That is not normally something I would do with SAS. I would use some type of database system to manage data that can be modified and just use SAS to do my analysis of that data.

But if you change the dataset in place it should work faster than trying to copy the subset to a new dataset.

Look at the MODIFY statement. https://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n0g9jfr4x5hgsfn17gtma5547lt1.htm&doc...

s_lassen · Posted 08-12-2020 02:43 AM

If I understand the hash code you presented correctly, you want to delete records on the large table where the variable MAKE is on the small table. Hashing the smaller table may speed things up a bit, if it saves you a sort of the large table. A bigger saving may be had by not rewriting the large table, but just delete the records you do not want.

The hash solution you are looking for can be done with a MODIFY statement, meaning that the large table is not completely rewritten:

data WORK.allCarsDataset;
  modify WORK.allCarsDataset;
  if _n_=1 then do;
    declare hash B(dataset:"acuraDataset");
    B.definekey("Make");
    B.definedata("Make");
    B.definedone();
    end;
  rc2=B.find();
  if rc2=0 then remove;
run;

But I do not think that would be faster than a similar SQL solution:

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset where make=allCarsDataset.make);
quit;

I used an EXISTS clause rather than specifying "where make in(select make from acuraDataset)", because the EXISTS condition is easily modified to use more than one key variable, e.g.

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset 
                 where make=allCarsDataset.make and model=allCarsDataset.model
                    );
quit;

If you are not rewriting the large table often, but appending/deleting smaller amounts of data, it may pay off to put an index on the relevant set of keys, e.g.

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

Using an index, the SQL solution is sure to be faster than the hash solution, as the hash solution still reads all of the large table.

Once in a while, you should then rewrite the large table and recreate the index. You can just use a simple datastep, but it may be more efficient to sort the table by the index variables (you can schedule the rebuild to run when there is time for it):

proc sort data=allCarsDataset force;
  by make model;
run;

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

spastranas · Posted 08-14-2020 11:11 AM

This is very helpful. Thank you!

s_lassen · Posted 08-17-2020 12:15 PM

Happy to hear that it was useful.

Just curious: what solution did you end up implementing?

spastranas · Posted 08-19-2020 11:25 AM

I will set up a maintenance task that re-creates the table with the index, in a regular basis. I have not decided how often, I will have to see how fast the performance degrades.

For the delete, I will just use a Proc sql delete statement, but the modify performed just as fast. My job will go from 2 hours to 15 minutes.

However the maintenance task will take from 30 to 60 mins, depending on how busy is the server. Still a huge gain even if I recreated it every time.

Thank you again.

Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

Re: Delete data from a huge dataset using hash

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away