BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
spastranas
Fluorite | Level 6

I have to delete row from a huge table. Currently I am doing it with a merge, but it is extremely slow. Is there a way to delete it using has tables?

my code does not fail, but it does not delete the rows.

 

Here is what I am traying to do, using as an example a dataset from sashelp.

 

* My Production dataset will have about a billion rows;

data allCarsDataset;

set sashelp.cars;

data acuraDataset;

set sashelp.cars (where=(make='Acura'));

run;

 

*I need to delete all records from the AllCarsDataset, if they exist on the second dataset;

data _null_;

set WORK.allCarsDataset;

length Make $8;

if _n_=1 then do;

 

 

declare hash B(dataset:"acuraDataset");

B.definekey("Make");

B.definedata("Make");

 

B.definedone();

 

end;

 

 

rc2=B.find();

if rc2=0 then delete;

 

 

run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
s_lassen
Meteorite | Level 14

If I understand the hash code you presented correctly, you want to delete records on the large table where the variable MAKE is on the small table. Hashing the smaller table may speed things up a bit, if it saves you a sort of the large table. A bigger saving may be had by not rewriting the large table, but just delete the records you do not want.

 

The hash solution you are looking for can be done with a MODIFY statement, meaning that the large table is not completely rewritten:

 

data WORK.allCarsDataset;
  modify WORK.allCarsDataset;
  if _n_=1 then do;
    declare hash B(dataset:"acuraDataset");
    B.definekey("Make");
    B.definedata("Make");
    B.definedone();
    end;
  rc2=B.find();
  if rc2=0 then remove;
run;

 

 

But I do not think that would be faster than a similar SQL solution:

 

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset where make=allCarsDataset.make);
quit;

 

 

I used an EXISTS clause rather than specifying "where make in(select make from acuraDataset)", because the EXISTS condition is easily modified to use more than one key variable, e.g.

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset 
                 where make=allCarsDataset.make and model=allCarsDataset.model
                    );
quit;

 

If you are not rewriting the large table often, but appending/deleting smaller amounts of data, it may pay off to put an index on the relevant set of keys, e.g.

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

Using an index, the SQL solution is sure to be faster than the hash solution, as the hash solution still reads all of the large table.

 

Once in a while, you should then rewrite the large table and recreate the index. You can just use a simple datastep, but it may be more efficient to sort the table by the index variables (you can schedule the rebuild to run when there is time for it):

proc sort data=allCarsDataset force;
  by make model;
run;

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

 

 

 

 

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

You run a data _null_ step; such steps do not write a dataset anywhere.

To use a hash, you need to create a new dataset:

data allcarsdataset_new;
set allCarsDataset;
/* no length needed, MAKE is already on input dataset */
if _n_ = 1 then do;
  declare hash B(dataset:"acuraDataset");
  B.definekey("Make");
  B.definedone();
end;
if b.check() ne 0; /* key not found */
run;

Since you only have a key, no DEFINEDATA method is necessary, and you use the CHECK method instead of FIND.

spastranas
Fluorite | Level 6

This is helpful. I am evaluating how to actually modify the dataset, but re-write is an option. Thank you for your reply!

Tom
Super User Tom
Super User

So you want to modify your source dataset?  That is not normally something I would do with SAS.  I would use some type of database system to manage data that can be modified and just use SAS to do my analysis of that data.

 

But if you change the dataset in place it should work faster than trying to copy the subset to a new dataset.  

Look at the MODIFY statement. https://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n0g9jfr4x5hgsfn17gtma5547lt1.htm&doc...

s_lassen
Meteorite | Level 14

If I understand the hash code you presented correctly, you want to delete records on the large table where the variable MAKE is on the small table. Hashing the smaller table may speed things up a bit, if it saves you a sort of the large table. A bigger saving may be had by not rewriting the large table, but just delete the records you do not want.

 

The hash solution you are looking for can be done with a MODIFY statement, meaning that the large table is not completely rewritten:

 

data WORK.allCarsDataset;
  modify WORK.allCarsDataset;
  if _n_=1 then do;
    declare hash B(dataset:"acuraDataset");
    B.definekey("Make");
    B.definedata("Make");
    B.definedone();
    end;
  rc2=B.find();
  if rc2=0 then remove;
run;

 

 

But I do not think that would be faster than a similar SQL solution:

 

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset where make=allCarsDataset.make);
quit;

 

 

I used an EXISTS clause rather than specifying "where make in(select make from acuraDataset)", because the EXISTS condition is easily modified to use more than one key variable, e.g.

proc sql;
  delete from allCarsDataset
  where exists (select * from acuraDataset 
                 where make=allCarsDataset.make and model=allCarsDataset.model
                    );
quit;

 

If you are not rewriting the large table often, but appending/deleting smaller amounts of data, it may pay off to put an index on the relevant set of keys, e.g.

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

Using an index, the SQL solution is sure to be faster than the hash solution, as the hash solution still reads all of the large table.

 

Once in a while, you should then rewrite the large table and recreate the index. You can just use a simple datastep, but it may be more efficient to sort the table by the index variables (you can schedule the rebuild to run when there is time for it):

proc sort data=allCarsDataset force;
  by make model;
run;

proc sql;
  create index idx on allCarsDataset(make,model);
quit;

 

 

 

 

spastranas
Fluorite | Level 6

This is very helpful. Thank you!

s_lassen
Meteorite | Level 14

Happy to hear that it was useful.

 

Just curious: what solution did you end up implementing?

spastranas
Fluorite | Level 6

I will set up a maintenance task that re-creates the table with the index, in a regular basis. I have not decided how often, I will have to see how fast the performance degrades.

For the delete, I will just use a Proc sql delete statement, but the modify performed just as fast. My job will go from 2 hours to 15 minutes.

However the maintenance task  will take from 30 to 60 mins, depending on how busy is the server. Still a huge gain even if I recreated it every time.

Thank you again. 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1314 views
  • 3 likes
  • 4 in conversation