I have to delete row from a huge table. Currently I am doing it with a merge, but it is extremely slow. Is there a way to delete it using has tables?
my code does not fail, but it does not delete the rows.
Here is what I am traying to do, using as an example a dataset from sashelp.
* My Production dataset will have about a billion rows;
data allCarsDataset;
set sashelp.cars;
data acuraDataset;
set sashelp.cars (where=(make='Acura'));
run;
*I need to delete all records from the AllCarsDataset, if they exist on the second dataset;
data _null_;
set WORK.allCarsDataset;
length Make $8;
if _n_=1 then do;
declare hash B(dataset:"acuraDataset");
B.definekey("Make");
B.definedata("Make");
B.definedone();
end;
rc2=B.find();
if rc2=0 then delete;
run;
If I understand the hash code you presented correctly, you want to delete records on the large table where the variable MAKE is on the small table. Hashing the smaller table may speed things up a bit, if it saves you a sort of the large table. A bigger saving may be had by not rewriting the large table, but just delete the records you do not want.
The hash solution you are looking for can be done with a MODIFY statement, meaning that the large table is not completely rewritten:
data WORK.allCarsDataset;
modify WORK.allCarsDataset;
if _n_=1 then do;
declare hash B(dataset:"acuraDataset");
B.definekey("Make");
B.definedata("Make");
B.definedone();
end;
rc2=B.find();
if rc2=0 then remove;
run;
But I do not think that would be faster than a similar SQL solution:
proc sql;
delete from allCarsDataset
where exists (select * from acuraDataset where make=allCarsDataset.make);
quit;
I used an EXISTS clause rather than specifying "where make in(select make from acuraDataset)", because the EXISTS condition is easily modified to use more than one key variable, e.g.
proc sql;
delete from allCarsDataset
where exists (select * from acuraDataset
where make=allCarsDataset.make and model=allCarsDataset.model
);
quit;
If you are not rewriting the large table often, but appending/deleting smaller amounts of data, it may pay off to put an index on the relevant set of keys, e.g.
proc sql;
create index idx on allCarsDataset(make,model);
quit;
Using an index, the SQL solution is sure to be faster than the hash solution, as the hash solution still reads all of the large table.
Once in a while, you should then rewrite the large table and recreate the index. You can just use a simple datastep, but it may be more efficient to sort the table by the index variables (you can schedule the rebuild to run when there is time for it):
proc sort data=allCarsDataset force;
by make model;
run;
proc sql;
create index idx on allCarsDataset(make,model);
quit;
You run a data _null_ step; such steps do not write a dataset anywhere.
To use a hash, you need to create a new dataset:
data allcarsdataset_new;
set allCarsDataset;
/* no length needed, MAKE is already on input dataset */
if _n_ = 1 then do;
declare hash B(dataset:"acuraDataset");
B.definekey("Make");
B.definedone();
end;
if b.check() ne 0; /* key not found */
run;
Since you only have a key, no DEFINEDATA method is necessary, and you use the CHECK method instead of FIND.
This is helpful. I am evaluating how to actually modify the dataset, but re-write is an option. Thank you for your reply!
So you want to modify your source dataset? That is not normally something I would do with SAS. I would use some type of database system to manage data that can be modified and just use SAS to do my analysis of that data.
But if you change the dataset in place it should work faster than trying to copy the subset to a new dataset.
Look at the MODIFY statement. https://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n0g9jfr4x5hgsfn17gtma5547lt1.htm&doc...
If I understand the hash code you presented correctly, you want to delete records on the large table where the variable MAKE is on the small table. Hashing the smaller table may speed things up a bit, if it saves you a sort of the large table. A bigger saving may be had by not rewriting the large table, but just delete the records you do not want.
The hash solution you are looking for can be done with a MODIFY statement, meaning that the large table is not completely rewritten:
data WORK.allCarsDataset;
modify WORK.allCarsDataset;
if _n_=1 then do;
declare hash B(dataset:"acuraDataset");
B.definekey("Make");
B.definedata("Make");
B.definedone();
end;
rc2=B.find();
if rc2=0 then remove;
run;
But I do not think that would be faster than a similar SQL solution:
proc sql;
delete from allCarsDataset
where exists (select * from acuraDataset where make=allCarsDataset.make);
quit;
I used an EXISTS clause rather than specifying "where make in(select make from acuraDataset)", because the EXISTS condition is easily modified to use more than one key variable, e.g.
proc sql;
delete from allCarsDataset
where exists (select * from acuraDataset
where make=allCarsDataset.make and model=allCarsDataset.model
);
quit;
If you are not rewriting the large table often, but appending/deleting smaller amounts of data, it may pay off to put an index on the relevant set of keys, e.g.
proc sql;
create index idx on allCarsDataset(make,model);
quit;
Using an index, the SQL solution is sure to be faster than the hash solution, as the hash solution still reads all of the large table.
Once in a while, you should then rewrite the large table and recreate the index. You can just use a simple datastep, but it may be more efficient to sort the table by the index variables (you can schedule the rebuild to run when there is time for it):
proc sort data=allCarsDataset force;
by make model;
run;
proc sql;
create index idx on allCarsDataset(make,model);
quit;
This is very helpful. Thank you!
Happy to hear that it was useful.
Just curious: what solution did you end up implementing?
I will set up a maintenance task that re-creates the table with the index, in a regular basis. I have not decided how often, I will have to see how fast the performance degrades.
For the delete, I will just use a Proc sql delete statement, but the modify performed just as fast. My job will go from 2 hours to 15 minutes.
However the maintenance task will take from 30 to 60 mins, depending on how busy is the server. Still a huge gain even if I recreated it every time.
Thank you again.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.