BookmarkSubscribeRSS Feed
Maicemuallem
Calcite | Level 5

when i use proc update to update 12 fields in a 5.5M data set, it takes a long time.

Any suggestions to replace proc update?

 

6 REPLIES 6
ChrisHemedinger
Community Manager

Do you mean PROC SQL with the UPDATE statement?  You might want to share some more details about your data scheme and how you're having to update records within it.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!
Maicemuallem
Calcite | Level 5

yes. It is a proc sql.

my data has 475 variables and 5.5 Million observations. I am interested in updating 11 variables for only 65,000 observations

RW9
Diamond | Level 26 RW9
Diamond | Level 26

I don't have anything to test this on, but basic datastep has an update statement as well:

data want;
  update <master_dataset> <update_dateset>;
  by <id_variables>;
run;

This should run far quicker than SQL - the SQL in SAS tends to be quite resource intensive.  

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm

FreelanceReinh
Jade | Level 19

I created test data:

  • dataset BIG with 475 variables (one numeric ID with length 8 and 474 character variables of length 1, populated with random letters) and 5.5 million observations (sorted by ID, file size 2.5 GB)
  • dataset TA (for transaction) with 65000 obs. and 12 variables: ID with randomly selected values from BIG and 11 randomly selected character variables from BIG, populated with '#' characters and about 2% missing values, also sorted by ID.

Then I let a data step with UPDATE statement, as suggested by @RW9, and an equivalent PROC SQL step (using a LEFT JOIN and the COALESCE function, not an UPDATE statement) run three times each to create the updated dataset WANT (or WANT1, resp.).

 

Run times (in seconds):

Data step: 5.09, 5.52, 5.64

PROC SQL: 9.03, 9.47, 9.41

 

The difference is highly significant (t test, p<.0001). 

 

For another comparison, I updated BIG directly ("in place") using PROC SQL's UPDATE statement and the WHERE clause

where id in (select id from ta);

With (hard-coded) constant transaction data ('#') this was faster (average run time 4.4 s), but with a subquery pulling the real transaction data from dataset TA for only one of the 11 variables to be updated, the run time went up to about 87 s and would probably have been much higher, had this been done for the remaining 10 character variables as well.

 

Finally, I tried the MODIFY statement of the data step to update the (recreated) dataset BIG in place:

data big;
modify big ta;
by id;
run;

This step took so long that I terminated it after about 15 minutes. According to what it had accomplished up to this time, it would have taken about 8400 s to finish.

 

So, bottom line is (as expected): Run time depends heavily on the update method used.

 

@Maicemuallem: How does your PROC SQL step look like (not in detail, just the structure)?

Maicemuallem
Calcite | Level 5

That was very helpful, thanks

LinusH
Tourmaline | Level 20

If you are updating all records of the table, you may find recreating the table will be faster.

This with a data step, or a SQL create table as select from...

Is it a SAS data set/table?

Any indexes/constraints in the equation?

Data never sleeps

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 3297 views
  • 3 likes
  • 5 in conversation