Desktop productivity for business analysts and programmers

what can be used instead of proc update for a big data set?

Reply
New Contributor
Posts: 4

what can be used instead of proc update for a big data set?

when i use proc update to update 12 fields in a 5.5M data set, it takes a long time.

Any suggestions to replace proc update?

 

Community Manager
Posts: 2,889

Re: what can be used instead of proc update for a big data set?

Do you mean PROC SQL with the UPDATE statement?  You might want to share some more details about your data scheme and how you're having to update records within it.

New Contributor
Posts: 4

Re: what can be used instead of proc update for a big data set?

yes. It is a proc sql.

my data has 475 variables and 5.5 Million observations. I am interested in updating 11 variables for only 65,000 observations

Super User
Super User
Posts: 7,727

Re: what can be used instead of proc update for a big data set?

I don't have anything to test this on, but basic datastep has an update statement as well:

data want;
  update <master_dataset> <update_dateset>;
  by <id_variables>;
run;

This should run far quicker than SQL - the SQL in SAS tends to be quite resource intensive.  

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm

Trusted Advisor
Posts: 1,116

Re: what can be used instead of proc update for a big data set?

[ Edited ]

I created test data:

  • dataset BIG with 475 variables (one numeric ID with length 8 and 474 character variables of length 1, populated with random letters) and 5.5 million observations (sorted by ID, file size 2.5 GB)
  • dataset TA (for transaction) with 65000 obs. and 12 variables: ID with randomly selected values from BIG and 11 randomly selected character variables from BIG, populated with '#' characters and about 2% missing values, also sorted by ID.

Then I let a data step with UPDATE statement, as suggested by @RW9, and an equivalent PROC SQL step (using a LEFT JOIN and the COALESCE function, not an UPDATE statement) run three times each to create the updated dataset WANT (or WANT1, resp.).

 

Run times (in seconds):

Data step: 5.09, 5.52, 5.64

PROC SQL: 9.03, 9.47, 9.41

 

The difference is highly significant (t test, p<.0001). 

 

For another comparison, I updated BIG directly ("in place") using PROC SQL's UPDATE statement and the WHERE clause

where id in (select id from ta);

With (hard-coded) constant transaction data ('#') this was faster (average run time 4.4 s), but with a subquery pulling the real transaction data from dataset TA for only one of the 11 variables to be updated, the run time went up to about 87 s and would probably have been much higher, had this been done for the remaining 10 character variables as well.

 

Finally, I tried the MODIFY statement of the data step to update the (recreated) dataset BIG in place:

data big;
modify big ta;
by id;
run;

This step took so long that I terminated it after about 15 minutes. According to what it had accomplished up to this time, it would have taken about 8400 s to finish.

 

So, bottom line is (as expected): Run time depends heavily on the update method used.

 

@Maicemuallem: How does your PROC SQL step look like (not in detail, just the structure)?

New Contributor
Posts: 4

Re: what can be used instead of proc update for a big data set?

That was very helpful, thanks

Super User
Posts: 5,391

Re: what can be used instead of proc update for a big data set?

If you are updating all records of the table, you may find recreating the table will be faster.

This with a data step, or a SQL create table as select from...

Is it a SAS data set/table?

Any indexes/constraints in the equation?

Data never sleeps
Ask a Question
Discussion stats
  • 6 replies
  • 421 views
  • 2 likes
  • 5 in conversation