Re: what can be used instead of proc update for a big data set?

Maicemuallem · Posted 01-21-2016 09:10 AM

when i use proc update to update 12 fields in a 5.5M data set, it takes a long time.

Any suggestions to replace proc update?

ChrisHemedinger · Posted 01-21-2016 09:12 AM

Do you mean PROC SQL with the UPDATE statement? You might want to share some more details about your data scheme and how you're having to update records within it.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!

Maicemuallem · Posted 01-21-2016 09:23 AM

yes. It is a proc sql.

my data has 475 variables and 5.5 Million observations. I am interested in updating 11 variables for only 65,000 observations

RW9 · Posted 01-21-2016 10:36 AM

I don't have anything to test this on, but basic datastep has an update statement as well:

data want;
  update <master_dataset> <update_dateset>;
  by <id_variables>;
run;

This should run far quicker than SQL - the SQL in SAS tends to be quite resource intensive.

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm

FreelanceReinh · Posted 01-21-2016 01:48 PM

I created test data:

dataset BIG with 475 variables (one numeric ID with length 8 and 474 character variables of length 1, populated with random letters) and 5.5 million observations (sorted by ID, file size 2.5 GB)
dataset TA (for transaction) with 65000 obs. and 12 variables: ID with randomly selected values from BIG and 11 randomly selected character variables from BIG, populated with '#' characters and about 2% missing values, also sorted by ID.

Then I let a data step with UPDATE statement, as suggested by @RW9, and an equivalent PROC SQL step (using a LEFT JOIN and the COALESCE function, not an UPDATE statement) run three times each to create the updated dataset WANT (or WANT1, resp.).

Run times (in seconds):

Data step: 5.09, 5.52, 5.64

PROC SQL: 9.03, 9.47, 9.41

The difference is highly significant (t test, p<.0001).

For another comparison, I updated BIG directly ("in place") using PROC SQL's UPDATE statement and the WHERE clause

where id in (select id from ta);

With (hard-coded) constant transaction data ('#') this was faster (average run time 4.4 s), but with a subquery pulling the real transaction data from dataset TA for only one of the 11 variables to be updated, the run time went up to about 87 s and would probably have been much higher, had this been done for the remaining 10 character variables as well.

Finally, I tried the MODIFY statement of the data step to update the (recreated) dataset BIG in place:

data big;
modify big ta;
by id;
run;

This step took so long that I terminated it after about 15 minutes. According to what it had accomplished up to this time, it would have taken about 8400 s to finish.

So, bottom line is (as expected): Run time depends heavily on the update method used.

@Maicemuallem: How does your PROC SQL step look like (not in detail, just the structure)?

Maicemuallem · Posted 01-21-2016 04:02 PM

That was very helpful, thanks

LinusH · Posted 01-21-2016 09:17 AM

If you are updating all records of the table, you may find recreating the table will be faster.

This with a data step, or a SQL create table as select from...

Is it a SAS data set/table?

Any indexes/constraints in the equation?

Data never sleeps

what can be used instead of proc update for a big data set?