nbonda
Obsidian | Level 7

Hi, I have a master dataset. To update this master dataset with a transaction dataset, I am using the following code.

 

Creating the empty dataset (one time):

  proc sql;
  create table ma.master
  ( ID         num       informat=8.       format=8.          label="label_invoiceid",
    Created_DT num       informat=mdyampm. format=datetime20. label="create_dt",
    COMMENTS   char(100) informat=$100.    format=$100.       label="comments"
  );
  quit;

 

Now I want to update this master dataset with the transaction dataset every day. I am using the following code:

 

data ma.master;
  modify ma.master tr.tras_data;
  by id;
  /* %sysrc(_DSENMR) = 1230013: the transaction observation was not found in the master */
  if _iorc_ eq %sysrc(_DSENMR) then do;
    _error_ = 0;  /* clear the error flag set when the key is not found */
    output;
  end;
  else replace;
run;

 

I have 85,000 records the first time, and every day I will have 30,000 records in the transaction dataset.

It is taking more than 2 hours to complete. Why is it running so slowly? Please suggest anything I should modify in my code.

 

Thank you in advance

 

  

1 ACCEPTED SOLUTION

Accepted Solutions
jakarman
Barite | Level 11

"When you use a BY statement with the MODIFY statement, the DATA step uses dynamic WHERE processing to find observations in the master data set. Neither the master data set nor the transaction data set needs to be sorted. For large data sets, however, sorting the data before you modify it can enhance performance significantly."


http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001772743.htm
Where is your REPLACE statement?

To optimize the dynamic WHERE processing, think of trying indexes (random I/O).

---->-- ja karman --<-----


9 REPLIES
RW9
Diamond | Level 26

Why do you have all the additional if statements checking some automatic variable?

IF _iorc_ eq 1230013 then
OUTPUT;
else do;
REPLACE;
end;

It should just be an update by certain variables.  You should perhaps read the manual section on updating datasets:

http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001304308.htm

nbonda
Obsidian | Level 7

Hi jakarman,

 

How can I create an index on the master table?

Thank you

 

jakarman
Barite | Level 11

Updating in place is very useful when:

a/ you have a big dataset and a small number of updates
b/ you use it with SAS/SHARE, allowing views/updates from other places at the same moment.
You can see this usage with Enterprise Guide / Base SAS when updating a dataset interactively.


The creation of indexes can be done:

1/ on an existing dataset using "proc datasets"
 http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n0mfav25learpan1lerk7...

 You can even define constraints and an audit dataset.

2/ as a dataset option when creating the dataset

http://support.sas.com/documentation/cdl/en/ledsoptsref/68025/HTML/default/viewer.htm#n1aaucjme18e43...

Sometimes it is better to remove all indexes and to recreate them after having refilled all records.
Adding records to an existing dataset with indexes will require an update of the index for each record.
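A minimal sketch of both approaches, using the library and dataset names from the original post (which are assumptions on my part):

```sas
/* 1. Add an index to an existing dataset with PROC DATASETS */
proc datasets library=ma nolist;
   modify master;
   index create id;
quit;

/* 2. Create the index up front with the INDEX= dataset option */
data ma.master(index=(id));
   set ma.master;
run;
```

With the index on ID in place, the dynamic WHERE lookups done by MODIFY with BY become indexed (random I/O) reads instead of full scans.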

Performance in processing comes from knowing/understanding what will happen inside the engine room.

---->-- ja karman --<-----
Tom
Super User

For a table with only 85,000 observations, it is not worth the headache of trying to modify the table in place.

Just replace the table.  Look into using the regular old UPDATE statement for applying your transactions.

 

data ma.master;
  attrib ID length=8 label='Invoice ID';
  attrib Created_DT length=8 informat=mdyampm. format=datetime20. label='Datetime Created' ;
  attrib COMMENTS length=$100 label='Invoice Comments';
  stop;
  call missing(of _all_);
run;

data ma.master;
  update ma.master tr.tras_data ;
  by id;
run;

 

 

 

nbonda
Obsidian | Level 7

Tom,

Thank you for your reply. I will test the UPDATE statement as well. I checked the article; UPDATE and MODIFY perform the same function.

So is there any advantage to using the MODIFY statement over the UPDATE statement?

 

I am using the Unix platform; I forgot to mention it in the original question.

Tom
Super User

The advantages of MODIFY include much more flexibility and the ability to modify a data set in place.

Of course, those are also the disadvantages, since the flexibility makes it more complex to understand, and modifying a dataset in place is a more complex and potentially dangerous operation. So only use MODIFY if you need to.

Note also that UPDATE is a peculiar tool, and you need to understand how it works. The intent is to allow you to make changes just to specified variables, so only the non-missing values in the transaction records cause changes in the master records. (To change a value to missing, you need to use the special missing value ._ in the transaction dataset.) If your intent is for each transaction to totally replace the master record, then just use MERGE instead of UPDATE.
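A minimal sketch of that difference, with made-up work datasets purely for illustration:

```sas
data master;
   id=1; amount=100; note='original'; output;
run;

data trans;
   id=1; amount=.; note='changed'; output;
run;

/* UPDATE: the missing AMOUNT in the transaction is ignored,
   so the result keeps amount=100 and only NOTE changes */
data via_update;
   update master trans;
   by id;
run;

/* MERGE: the transaction record overwrites the master record
   variable by variable, so AMOUNT becomes missing */
data via_merge;
   merge master trans;
   by id;
run;
```

Running both steps and comparing VIA_UPDATE with VIA_MERGE shows the two behaviors side by side.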

jakarman
Barite | Level 11

Tom, the update-in-place approach of MODIFY is not weird. On the contrary, it is the basic reason for the RDBMS and SQL standard approach.
The SAS-specific way of handling data using SAS datasets is not commonly known and accepted. Using the PDV and automatic advancing is often bashed, with requests to replace it with SQL. Using the now-buzzing word "NoSQL" could help, but that is asking for a completely different mindset from somebody just starting to use SAS.

---->-- ja karman --<-----
Kurt_Bremser
Super User

Sort the data sets and use a data step with UPDATE. Then check the performance. With these small record counts, the operation should be finished in seconds.

What you run into is the fact that without sorting, the step scans all records of the master data set for every record of the update data set, basically causing 85K * 30K I/O operations.
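The sort-then-update approach can be sketched as follows, using the library and dataset names from the original post (assumptions on my part):

```sas
/* Sort the transaction data by the key; the master is already
   in ID order once it has been maintained with UPDATE ... BY */
proc sort data=tr.tras_data;
   by id;
run;

/* Apply the transactions with a sequential interleaved read
   instead of one lookup per transaction record */
data ma.master;
   update ma.master tr.tras_data;
   by id;
run;
```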


Discussion stats
  • 9 replies
  • 5239 views
  • 1 like
  • 5 in conversation