02-08-2013 09:44 AM
I'm a new SAS user, and I've inherited a bunch of code. I've noticed that running many programs take a very long time to run, but utilize almost no CPU or RAM (which I have plenty of). I think much of the code could be rewritten in a more efficient manner, perhaps some will have suggestions?
Here's an example of a macro called for a very large dataset (440 fields, approx 6.7gb uncompressed):
%let dt = %Sysfunc(InputN(01JAN2005, Date9));
%let currdt = %Sysfunc(InputN(&currdt, Date9));
%let loop = %sysfunc(intck(month,&dt,&currdt));
%do i=0 %to &loop;
%let dt2 = %Sysfunc(INTNX(MONTH,&dt,&i));
%let year = %Sysfunc(year(&dt2));
%let mth = %Sysfunc(month(&dt2));
%let mth2 = %sysfunc(putN(&mth,mon.));
%let mth3 = %sysfunc(putN(&mth,z2.));
%let st&i = DATEFIELD_&mth2._&year;
%let tm&i = &year.&mth3.01;
RESULTDATE = .;
%do j=0 %to &loop;
if(&&st&j = '2') then do; RESULTDATE = input("&&tm&j",yymmdd8.); goto out; end;
format RESULTDATE yymmdd10.;
There are 3 other macros run immediately after this that are identical, except each looks for a "3" or "4" or "5" and respectively puts the result in a new date field.
What are some more efficient techniques for achieving the same result? I don't quite know how the buffer in SAS works, but maybe I can check for 3,4,5's while the record remains in memory, as opposed to re-reading the complete file from the HD for each macro run? Cutting the fields down would also definitely help, as I imagine would putting the results into a new dataset (with only a key and the results) and then joining the results onto the the main dataset, so the main file is only updated once.
What would be the best way to apply an update, given a transaction set? Normally with SQL I'd use an inner join in my update statement, but SAS doesn't seem to support this? and the suggested work around:
runs a lot slower than I'd like.
I haven't tried it yet with this macro, but elsewhere I've noticed that compressing the dataset really helps performance. Any other recommendations? I have 16gb of ram to play with, and SASFILE looked like an interesting feature, but I'm very new to SAS and haven't quite figured out how best I could possibly implement this feature.
02-08-2013 10:09 AM
Seems like you could combine the 3 other macros into the one update. That way, you're only reading the data set once. If you have enough disk space, I think it would be faster to create a new dataset rather than editing the current one in place. SAS is probably having to do a lot of shuffling to maintain the row when you add variables..and then you do it 4 times.
02-08-2013 10:26 AM
Right. I think there's a lot of improvement that can be done. This code is just a small sample of a much larger program, but its sort of emblematic of inefficiencies in the code.
One thing about the process, the result field is not being added here, this is run every month and should just be an update from last month's value. In this case, would it still be more optimal to create a new dataset and delete the old one?
02-08-2013 10:54 AM
If the RESULTDATE already exists in the data set, then SAS isn't doing what I thought it was. Best practice would be to have what ever updates your going to make happen in one pass if possible.
02-08-2013 10:27 AM
You have pinpointed the reason for I/O being high. Each call to the macro has to read and write the entire data set. The calculations are not that intense. If I inherited that code, I would definitely take DBailey's advice: have the macro generate loops, but restructure it so that all three sets of loops could be programmed in one DATA step. Unfortunately, being new to SAS, I think that approach might be too complex. Another approach would suit you better.
Generate a copy of the program, without any macro language. The first step to doing that is to make sure the correct option is in place:
In that case, the simple but tedious method to do that would be to edit the .log file. The more complex but automated way would be to make sure this option is also in place:
Then define MFILE to be a file that will hold the generated SAS code. (It will take a little bit of study on your part, but not a whole lot, to put this in place.)
Once you have the SAS program that a single call to the macro would produce, it would be fairly straightforward to repeat/edit some blocks of the code so that a single DATA step would accomplish the task.
Like I said, I wouldn't do it that way. But when both SAS and macro language are new, you need to do this to go through the learning process. You have to be able to envision what the SAS code will look like if you're going to write macros that generate the proper SAS code.
02-08-2013 10:36 AM
Thanks, I didn't realize there was an automatic way of regenerating the code without macros. Knowing that is definitely helpful.
Here's another question: Even when this is a simple update, is SAS really rewriting the entire dataset?
This would seem terribly inefficient, I thought the point of defining variables (fields) was so that space could be allocated for them, so the whole dataset didn't need restructured.
02-08-2013 10:47 AM
Yes, SAS is rewriting the entire data set. Technically, re-using the data set name does not affect I/O or space requirements. SAS will write the newly generated version to blank space on disk and, assuming that the DATA step completes successfully, will then designate the new version as the right one and the old version as reusable space. So the space available once the DATA step completes will be affected, but not the space requirements to run the program.
Technically, SAS does have a couple of ways to update a data set in place (some in PROC SQL, and the MODIFY statement in a DATA step). If you are only changing the value of RESULTDATE each time, it is possible that MODIFY might help. I'm just not familiar enough with it to say. (You also indicated that there is more to the macro than what we see here, so it is possible that the other functionality would preclude using MODIFY.)
02-08-2013 10:43 AM
I'm not sure but it sure looks like this is an awful lot of work to parse the variable name and create a date that matches from the name. Does that sound like what this code is supposed to accomplish?
02-08-2013 04:00 PM
You need to figure out what that monster is actually trying to do. Then you could probably do it without any macro code at all.
It looks like you are scanning through a series of variables and looking for specific value and when it is found the program sets some value into the newly generated variable. It is not clear to me what that value you is, but it feels like it is the date that was built into the variable name.
This is a process that should use an array. Then the logic with the INTNX can be used to calculate the indexed into the array. It looks like your variables are named such that it might require extra effort to generate the names of the variables to include in the array.
But let's say you had nice variable names then your code might look something like this.
* Make some dummy data ;
array months DATEFIELD_200501 DATEFIELD_201205 DATEFIELD_201301 ;
input _n_ @@;
1 2 3 2 1
array months DATEFIELD_200501 -- DATEFIELD_201301 ;
RESULT_DATE = . ;
format result_date yymmdd10. ;
do i=1 to dim(months) while (missing(result_date));
if months(i) = 2 then
result_date = input(substr(vname(months(i)),11,6)||'01',yymmdd8.)
DATEFIELD_ DATEFIELD_ DATEFIELD_ RESULT_
Obs 200501 201205 201301 DATE i
1 2 . . 2005-01-01 2
2 . 2 . 2012-05-01 3
3 . . 2 2013-01-01 4
4 . 2 . 2012-05-01 3
5 2 . . 2005-01-01 2