12-02-2014 10:03 AM
I am trying to tune some SAS code that takes close to 9 hours to complete. The code is a SAS macro. The macro runs in a loop: it runs 15 times for every record, and there are over 7 million records. It has over 110,000 lines of code. The macro has lots of labels with if and goto statements.
Here is a sample of the code from the macro:
if TEST <= .z then goto A0_7;
if TEST < 34343.7 then goto A0_2;
else goto A0_7;
if TEST1 <= .z then goto A0_6;
if TEST1 < 343.22 then goto A0_3;
else goto A0_6;
The entire macro is built from code like this (if statements, goto statements and labels). I believe it is used to perform some kind of statistical imputation.
Can someone suggest an approach for tuning code like this?
Thanks in advance.
12-02-2014 10:35 AM
Delete it and start again. Although if you don't actually know what it's for ("I believe it has been used to achieve some kind of statistical imputation."), I'm not sure how you would start. It's a general problem, I see it a lot. "Here's a whole load of macros someone wrote back in 1970 on SAS1 with no peer review or documentation, have fun trying to get it working."
Maybe check your input datasets, work out what the 'TESTs' are, you should be able to make all those ifs and gotos disappear by putting it in a datastep. However 110,000 lines of code and no documentation -> look for the highest window.
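To illustrate, the first pair of tests in the posted sample could collapse into one structured IF/ELSE. This is only a sketch: the post does not show what labels A0_2 and A0_7 actually do, so the BRANCH variable and the WORK.DEMO dataset are hypothetical placeholders. Note that `TEST <= .z` is true for every numeric missing value, because .z sorts highest among the missing values, so it can be written as missing(TEST).

```sas
/* Sketch only: BRANCH stands in for whatever the A0_2 / A0_7 labels really do. */
data work.demo;
  length branch $4;
  do TEST = ., .z, 100, 50000;   /* a few values to exercise both paths */
    /* Original: if TEST <= .z then goto A0_7;
                 if TEST < 34343.7 then goto A0_2; else goto A0_7;       */
    if not missing(TEST) and TEST < 34343.7 then branch = 'A0_2';
    else branch = 'A0_7';
    output;
  end;
run;

proc print data=work.demo;
run;
```

Only the value 100 should land in the A0_2 branch; the two missing values and 50000 take the A0_7 path, as in the original goto logic.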
12-02-2014 10:56 AM
In a nutshell, you are probably looking at a statistical model that uses a branching approach. There is probably little you can do to speed it up, but here are a few ideas.
The DATA step can probably be pre-compiled and saved. That won't make much difference ... probably 1 minute. It's a better strategy when you are scoring just a handful of records.
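For reference, a minimal sketch of a stored compiled DATA step program. The dataset names, the program name SCORE_PGM and the one-line FLAG assignment are hypothetical stand-ins for the real input and the generated model code.

```sas
/* Tiny hypothetical input so the sketch runs on its own. */
data work.records;
  do TEST = 100, 50000;
    output;
  end;
run;

/* Step 1: compile once and store the program. Nothing is executed here. */
data work.scored / pgm=work.score_pgm;
  set work.records;
  flag = (TEST < 34343.7);   /* placeholder for the generated model code */
run;

/* Step 2: execute the stored program without recompiling the source. */
data pgm=work.score_pgm;
run;
```

In a real job the stored program would live in a permanent library rather than WORK, so that later sessions can execute it.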
Get faster hardware. 9 hours seems like a lot for 7M observations, even when repeating 15 times.
Examine why the code needs to repeat 15 times for each observation. There may be some shortcuts to apply there (at least it seems possible).
The RW9 approach (delete it and start again) is actually a possibility. Software that develops this sort of code can be tailored to limit and/or remove some of the tree branches so you could end up with fewer lines of code.
12-02-2014 11:08 AM
Thanks for the suggestions.
One of my colleagues suggested that the model code can be run inside a database (Greenplum in this case) using PROC HPDS2 to improve performance. I am not sure how labels, if conditions and goto statements can be rewritten inside PROC HPDS2, as I am quite new to the AA environment. Can someone point me to links or references that explain how normal SAS statements can be put inside PROC HPDS2? Thanks.
12-02-2014 11:53 AM
I think your colleague just hit the nail on the head. Greenplum, along with Teradata and Hadoop, is one of the few databases that support SAS high-performance procedures. If you can do in-database processing, then do it in a heartbeat.
This has a chapter on Greenplum:
This is about HP STAT:
This has a chapter on HPDS2:
12-02-2014 12:35 PM
Thanks Hai Kuo. I will start diving into the links. Before I do that, I have one question. Is PROC DS2 different from PROC HPDS2? I have many reference programs that use PROC DS2 within our own system. I am not sure if DS2 does the same thing as HPDS2. If you could clarify, it would help me relate things to our system.
12-02-2014 01:11 PM
In terms of performance, my impression (could be off) is that if you run them on a single machine, the difference between these two is minimal. However, if you are running in so-called distributed mode, for example in-database processing in your case, the HP one will edge out, but make sure you have the license for that (single-machine mode does not require an additional license AFAIK). Code-wise, I would assume they are the same, but I don't really know.
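For what it's worth, a minimal PROC HPDS2 sketch, run here in single-machine mode against WORK datasets. The dataset names and the BRANCH variable are hypothetical. For an in-database run, DATA= and OUT= would presumably point at a Greenplum libref and a PERFORMANCE statement would describe the grid; those details are site-specific and left out. DS2GTF.in and DS2GTF.out are the fixed names HPDS2 uses for its input and output inside the DS2 program.

```sas
/* Tiny hypothetical input so the sketch runs on its own. */
data work.records;
  do TEST = 100, 50000;
    output;
  end;
run;

proc hpds2 data=work.records out=work.scored;
  data DS2GTF.out;
    dcl char(4) branch;          /* only the new variable is declared here */
    method run();
      set DS2GTF.in;             /* input variables arrive via SET */
      if not missing(TEST) and TEST < 34343.7 then branch = 'A0_2';
      else branch = 'A0_7';
    end;
  enddata;
run;
```

The same DATA ... ENDDATA body, with real table names in place of DS2GTF.in and DS2GTF.out, is roughly what a PROC DS2 program looks like, so the two should be worth comparing on a small test table before porting the full model.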
12-02-2014 02:28 PM
The code you are supporting is possibly a statistical simulation which could be doing probabilities. If this is the case then you can get statistically valid results by sampling your base data instead of doing every row. This would be way faster.
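A minimal sketch of drawing a simple random sample with PROC SURVEYSELECT (part of SAS/STAT). The dataset names, the 10% rate and the seed are assumptions for illustration, not values from the original job.

```sas
/* Hypothetical stand-in for the 7 million row base table. */
data work.records;
  do id = 1 to 1000;
    TEST = id * 50;
    output;
  end;
run;

/* Simple random sample without replacement; SEED= makes it repeatable. */
proc surveyselect data=work.records out=work.sample
                  method=srs samprate=0.10 seed=20141202;
run;
```

The model code would then run against WORK.SAMPLE instead of the full table.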
Also PROC HPDS2 is very new to SAS and in some cases users have reported it is slower than base SAS for the task they want to do. You would have to code and test very carefully to ensure that it is indeed faster and gives you the same results.
Another possibility could be to run your SAS process in parallel across different chunks of your 7 million records. For example, SAS process 1 could do the first 3.5 million, SAS process 2 could do the second 3.5 million. They would both run at the same time, and then you combine results when they are both finished. This sounds to me worth checking out as it requires minimum code change. You could run 4 or more processes in parallel to speed things up even more.
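A rough sketch of the chunking idea, written as one program that each parallel batch session runs with a different chunk number. The librefs, paths, dataset names and the %model macro call are hypothetical placeholders for the real job.

```sas
/* score_chunk.sas: sketch only; librefs, paths, dataset names and %model are hypothetical. */
libname in  '/path/to/input';    /* placeholder path */
libname out '/path/to/output';   /* placeholder path */

%let nchunks = 4;
%let chunk   = &sysparm;         /* 1..4, from: sas score_chunk.sas -sysparm 1 */

/* Row count of the input, read from the dataset header. */
%let dsid = %sysfunc(open(in.records));
%let nobs = %sysfunc(attrn(&dsid, nlobs));
%let rc   = %sysfunc(close(&dsid));

/* First and last observation number for this chunk. */
%let size  = %sysevalf(&nobs / &nchunks, ceil);
%let first = %eval((&chunk - 1) * &size + 1);
%let last  = %eval(&chunk * &size);

data out.scored_&chunk;
  set in.records(firstobs=&first obs=&last);
  %model                         /* the existing generated code, unchanged */
run;
```

Each session is started with its own -sysparm value, and once all of them finish, the OUT.SCORED_1 to OUT.SCORED_4 datasets are concatenated with a single SET statement. This assumes each record is scored independently of the others.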
12-02-2014 03:28 PM
I kinda figured out how PROC HPDS2 works from the reference materials that you provided. Thanks again for that. However, I see that the HPDS2 procedure works even if the destination library is a Unix directory instead of a Greenplum schema. To see the performance improvement, does it have to be run with a Greenplum destination, or would a local destination be sufficient? Is it the procedure or the database that does the trick w.r.t. performance? Also, I don't have a handle on the variables that are used in the model, so I cannot write a DCL statement with all the variable names and their corresponding types. How can that be addressed in the procedure?
I am yet to apply the procedure on actual code. Just wanted to make sure of this before I start making changes to my code.
12-02-2014 02:54 PM
If this is code generated by the scoring node of Enterprise Miner, think about translating it into another language (C/Java) or using in-database scoring as an accelerator.