novinosrin
Tourmaline | Level 20

Efficiency Intuition help: Few small steps vs One mega step

 

I would like to seek your opinion and guidance on how to foresee, or think through, the goal of writing a program that executes in the fastest and most efficient manner. Many people assume that programming everything in one step is best, as opposed to the few small steps mentioned in the subject.

 

Of course I understand this topic depends heavily on the quality of the programmer; however, even in relatively simple examples, it sometimes baffles me that many small steps execute faster as a whole than a single step. How do I gain that intuition in the first place, before writing both forms and testing them only to think, "Oh well, is that all it does after all?"

 

Note, @SAS-L Hall of Famers et al. and top users: your programs are excluded from this comparison study, because I know your one-step solutions will likely win 99% of the time. Nevertheless, this is very intriguing.

8 REPLIES
art297
Opal | Level 21

@novinosrin: It would help if you could post an example where many small steps work faster than having everything in one step.

 

One way to ensure that run times will increase is to read the data more times than necessary. That is what typically occurs when one programs in multiple steps rather than doing everything in a single step. Additionally, IMHO, breaking a program into multiple steps makes it harder for someone else to follow what you've done.

 

Art, CEO, AnalystFinder.com

 

Kurt_Bremser
Super User

Personally, I prefer to work in steps. This is also covered in Maxim 34.

 

My object lesson for this came early, when I reviewed a consultant's code that merged 5 normalized tables into one dataset for analysis. His code contained more than 10 separate sort and data steps. My first thought was "I can do that in one easy SQL!", and so I did. My nice&easy SQL took 4 hours to finish; his two-page sort/merge sequence, just 20 minutes.
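For illustration, the two strategies looked roughly like this. The table names, key, and columns here are made up for the sketch, not the consultant's actual code:

```sas
/* One mega step: a single SQL join (hypothetical tables HEAD and DETAIL, keyed on ID) */
proc sql;
  create table analysis as
  select h.*, d.amount
  from head as h
  inner join detail as d
    on h.id = d.id;
quit;

/* Few small steps: sort each table once, then one sequential MERGE */
proc sort data=head;   by id; run;
proc sort data=detail; by id; run;

data analysis;
  merge head (in=h) detail (in=d);
  by id;
  if h and d;  /* keep matches only, like the inner join above */
run;
```

The sort/merge version reads and writes each table sequentially, which is the property discussed further down the thread.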

 

OTOH, one can hare off in the wrong direction:

data Asia;
  set sashelp.cars;
  where Origin = 'Asia';
run;

data Europe;
  set sashelp.cars;
  where Origin = 'Europe';
run;

vs.

data
  Asia
  Europe
;
  set sashelp.cars;
  select (Origin);
    when ('Asia')   output Asia;
    when ('Europe') output Europe;
    otherwise;
  end;
run;

The second version will clearly outperform the first, because it reads the source data only once instead of twice; and since splitting a dataset is one functional unit, it obviously does not violate Maxim 34.

Ksharp
Super User

I would pick "few small steps".

"One mega step" makes code very complicated and leads you to more ERRORs.

 

Getting the RIGHT result comes first.

LinusH
Tourmaline | Level 20
Programming is not a science, it's an art, so there will never be a single correct answer.
Personally, I first try to reduce the number of steps, as long as the logic stays reasonably clear. Where to draw the line is up to you, and perhaps up to the maintenance people who will live with the developed solution.
Data never sleeps
novinosrin
Tourmaline | Level 20

Very interesting points of view and comments from all of you. I am hoping a few more will chime in with their views, so I can draw inferences from the range of opinions.

 

Here is my professor's view, which I am sharing because he isn't on this community, nor a SAS user:

1. The constructs/building blocks should be small to very small when broken into a few small steps. This can beat heavy looping in mega steps.

2. Only use a mega step if it's not possible to have small to very small building blocks, because in essence the effort is the same.

 

For example:

 

1. In some instances: sort, create a grouping variable, and then do PROC SUMMARY/SQL (2 steps); easier to understand, with very short programs.

2. Or manage it all in one step with array feeds, hash instances (hash-of-hash, e.g.), etc., making it look brilliant and classy, although intimidating for intermediates.
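To make the contrast concrete, here is a minimal sketch of both styles against sashelp.cars. The grouping rule (big vs. small by cylinder count) and the lookup table work.region_lookup are invented for illustration:

```sas
/* Style 1: few small steps - sort, derive a grouping var, summarize */
proc sort data=sashelp.cars out=cars;
  by Origin;
run;

data cars;
  set cars;
  length SizeGroup $5;
  SizeGroup = ifc(Cylinders >= 6, 'big', 'small');  /* invented grouping rule */
run;

proc summary data=cars nway;
  class Origin SizeGroup;
  var MSRP;
  output out=avg_msrp mean=AvgMSRP;
run;

/* Style 2: one step - a hash object lookup (plain hash, not hash-of-hash,
   for brevity); WORK.REGION_LOOKUP is a hypothetical table with columns
   Origin and Region_Label */
data joined;
  length Region_Label $20;
  if _n_ = 1 then do;
    declare hash h (dataset: 'work.region_lookup');
    h.defineKey('Origin');
    h.defineData('Region_Label');
    h.defineDone();
  end;
  set sashelp.cars;
  call missing(Region_Label);
  if h.find() ne 0 then Region_Label = 'Unknown';
run;
```

The hash version avoids any sorting, at the cost of being harder for intermediates to follow, which is exactly the trade-off being discussed.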

 

The 2nd will of course be the charm of greats like all of you who have chimed in here, as I am well aware of your expertise; however, looking through the lens of a semi-technical audience makes me wonder how much of that is appreciated by those folks as a whole.

 

Thank you for your time!

ChrisNZ
Tourmaline | Level 20

1. A multitude of tiny steps is not necessarily easier to read than fewer, larger steps.

 

2. From a performance perspective, I/O is usually the killer in BI jobs, as @LinusH said, so fewer steps are generally faster (since they limit I/O), as long as the I/O is kept sequential, which was probably not the case in @Kurt_Bremser's SQL example.

 

3. Sometimes one has to weigh legibility against performance. If one optimised, sequential step takes 12 hours to run where simpler, easier-to-read steps would take twice that time, maybe it should be kept whole, even if its more convoluted data manipulation is harder to decipher.

 

4. It all depends. To quote @LinusH: "It's an art."

 

Kurt_Bremser
Super User

I even have a Maxim for that (30). One can keep easy-to-read code that is very inefficient, as long as the step still doesn't take more than 20 seconds or so. If you can shave off hours (or, in some environments, even just minutes) by engaging in some less-easy-to-decipher coding, then you probably want to do it.

 

In my original example, the SAS SQL strategy of throwing everything into the utility file and working off that with lots of random reads caused the WORK disks to slow down to a crawl. Since the sequence of sort/data steps engaged mainly in sequential read/write operations (and one can even split the load over disk arrays by using physically separate libraries when doing it manually), it outperformed SQL by orders of magnitude.
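The manual load-splitting mentioned above might be sketched like this. The library paths and the source library SRC are placeholders for physically separate disk arrays, not a real configuration:

```sas
/* Hypothetical: two libraries on physically separate disk arrays */
libname fast1 '/array1/sastemp';
libname fast2 '/array2/sastemp';

/* Each sort writes to its own array */
proc sort data=src.head out=fast1.head;
  by id;
run;

proc sort data=src.detail out=fast2.detail;
  by id;
run;

/* The merge then reads sequentially from two separate spindle sets */
data fast1.analysis;
  merge fast1.head (in=h) fast2.detail (in=d);
  by id;
  if h and d;
run;
```

With PROC SQL, by contrast, the intermediate data would all land in the single utility-file location, so this kind of spreading is only available when you control the steps yourself.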

 

Keeping a sequence of steps legible is also part of the "art". In my experience, I find it easier to grok what separate steps do. I absolutely hate having to read myself into complex SQL queries that involve multiple, often nested, subqueries. The straightforward style I find in our host PL/1 programs, OTOH, is easy for me; on top of that, SAS and PL/1 share the same basic syntax.

LinusH
Tourmaline | Level 20
From a logical perspective I can agree. But the recommendation seems generic, and perhaps optimal for application development outside the BI domain. In ETL, I/O is the primary area of performance tuning.
Data never sleeps


Discussion stats
  • 8 replies
  • 867 views
  • 3 likes
  • 6 in conversation