Recently I added a joking remark to one of my posts that included a not-so-hidden reference to the 70 Maxims of Maximally Effective Mercenaries from http://www.schlockmercenary.com/.
This earned me a question from @ballardw, and so I decided to start the project for earnest.
This is planned to be an ongoing work, in the tradition of the New Hacker's Dictionary and similar works, so feel free to contact me with suggestions for new maxims or enhancement of already existing maxims. Reordering is possible, but once a maxim has been used in an other place, I intend to keep its number static.
So - HERE THEY ARE, the
SAS provides extremely well done documentation for its products. Learning to read the documentation will enhance your problem-solving skills by orders of magnitude.
Everything you need to know about your program is in the log. Interpreting messages and NOTEs is essential in finding errors.
Having a clear picture of data structures – variable types, lengths, formats – and content will provide you with a fast-path to solving problems. Many simple problems can be cleared by taking a look at the "Columns" section in dataset properties. Use proc contents frequently.
SAS is in its core an interpreting language. Running one step is just a few mouse-clicks away. Use that to your advantage.
"Try it." (Grace Hopper, Admiral, US Navy)
SAS Technical Support and the SAS user community stand at the ready to help you. Provide a clear question, example data, your code (with or within the log, see Maxim 2), and where you seem to have failed. Help will be on the way.
Just entering the text of an error message or problematic issue, prepended with "SAS", will often yield resources with at least a hint for solving your issue on the first result page; the same is true for other search engines.
Learn to use prefabricated procedures for solving your tasks. 5 lines of proc means may equal 20 lines (or more) of data step logic.
As with Maxim 7, SAS provides a plethora of formats for input and output. Use them to your advantage. If one that fits your needs is not present, rolling your own will usually beat complicated data step logic.
Often OBS= was set to limit the number of observations for tests. This will cause seemingly unexplicable terminations of steps later on.
With large datasets, the way proc sql handles joins will lead to the buildup of large utility files, with lots of random accesses to these. This may well prove much less efficent than a join done with a data step and the necessary proc sort steps.
There may be no need for repeating code when using by-group processing can do the trick. The macro language is also not meant for handling data - like calculating dates - but for creating dynamic code. Instead of creating lists in macro variables, store them in datasets and use call execute from there. Do calculations in data steps and save the results in macro variables (for later use) with call symput.
Proper visual formatting makes for better code. Use indentation to make semantic blocks visible. What is on the same logical level, needs to be at the same column.
Avoid overlong lines, make a block of lines if necessary (that 80 character limit of old is not so bad at all).
Be consistent; the next one to maintain that piece of code might be your five-year-older self. Be nice to her.
Make frequent use of comments.
As long as you keep your ability and will to learn, you are alive. When you stop learning, you may not be dead, but you start smelling funny. Never say "I don't have the time to learn that now". The time to learn is NOW.
"Much have I learned from my teachers, more from my colleagues, but most from my students." - from the Talmud.
"How many times do I have to tell you, the right tool for the right job!"
(Montgomery Scott, Captain, Starfleet)
Never restrict yourself by a simple "Use XYZ" or "Don't use XYZ". If something is better solved with a certain procedure, use it. If you don't yet know it, learn it (see Maxim 13). If a 3rd-party tool is better suited, use it (think of DBMS operations; have them done in the DB itself). Leave operating system tasks to the operating system (see Maxim 15).
Make sure to gain knowledge about the environment in which SAS is implemented. Know the system's layout, its basic syntax, and its most important utilities. Especially UNIX is rich with tools that can and will make your life easier. Control the system, don't let the system control you.
Anything that is not properly documented, does not exist. It may look like it was there at some time, but the moment you have to deal with undocumented things (programs, data, processes), it is as if you are starting from scratch. The Jedi programmer spends 90% of her time documenting, and reading documentation.
“I should have written that down. - Dilbert”
A data warehouse needs different structures than databases, and it needs different structures than spreadsheets. Adapt your thinking away from normalized tables (redundancy is avoided in DBMS designs, but is typical in data warehousing tables used for analysis), and also away from cells (spreadsheets have no mandatory attributes for columns, SAS tables have). See a SAS table as just the technical entity it is, and model it according to the analytic need.
Do not use names of predefined SAS functions, call routines or formats for your objects (variables, datasets). This avoids confusion and prevents hard-to-trace errors when a simple typo happens.
Avoid dynamic variable names that contain values (ie dates). Instead put the value into an additional variable and transpose to a long dataset format.
ID salary_jan salary_feb salary_mar
ID month salary
with month containing "jan", "feb", "mar", and salary the respective values.
A wide format should only be used for final displays.
Dynamic variable names force unnecessary dynamic code.
Short, concise variable names make for quicker typing with less risk of typos, and make the code more readable. Put additional information in labels, where it belongs.
Creating a format from a dataset lets you do a "join" on another dataset in one sequential pass. As long as the format fits into memory, it provides an easily understandable means to avoid the sort of a large dataset.
When reading from external sources, don't rely on proc import. Instead use a data step that will fail with an error if an unexpected event in the input data happens. This lets you detect errors early in your analytic chain. Maxim 22 specifically precludes using the Excel file format for data interchange, because of the many involved automatisms.
See Maxim 23.
No batch program can be allowed to end a succesful run with a non-zero return code. No WARNINGs, no ERRORs. Any unexpected event shall cause a non-zero return code, so a chain of jobs is halted at the earliest possible moment.
The log of a production-level process should be free of any extraneous NOTEs. Automatic type conversions, missing values and so on must not be tolerated.
This allows for easier detection of semantic problems or incorrect data, as these will often cause unexpected NOTEs.
While it is possible to keep code in catalogs, storing it in simple text files is better. SAS catalogs are specific to SAS versions and operating system environments. Text files are never version or operating system specific, eventual changes in codepages are handled by SAS itself over the IOM bridge or by file transfer utilities.
Text files also enable you to use external tools for text handling (think grep, awk, sed and so on).
In the same vein, always store the code of proc format (and its related cntlin datasets), and do not rely on the created formats in their catalogs.
Use plain text whenever possible. Use simple file formats for data transfer, so you can inspect the data with a text editor. Prefer modern, XML-based files (even when compressed, like xlsx) over older, binary formats. Store your code in .sas files, use Enterprise Guide projects as temporary containers only while developing. Text lends itself well to versioning and other tools that help the programmer, binary files don't. See Maxim 26.
Keep in mind that '01jan1960'd and 0 represent the same value in SAS code. Therefore you only need formatting of macro variables when they are meant for display (eg in a title statement). When you only need them for conditions, the raw values are sufficient and easier to create.
Use simple, easy-to-understand code. Only revert to "clever" algorithms when circumstances require it. This is also a variant of the KISS (Keep It Simple, Stupid) principle.
Do not engage in optimizing before you know where the bottleneck is, or if there is a bottleneck at all. No need to speed up a step that takes 1 second in real life. Keep your code simple as long as you can (see Maxim 29).
Never trust a piece of software to do on its own what you intend it to do. If you need a certain order, sort or use "order by" in SQL. If a variable should be of certain length, set it. Do not rely on the guessing of proc import, roll your own data step. (see Maxim 22)
Even if things work in a certain way on their own now, this does not guarantee anything for the future. New software releases or changes in data might wreck your process. Take control.
When making a change on an existing piece of code, never release it without running a regression test against data from the previous version with PROC COMPARE. This makes sure that only the changes you wanted (or in the case of an optimization, no changes at all) are introduced. Even use COMPARE at crucial steps in your development process, so you catch mistakes when they are introduced, instead of having to search for that spot later. If COMPARE takes its time, enjoy your coffee. It's time well spent.
The Jedi programmer strives for intelligent data structures. The code will then be intelligent and well-designed (almost) on its own.
"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
When dealing with time-related data, use SAS date and datetime values.
Categories are best kept in character variables.
Automate the creation of formats (eg from domain-tables in the DB system). Use these formats instead of select() blocks or if-then-else constructs. See Maxim 8.
Create control datasets and use call execute to create code dynamically from them.
Keep (especially non-ASCII) literals out of the code (by putting them in data); this will avoid problems when code is moved across systems.
Use properly structured formats for dates when the sequence matters (YYMMDD instead of MMDDYY).
Don't solve a task in one everything-including-the-kitchen-sink step. Create intermediate datasets/results, so that each step solves one issue at a time that can easily be comprehended (see Maxim 29).
In the same vein, don't implement multiple changes to a program in one iteration. Instead do a check-in for every single logical change. This allows regression tests for every change, and if something breaks which was not immediately obvious in regression testing, it is easier to step back and find which change introduced the bug.
#1 "This is old, and therefore it's good".
#2 "This is new, and therefore it's better".
Fads have come and gone, but every now and then there's a gem in all the rubble. Be critical, but open-minded.
The finishing step of every piece of work must include a test of the program in exactly the same environment it will be used (same server, same user, same batch script, etc). Only then can you be sure that the first use does not result in a call for support that interrupts your well-earned coffee break. "Normal" users may have a more restricted environment than developers, so keep that in mind when developing code for them.
Perfection is attained not when there is nothing more to add, but when there is nothing more to remove.
(Antoine de Saint-Exupery)
Another application of the KISS principle; also think of the practice of "muntzing" applied to programming.
Get rid of unnecessary code, redundant variables, irrelevant observations.
"Perfection is not attainable, but if we chase perfection we can catch excellence."
And heed the words of his mad prophet Murphy:
"What can go wrong, will go wrong".
Write your code in this light, and include safeguards in expectation of the unexpected.
Make sure that you are notified when the unexpected happens (eg make a batch program crash on invalid data, so the scheduler catches it).
(With regards to Larry Niven.)
Everybody makes mistakes, including you. Be kind when dealing out critique, as someone might be forced to critique you 5 minutes later.
There's always a better, and a worse. You're not the worst, but surely you're not the best. Learn from both, and be kind to both.
And remember that the code is not the coder. Very nice people can write incredibly ugly code, and vice versa.
Having a good line of communication with the ultimate consumer of data is essential.
Make them formulate their needs in clear, unambiguous language; often this act of questioning themselves provides a major advance in solving the problem.
Never solve a problem that you think the customer needs solved.
Deliver example results early and often.
Unfortunately, most experience emerges from stupidity.
A well-formulated question may already lead you in the right direction before you speak it out loud or write it down.
An ill-formulated question will get you the equivalent of "42".
Which information needs to end up where? The answer to this question shall guide your selection of tools and methods, not the other way round. Information is at the core of your work, and delivering it correctly, timely, and usable is your ultimate goal.
Once the content is delivered and accepted, you can think of "prettying up".
All this implies that the requirements have been refined and clearly worded (see Maxim 42). Have test cases defined against which you can prove your code.
All operating systems and programming languages use blank space as the prime separator of names and objects. It is therefore a VERY BAD idea to have spaces in names (of files, directories, columns and so on). Even when a certain environment provides a means to do so (encapsulating file names in quotes in UNIX and Windows, or the 'my name'd construct in SAS), one should not use that. Underlines can convey the same meaning and make handling of such names easier by orders of magnitude.
(The fact that certain products make happy use of blanks in file- and pathnames does not make this practice right; it just illustrates the incompetence accumulated in some places.)
There Ain't No Such Thing As A Free Lunch.
A big salute to the great Robert A. Heinlein for this perennial truth.
Good, durable code requires thought and hard work. Be prepared to provide both.
Every thing of value comes at a price, although the price is often not measurable in money.
"Fast. Cheap. Good. Select any two."
One tends to forget the fact that character variables have a fixed defined length, and values will always be padded with blanks up to that length. So it is always a good idea to use trim() when doing comparisons or concatenations.
if charvar in ('AAA','BBB','CCC');
will work even if the defined length of charvar is > 3, but
if findw('AAA BBB CCC',charvar) > 0;
will fail if charvar has a length of 5 and will therefore contain 'AAA ', which can't be found in the comparison string.
Similarly, charvar = charvar !! 'some_other_string'; will invariably surprise the unwary novice.
When creating new character variables, always set a sufficient length explicitly. When character variables are assigned literals, the first such assigment sets the length, causing confusion afterwards, because longer data will be truncated.
flag = 'no'; if (condition) then flag = 'yes'; if flag = 'yes' then …;
Make it a habit to always terminate macro variable references with a dot. It's never wrong to have the dot, but omitting it can have consequences and cause confusion.
libname myexcel xlsx "/path/excelfile&year.xlsx";