Re: Best practice for log error handling

brulard · Posted 06-01-2017 11:44 AM

Hi,

If anyone has experience with report log error handling in the context of daily automated reports, I would appreciate your feedback.

Goal:

-efficient error handling mechanism requiring a minimum of user intervention.

From my online search, here is an option:

-run a daily program that reads each job's log. The program scans all the logs for keywords, such as ERROR, and compiles the matches in a report. I could then run a second program that emails me an alert if any error is present.
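A minimal sketch of that idea, written in Python purely for illustration (the same scan could be done in SAS). It assumes that problem lines start with ERROR or WARNING and that all logs are *.log files in one directory; the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: problem lines start with ERROR or WARNING, e.g.
# "ERROR: File WORK.X.DATA does not exist." or "ERROR 22-322: Syntax error".
PROBLEM_LINE = re.compile(r"^(ERROR|WARNING)\b")


def scan_log(path):
    """Return (line_number, text) pairs for every problem line in one log."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if PROBLEM_LINE.match(line):
                hits.append((number, line.rstrip("\n")))
    return hits


def scan_directory(log_dir):
    """Map each *.log file name to its problem lines; clean logs are omitted."""
    report = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        hits = scan_log(path)
        if hits:
            report[path.name] = hits
    return report
```

An alert would then only be sent when scan_directory() returns a non-empty report.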

Here are two documents on this that I found useful: http://analytics.ncsu.edu/sesug/2008/CC-037.pdf and

http://www.lexjansen.com/pharmasug/2008/cc/CC02.pdf

thanks,

Tom · Posted 06-01-2017 11:52 AM

Those are good examples of searching generic SAS logs for text that indicates problems.

But if you have daily reports, you might want to enhance that by adding logic to the reports to trap and report errors that you can anticipate. For example, if a file that a report needs in order to run is not available, your report program should say so in the log.

In the past I have found it useful to put a prefix on lines written to the log that document errors/issues as well as milestones in the process. So you might prefix errors with one string, warnings with another, and informational/note items with a third.
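To illustrate why such prefixes make a log easy to process, here is a small Python sketch that classifies and counts prefixed lines. The three prefix strings are hypothetical; the post does not say which strings were actually used.

```python
from collections import Counter

# Hypothetical prefixes, invented for this example only.
PREFIXES = {
    "JOB-ERROR:": "error",
    "JOB-WARN:": "warning",
    "JOB-INFO:": "info",
}


def classify(line):
    """Return (kind, message) if the line carries a known prefix, else None."""
    stripped = line.strip()
    for prefix, kind in PREFIXES.items():
        if stripped.startswith(prefix):
            return kind, stripped[len(prefix):].strip()
    return None


def summarize(lines):
    """Count how many lines of each kind appear; unprefixed lines are ignored."""
    counts = Counter()
    for line in lines:
        tagged = classify(line)
        if tagged is not None:
            counts[tagged[0]] += 1
    return counts
```

Because the prefixes are unique to your own code, a scan for them cannot be confused with messages that SAS itself writes.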

Quentin · Posted 06-01-2017 12:21 PM

My favorite log scanning paper is: http://www.lexjansen.com/nesug/nesug01/cc/cc4008.pdf.

The key insight I like from that paper is that with regard to processing NOTE messages: "it is safer to specify a list of messages to exclude from the report than it is to specify a list of messages that are the only ones to be reported."
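A rough Python sketch of that exclusion-list idea: every NOTE is reported unless it matches a pattern known to be harmless. The three patterns are assumed examples, not the list from the paper, and a real list would grow over time.

```python
import re

# Assumed examples of harmless NOTE messages; anything not listed is reported.
EXPECTED_NOTES = [
    re.compile(r"^NOTE: There were \d+ observations read from the data set"),
    re.compile(r"^NOTE: The data set \S+ has \d+ observations and \d+ variables"),
    re.compile(r"^NOTE: (DATA statement|PROCEDURE \w+) used"),
]


def unexpected_notes(lines):
    """Return NOTE lines that match none of the expected patterns."""
    result = []
    for line in lines:
        if not line.startswith("NOTE:"):
            continue  # indented continuation lines and other text are skipped
        if not any(pattern.search(line) for pattern in EXPECTED_NOTES):
            result.append(line.rstrip())
    return result
```

The benefit of excluding rather than including is that a NOTE nobody anticipated still shows up in the report.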

My current approach for handling log scanning of daily reports:

Each program ends with a call to a %logcheck() macro:
1. It scans the current log file, and if it finds errors/warnings/unexpected notes, it emails me a notification with a report like http://bi-notes.com/2012/04/sas-eg-check-the-log/ .
2. It appends a single record to a JobLog dataset, which has the name of the job, count of errors/warnings/unexpected notes, run time, etc.
3. Note that for this self-scanning approach to work, you need code in logcheck to recover from syntaxcheck mode, because if there is an error in the main code and SAS sets obs=0, you still want your logcheck to execute. I use: options obs=max replace NoSyntaxCheck .

Every morning a summary job runs that processes the JobLog dataset, reading all jobs that ran in the past 24 hours:
1. It compares those jobs to a dataset of ExpectedJobs, to catch any jobs that should have run but did not start (or hung).
2. It sends an email with a summary of that night's runs (# of jobs completed, # of jobs with bad log messages, etc).
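The morning comparison can be sketched as follows. This is Python for illustration only, not the actual SAS job; the record layout (job, finished, errors, warnings, bad_notes) and the function name are assumptions.

```python
from datetime import datetime, timedelta


def summarize_runs(joblog, expected_jobs, now, window=timedelta(hours=24)):
    """Summarize JobLog records whose finish time falls in the last `window`.

    joblog: list of dicts with keys job, finished (datetime), errors,
            warnings, bad_notes (hypothetical layout).
    expected_jobs: names of the jobs that should have run.
    """
    recent = [r for r in joblog if now - window <= r["finished"] <= now]
    ran = {r["job"] for r in recent}
    bad = {
        r["job"]
        for r in recent
        if r["errors"] + r["warnings"] + r["bad_notes"] > 0
    }
    return {
        "jobs_run": len(ran),
        "with_bad_messages": sorted(bad),
        # Expected jobs with no record: never started, or hung before logging.
        "missing": sorted(set(expected_jobs) - ran),
    }
```

Since a record is only written when a job reaches its final logcheck call, a hung job shows up in the "missing" list just like one that never started.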

This has been working well. If all goes well, every day I get one email saying N jobs ran fine. If something fails, I get an email from each failed job as well as the summary.

The only catch has been that it's all dependent on the morning summary job running. If that job fails, or if the server where my jobs run is completely hung, I won't get any notification, and I have to rely on my dumb mind to notice the absence of the daily email. I suppose a way to avoid that would be to have a non-SAS tool check the joblog dataset, but I haven't bothered with that.
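One way such a non-SAS check might look, as a hedged Python sketch: it only tests whether the joblog file has been modified recently. It assumes the dataset lives in a single file whose modification time changes whenever a record is appended; the function name and the 24-hour default are hypothetical.

```python
import os
import time


def joblog_is_stale(path, max_age_seconds=24 * 60 * 60, now=None):
    """Return True if the file is missing or was last modified too long ago."""
    now = time.time() if now is None else now
    try:
        modified = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable file counts as stale
    return now - modified > max_age_seconds
```

Run from a different machine than the SAS server, a check like this would also notice the case where the whole server is hung.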

HTH

Kurt_Bremser · Posted 06-01-2017 12:30 PM

Our solution looks completely different.

All jobs are run by the scheduler, which sets the "OK" condition only for jobs that exit with RC=0. Any other return code triggers an alert, upon which the datacenter people react as written in the documentation (alert the responsible person immediately, on the following day, or on the following workday). Chains of jobs will only continue when all necessary predecessor jobs have finished successfully.

The SAS programs will run without WARNINGs and ERRORs when successful, the logs will be as clean as possible (no type conversion NOTEs etc). On top of that, the shell script that interfaces between the scheduler and SAS will scan the log for certain character/word sequences to find possible aberrations, and issue custom return codes if something is found. Since all logs are individually named and kept, research in case of unexpected problems is made quite easy.

So you see, we are much more proactive than just a periodic text scan for ERRORs.

We wouldn't be able to keep track of 1000+ jobs running in batch with just two SAS developers, otherwise.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Quentin · Posted 06-01-2017 01:09 PM

I'm curious @Kurt_Bremser: if a program throws a bad NOTE:, i.e. type conversion or uninitialized variable or whatever, which part catches that? Does the SAS job reliably catch all the bad notes and set RC to non-zero? Or would that be caught by the shell script that does the log scan?

The rules/variety of the return codes provided by various SAS modules vary so much, I've never paid enough attention to them. Typically anything that triggers a bad RC in &syserr or whatever would also trigger a bad log message. And it's easy enough to throw custom bad log messages. So for my needs, having each job end by determining the number of errors/warnings/bad notes is enough of a return code.

Kurt_Bremser · Posted 06-01-2017 02:21 PM

In case SAS returns with 1 or 2 (WARNING or ERROR), or any other non-zero, that's it. If it returns 0, I scan the log file with an extended grep that searches for phrases I want to catch. Some will cause an automated rerun (with a limiting counter, of course), others will set a custom return code.
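The decision logic described here can be sketched in Python as follows. This is not the actual shell script: the phrases, the custom codes, and the function names are all hypothetical, since the post does not list them.

```python
import re

# Hypothetical phrase tables, invented for this example only.
RERUN_PHRASES = [re.compile(r"temporarily unavailable", re.IGNORECASE)]
CUSTOM_CODES = [
    (re.compile(r"values have been converted"), 10),
    (re.compile(r"is uninitialized"), 11),
]
RETRIES_EXHAUSTED_RC = 99  # hypothetical code for "gave up after reruns"


def classify_run(sas_rc, log_text):
    """Return ("fail", rc), ("rerun", None) or ("ok", 0) for one finished run."""
    if sas_rc != 0:
        return ("fail", sas_rc)  # non-zero from SAS: no log scan needed
    if any(pattern.search(log_text) for pattern in RERUN_PHRASES):
        return ("rerun", None)
    for pattern, code in CUSTOM_CODES:
        if pattern.search(log_text):
            return ("fail", code)
    return ("ok", 0)


def run_with_retries(run_job, max_attempts=3):
    """Call run_job() -> (sas_rc, log_text) until it needs no rerun.

    Returns (return_code, attempts_used); the counter limits the reruns.
    """
    for attempt in range(1, max_attempts + 1):
        action, code = classify_run(*run_job())
        if action != "rerun":
            return code, attempt
    return RETRIES_EXHAUSTED_RC, max_attempts
```

In a real wrapper, run_job would start SAS and read back its exit code and log; here it is left as a parameter so the logic can be tried without SAS.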

The SAS code itself has macros that will react to certain conditions (missing but required parameter, bad settings for remote input files, ...), and other checks (e.g. too small a number of records in the infile), which lead to an abort abend with custom return codes. Some of these codes enable the datacenter operators to correct mishaps on their own.

Quentin · Posted 06-01-2017 02:46 PM

Thanks @Kurt_Bremser. I can see how the return codes could be useful for communicating to datacenter operators, and having them try to assess/fix. Since I'm typically the only person who will be doing such investigation, sending myself an alert email with a snip from the log is typically enough. And I don't have thousands of scheduled jobs (not even hundreds : )

Kurt_Bremser · Posted 06-01-2017 03:20 PM

It all depends on the size of your SAS operations and the infrastructure already in place. Since we already had Control-M for all our mainframe ops, and the necessary people running it, we "only" had to build the necessary interface (scripts and logical definitions, ssh connection) to run UNIX programs from mainframe JCL scripts.

Before integrating the UNIX data warehouse into the MF job control, I used a combination of cron entries and makefiles to build the dependencies.

brulard · Posted 06-02-2017 04:43 PM

Everyone,

Thank you very much for all of your feedback.

Patrick · Posted 06-03-2017 09:54 PM

@Quentin

I'm very much with @Kurt_Bremser that using a scheduler and having the scheduler take actions based on return codes is the "best" way of implementing job control. As the scheduler executes as the parent process, it will also deal with cases where the SAS program runs into such a bad situation that it wouldn't send an email anymore.

It's a shame that we can't configure SAS return codes on a granular level and that there is often a need to post-process SAS logs to capture NOTE messages which we consider should be a Warning or even Error.

What you could do is implement such SAS log post-processing in a single place and, if there is a match, have this code throw a Warning or Error. If you then call this code via the TERMSTMT option in your batch command, you could still capture the return code via the scheduler and thus have a simple, single approach to batch job alerting for all your scheduled jobs.

http://support.sas.com/documentation/cdl/en/lesysoptsref/69799/HTML/default/viewer.htm#n0rjd82dx13qi...

...and: If you take such an approach, where your SAS log analysis feeds the return code directly back to the scheduler, then you can also implement job dependencies, like only running dependent jobs if the previous one ended with a return code of 0 (and you can already set a different return code via SAS log post-processing).

Quentin · Posted 06-03-2017 10:11 PM

I agree, it's a shame that we can't make SAS return codes more sensitive.

But living with that, since we agree that it is often necessary to resort to log scanning to catch bad notes, then I don't really see the benefit of checking return codes at the end of a job AND doing log scanning. Are there situations where a "bad" return code is set which don't throw bad log messages that would be caught by a log scanner?

Suppose a log scanner creates macro variables with the count of bad NOTEs, WARNINGS, and ERRORS, or just a single macro variable with the sum of those counts. Maybe that's the best return code.

I don't see a benefit to having the scheduler send an email when a job ends with errors instead of the SAS session. Unless the scheduler can be configured to send an email if a job doesn't complete within a certain time frame. So that if the SAS session was hung, the scheduler could send an email, or kill it and resubmit.
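The time-limit case can be illustrated with a small Python wrapper that plays the scheduler's role as parent process. It is a sketch under assumptions: the return code used for a hung job and the function name are hypothetical, and what to do next (email, resubmit) is left to the caller.

```python
import subprocess

HUNG_RC = 124  # hypothetical return code meaning "killed after time limit"


def run_with_timeout(command, timeout_seconds):
    """Run command (a list of arguments) and return its exit code.

    If the time limit expires, subprocess.run kills the child process and
    raises TimeoutExpired, which is mapped to HUNG_RC here.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return HUNG_RC
    return completed.returncode
```

A SAS session cannot do this for itself once it is hung, which is the one case where the check has to live in the parent process.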

I'm certainly not arguing against schedulers. Our BI server uses LSF, and it's configured to send emails.

--Q.

Patrick · Posted 06-03-2017 10:33 PM

@Quentin

"Are there situations where a "bad" return code is set which don't throw bad log messages that would be caught by a log scanner?"

There are situations where you only get a NOTE for something where you'd like SAS to set a return code of Warning or Error.

If you post process the SAS log via TERMSTMT then you can throw a return code other than 0 directly as part of your job which allows you to implement job dependencies based on job return code (i.e. only run job 2 if job 1 ended with a return code of 0).
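The dependency rule in the last sentence amounts to the following, shown as a Python sketch rather than as any particular scheduler's configuration. Each job is represented by a hypothetical callable that returns its return code.

```python
def run_chain(jobs):
    """Run jobs in order; stop at the first one with a non-zero return code.

    jobs: list of (name, callable) pairs, each callable returning an int.
    Returns a dict of the return codes of the jobs that actually ran.
    """
    results = {}
    for name, run in jobs:
        return_code = run()
        results[name] = return_code
        if return_code != 0:
            break  # dependent jobs are not started
    return results
```

With log post-processing feeding the return code, a job that only produced a bad NOTE would also stop the chain.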

Quentin · Posted 06-04-2017 11:25 AM

@Patrick

Thanks, I agree with:

"There are situations where you only get a NOTE for something where you'd like SAS to set a return code of Warning or Error."

I often use the undocumented dsoptions=note2err just to force some (almost all?) bad notes into errors.

Would you agree with:

"There is no situation in which checking a return code can detect a problem which could not be detected by checking the log."

?

If so, then it seems to me that checking the log may be the best way to generate a return code, rather than using the mix of SAS-provided return codes (&syserr, &syscc, &SQLrc, etc.), several of which have idiosyncrasies.

And yes, I agree that a job return code is useful when you want to have scheduling logic that depends on job status. My main point is that creating your own job return code from log scanning seems safer to me than relying on automatic job return codes.

brulard · Posted 06-06-2017 08:25 AM

Quentin: thanks for providing a detailed description of your preferred method. I'm reading through the documentation you referenced and may attempt to implement it. We have no more than a hundred jobs to run. Two types of issues I'll be looking to flag: (i) the server was taken offline during the job run, and (ii) the source data was not loaded into the table that is being queried.

Others: if your log checking method is different from the one referenced above, kindly provide a link to documentation, thanks.
