11-11-2014 09:12 PM
I have a job flow in LSF Process Manager which hangs on a certain job. We have been running the job for 6 months already and only encountered this recently.
We have checked the following logs to see if there were errors encountered:
SAS logs - the job shows no errors, but the log does not cover the entire script. The remaining portions have not appeared in the log yet, so the node in the flow still shows as running.
Greenplum logs - no errors in the logs, but the query tied to the job above is hanging in the queue. There is also no activity in the Greenplum DB to indicate whether the query is executing or not.
Any idea what else to check to identify the cause of the problem?
Greatly appreciate your thoughts.
11-11-2014 10:33 PM
Does it always hang or is it intermittent?
Try running the SAS job without LSF, in SAS Display Manager or EG - same problem?
What step does it hang on - always the same one or a different one each time?
11-12-2014 01:13 AM
I have already attempted these steps as well. In SAS DI it works perfectly without any error.
Also, it hangs on the same step. In the Greenplum active queries, this is what I can see hanging, which is part of the script (table/column names modified for security reasons):
insert into "WRK"."TMP2"
select
(select "MAINSRC"."amount" from "WRK"."MAINSRC" as "MAINSRC" where "TMP1"."id" = "MAINSRC"."id" and "TMP1"."latestdate" = "MAINSRC"."txndate" order by "MAINSRC"."id" desc limit 1) as "latestamount"
from "WRK"."TMP1" as "TMP1"
11-12-2014 02:38 AM
If your job works in DI but not from LSF, I would focus for now on the differences between them.
The first differences you can find:
- Even though they share the binaries and configuration from the SAS base installation + the SASApp, DI runs with the Workspace Server (usually) and LSF runs with the Batch Server. I would check the configuration for both servers (metadata + configuration files).
- Check if they are running under the same SASApp. That is what I would expect, but it is not always the case... Check which workspace server you connect to with DI and which sasbatch command you are running with LSF (the easiest way is to check which Batch Server the flow was scheduled on).
My guess: as you are working with Work, and Work can be configured to a different location for the workspace server and the batch server, it may be a good idea to check which Work location is configured for the workspace server and which one the batch server uses (sasv9 and sasv9_usermods).
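One way to compare the two environments is to print the relevant settings to the log from each of them. The lines below are a minimal sketch: run them once from DI (workspace server) and once as the first lines of the batch job under LSF, then compare the two logs. The choice of what to print is an assumption about what is worth comparing here, not a complete checklist.

```sas
/* Print where WORK points and which config files this session loaded. */
/* Run in both environments and compare the output in the two logs.    */
%put NOTE: WORK option  = %sysfunc(getoption(work));
%put NOTE: WORK path    = %sysfunc(pathname(work));
%put NOTE: CONFIG files = %sysfunc(getoption(config));
%put NOTE: Host         = &syshostname;
%put NOTE: User         = &sysuserid;
```

If the WORK path, the list of config files, or the user differs between the two runs, that difference is a reasonable first place to look.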
11-12-2014 02:08 AM
Your open log (not written through to the last lines) tells us the SAS process thinks it is running fine. As no activity is seen, it is most likely waiting on something to complete.
As the log is not written through to the last lines, you have probably killed that SAS process. To be sure you can see the last lines in the log, turn off caching of the log writes.
See the LOGPARM= system option (WRITE=IMMEDIATE) in SAS(R) 9.4 System Options: Reference, Third Edition.
As you mention Greenplum, I expect there is a connection to it. The process could have gone into an infinite wait on some resource needed to get that connection.
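As a sketch of what that setting looks like: LOGPARM= can only be set at SAS startup, so it goes in a configuration file or on the SAS command line, not in an OPTIONS statement inside the job. Which file the batch server actually reads is an assumption here; the sasv9_usermods.cfg belonging to the batch server is the usual candidate.

```sas
/* In the config file read by the batch server (assumed: sasv9_usermods.cfg). */
/* Writes each log line as soon as it is produced instead of buffering it.    */
-logparm "write=immediate"
```

This is best treated as a temporary diagnostic setting, since unbuffered log writes add I/O overhead to every job that uses the same configuration.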
11-12-2014 02:25 AM
Kindly correct me if my understanding is wrong: are you saying that something in the back end might have killed the SAS process, or that it might have been disconnected?
I ask because the flow is automated to be triggered on a Saturday evening, and no one issued a command to kill the job or flow. Thanks!
11-12-2014 02:35 AM
A possible scenario:
- Your SAS job starts ... at some point: hello Greenplum, I have work
- Greenplum: wait a moment, I am busy
- SAS: OK, I wait
You do not see anything in the SAS log file, just a waiting process. The log lines describing what has happened can sit in an internal SAS buffer, waiting for a good moment to be written. That moment never comes, as the SAS process is in an eternal wait. Forcing every log line to be written causes much more overhead (all that writing) but will show the last lines before the waiting event. No, it will not show the event itself, as that is the one that is waiting.
Nothing was killed, and there is little sign of broken processes. As you said the Greenplum logs show nothing, the job did not get far enough to do anything there. Why did it not start anything? No connection at all?