10-04-2012 10:59 AM
I'm new to DI studio. I have a simple job that runs a pass-through query against an Oracle database. It's a big query, so I expect it to run a few hours.
I scheduled it to run overnight: in DI Studio I created a job, which became a flow, and I scheduled the flow in SMC. The whole job is just the one step that runs the pass-through query with user-written code.
This morning I came in to see how everything went, and I'm not sure how it did. Either the job aborted after an hour, or it's still running. : )
On the SAS server, I can see the log file from the job in ..../Lev1/SASApp/BatchServer/Logs
The log stops right at the point of the pass-through query, i.e.:

    MPRINT(QUERY): proc SQL;
    MPRINT(QUERY): Connect to oracle(PATH=XXXXXX UID=XXXXXXX PASSWORD="XXXXXXXXXXXX");
    MPRINT(QUERY): Create Table out.mydata As Select * from Connection to oracle ( select ...
And that is where the log ends: at the select statement. Which is what I might expect to see if the query were still running on Oracle and had not yet returned results to the SAS server.
If I look for out.mydata on the SAS server, I see the temporary file mydata.sas7bdat.lck. It looks like just a shell dataset, and its mod time is the time the job started. Also what I would expect to see if the job were in progress.
I don't have access to the Oracle server to see if anything is running there. When I run the top command on the SAS server (Linux), I don't see any processes that look like this query.
The mod date on the log is almost exactly 1 hour after the job started, making me wonder if this somehow timed out (on the SAS server or on the Oracle side) and was aborted.
Anyway, I'm looking for suggestions on how to go about diagnosing this, particularly whether there is a way for me to check the status of the job from the SAS server side, or if I need to call the Oracle DBA and ask them to start reviewing their logs.
10-04-2012 11:07 AM
If you scheduled it via Operating System Services, you can check the scheduling server logs for any hints.
There are logs connected to your flow, in a subdirectory with a name containing your user ID. Under another subdirectory (your flow name) you'll find a status log. If the job finished, it should have a return code in the status log file.
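The lookup described above could be scripted with find and grep. A minimal sketch, demonstrated on a throw-away directory tree because the real scheduling-server log root (somewhere under your Lev1 config) and the exact file names are site-specific assumptions:

```shell
# Demo tree mimicking the described layout: <log root>/<userid>/<flowname>/
# holding a status file with the job's return code. Names are hypothetical.
root=$(mktemp -d)
mkdir -p "$root/myuser/myflow"
echo "return code = 0" > "$root/myuser/myflow/myjob.status"

# Locate status files and pull out the return-code lines:
result=$(find "$root" -name "*.status" -exec grep -H "return code" {} +)
echo "$result"

rm -r "$root"   # clean up the demo tree
```

In real use you would point find at the scheduling server's log directory and search for the subdirectory containing your user ID.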
10-04-2012 12:14 PM
Thanks for the suggestion.
We are using LSF.
I managed to find sbatchd.log.servername, which has a modified date of exactly when it looks like my process aborted, and this suspicious line:
    7.06 lsbatchDir(): neither /.lsbatch nor spool dir could be used by the job , temp dir will be used
Not sure what that means, but it looks like this log doesn't get written to much, so will send this off to our SAS admin and see if it helps.
Would welcome any other suggestions.
10-05-2012 09:12 PM
The log indicates that your job hasn't finished yet. Also, the suffix ".lck" indicates that your job is still running and creating the SAS table "mydata.sas7bdat".
From my understanding, SAS first creates the table in a temporary location and also creates a .lck file in your final location. Once the temporary table has been successfully created, SAS replaces the .lck file and you end up with mydata.sas7bdat in the library you've chosen.
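That write-then-rename pattern (my reading of SAS behavior, not official documentation) can be sketched in shell terms, using plain files to stand in for the real sas7bdat tables:

```shell
# Sketch of the write-then-rename pattern described above (an assumption
# about SAS internals, demonstrated with ordinary files).
lib=$(mktemp -d)

# 1. While the job runs, the .lck file holds the in-progress table.
printf 'partial table contents' > "$lib/mydata.sas7bdat.lck"

# 2. On success, the .lck file is renamed over the final table name,
#    so readers only ever see a complete mydata.sas7bdat.
mv "$lib/mydata.sas7bdat.lck" "$lib/mydata.sas7bdat"

ls "$lib"   # only mydata.sas7bdat remains
```

The upshot for diagnosis: a lingering .lck file means either the job is still writing, or it died before the rename ever happened.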
To check the status of an LSF job, here are a few links which might be useful for you:
10-08-2012 04:56 AM
I think Patrick is on the right track.
Either way, I think you shouldn't be content with execution times like this; this is 2012 after all...
Talk to the DBA and see if you could optimize your query in some way, or if any optimization could be done on the Oracle side.
You could also try to narrow down your query and figure out which part of it is causing the long run time.
10-08-2012 08:48 AM
Thanks for the links, Patrick.
Agree, often when you see a .lck file it suggests the SAS job is still running. But you can also see .lck files remaining when a job has terminated in a surprising manner, the same way you can see files left in a WORK dir when SAS doesn't exit nicely. I'm pretty sure in this case the job had actually crashed. Will be playing with it more this week.
To Linus's point, am definitely working with the DBA. Of course, even in 2012, BIG data = long job times, or at least BIG data + slow servers + slow networks = long job times. : )
10-08-2012 10:01 AM
You are of course right that a killed job could also leave a log and .lck file behind, as you're seeing. I'm not sure a "top" command would show you the job, as on the SAS side it will mainly be waiting for a result from Oracle.
Off the top of my head, I believe a command like "ps -elf | grep <part of your log name>" could give you a better result.
I believe the PID of a SAS job under Linux is stored in the automatic macro variable &SYSJOBID. I don't have access to such an environment at the moment, but I'll be sure to test next time whether DI jobs print the PID automatically at the beginning of the SAS log - and if not, then I'll start adding a log message like "%put PID is &sysjobid;" to all of my DI jobs in the future.
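Both lookups (by command-line fragment and by PID) can be demonstrated with a stand-in process; "sleep 300" here just plays the role of the hung SAS session, since I can't reproduce the real job in this sketch:

```shell
# Stand-in for the hung SAS job: a background sleep we can search for.
sleep 300 &
pid=$!

# Lookup by command-line fragment (in real use, part of the log/job name).
# The [s] bracket trick keeps the grep process itself out of the results.
by_name=$(ps -elf | grep "[s]leep 300")
echo "$by_name"

# Lookup by PID, as you would with the &SYSJOBID value printed to the log.
by_pid=$(ps -p "$pid" -o pid=,stat=,etime=,args=)
echo "$by_pid"

kill "$pid"   # tidy up the stand-in
```

The -o format on the second ps keeps the output to just the fields that matter here: process state, elapsed time, and the command line.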
10-08-2012 12:36 PM
Thanks again Patrick,
Looks like "ps -elf" does show more useful information than "top". Will have to read up on ps to understand the output.... : )
I generally dump all the global macro variables to the log at the beginning of any job running on a BI platform. If only to astound myself with the vast number of global macro variables created.
Unfortunately &JOBID doesn't look like a PID to me (but I'm pushing well past my Linux knowledge). I get &JOBID=A5PKNKGJ.BQ000029 (with some quoted symbols around it, probably parentheses or quote marks). I'm guessing this is the SAS job ID used in SAS metadata. There is a similar &TRANSFORMID. I don't see anything that looks like a Linux PID.
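For reading the ps -elf output, the header row names the columns; the ones most relevant to a hung job are S (process state: S = sleeping, R = running, Z = zombie/defunct), TIME (CPU time consumed), and CMD (the command line). A quick way to see just the header:

```shell
# Print only the header row of ps -elf; match the column names against
# the rows below it when reading the full listing.
header=$(ps -elf | head -1)
echo "$header"
```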
10-08-2012 01:52 PM
The ps command was very helpful. Screenshot of the output is below.
The first 3 rows are a job I scheduled to start at 12:44 today.
The 4th row is the ps command, issued at 12:52, so I was 8 minutes into the job.
No idea what the Oct 5 process is...
Assume next 3 lines are my putty session/samba
The last 3 lines are the job that triggered my original question - the job I started that never ended. Looks like it's not taking any CPU, but it's still waiting. The Oracle DBA doesn't see anything running, so I think this is a zombie process; I suspect I'll have to kill it myself.
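Killing the leftover process can follow the usual polite-then-forceful sequence, demonstrated here on a stand-in process (in practice you'd use the PID shown in the ps output):

```shell
# Stand-in for the orphaned job; substitute the real PID from ps in practice.
sleep 300 &
pid=$!

kill "$pid"        # polite SIGTERM first, so the process can clean up
sleep 1

# For a process started elsewhere, kill -0 tests whether it is still alive;
# escalate to SIGKILL only if SIGTERM was ignored.
# (Our stand-in is a child of this shell, so reap it with wait first,
# otherwise it lingers as a zombie and kill -0 still "succeeds".)
wait "$pid" 2>/dev/null || true
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"
fi
echo "stand-in $pid is gone"
```

SIGKILL (-9) can't be caught, so files like the .lck dataset may be left behind either way and need removing by hand.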
Also noticed the /.lsbatch directory mentioned in the log line above. Looks like this is a temporary directory used by LSF, written to my home dir.
10-08-2012 07:45 PM
I believe it's &SYSjobid, not &Jobid, which would give you the PID.
10-08-2012 08:14 PM
Correct, you are.
Sorry for misreading your first post. I was looking at &jobid in _global_ scope. &SYSjobid in _automatic_ scope indeed has the PID, as you said. Good to know. Maybe I'll start dumping all the _automatic_ vars to the log as a matter of course as well. Really appreciate all your help with this.