Egrodrigues2014
Fluorite | Level 6

Hi Experts,

 

We have a batch process that executes every day. This week, a job that usually takes no more than 18 minutes of execution time (real time, as you can see) is now taking more than 45 minutes to finish.

[screenshot: Egrodrigues2014_1-1665685816496.png]

We already have the FULLSTIMER option active, but we don't know why only the real time increased, and why only this week.

 

It might be an I/O issue. Does anyone know how we can identify whether it is really an I/O problem or something else (the network, for example)?

 

Older documentation shows FULLSTIMER stats that could help identify the problem, but they do not appear in our log. The stats are the ones shown below:

[screenshot: Egrodrigues2014_2-1665686278510.png]
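For reference, FULLSTIMER can be set either inside the program or at SAS invocation. A minimal sketch (note that the exact set of FULLSTIMER statistics depends on the host operating system, so fields shown in older documentation for other platforms may simply not be reported on Windows):

```sas
/* Turn on extended step timing statistics for the rest of the session */
options fullstimer;
```

It can also be set for the whole batch run on the command line, e.g. `sas -fullstimer job.sas`, or in the SAS configuration file.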

If anyone has faced this problem before or knows what is missing, please let us know. Any help will be more than welcome.

 

Thanks in advance.

12 REPLIES
ballardw
Super User

What does the rest of the log look like for that job? Is the number of observations processed in each step roughly the same as before? If your long run was using four times the data, then expect longer run times.

 

Look at each procedure's summary. While the basic output does not show all those details, you can still see whether one or more steps is taking longer, and typically the number of observations processed.

 

Any changes in network traffic on your system?

Size of output change?

Egrodrigues2014
Fluorite | Level 6

Hi @ballardw ,

 

To be more specific, this is one of the queries whose run time has increased dramatically. As you can see, it reads from a database (SQL Server, VAULT schema) and from WORK, and writes to the WORK directory.

 

[screenshot: Egrodrigues2014_0-1665748830660.png]

The number of observations is almost the same:

[screenshot: Egrodrigues2014_1-1665748830680.png]

 

We asked the customer about any change in network traffic, and they said it is still the same.

Quentin
Super User

It's hard to tell from the logs, but the code you showed is using implicit pass-through PROC SQL.  That always scares me, because SAS decides whether to let the database do all the work of the query, or whether to pull all the data down to SAS and have SAS do the work.  So this might be a case where SAS has decided to pull all the data down to SAS, and that can substantially increase job time.  From the log, it doesn't look like that is happening, but just a thought.  I generally feel like coding explicit pass-through is safer, and it also makes it easier to run the exact same query from SAS and from SSMS or whatever query tool.
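For illustration, here is a minimal sketch of explicit pass-through, where the query text is sent to SQL Server verbatim so the database always does the work. The connection options, column names, and table name are placeholders, not the poster's actual code:

```sas
/* Explicit pass-through: the inner query runs entirely on SQL Server;
   only the result set comes back to SAS. Adjust connection options
   (DSN, authentication) and the query for your own environment. */
proc sql;
  connect to odbc as db (datasrc="my_sqlserver_dsn");  /* hypothetical DSN */
  create table work.result as
  select * from connection to db (
    select acct_id, sum(amount) as total
    from VAULT.transactions            /* executed by SQL Server, not SAS */
    group by acct_id
  );
  disconnect from db;
quit;
```

The same query text can then be pasted unchanged into SSMS to compare timings on the database side.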

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Tom
Super User

I would investigate the remote database.

The data being queried could have grown.  Data updates could have unbalanced the data.

There might be new users accessing the same database, competing with your query.

 

Get the DBA for the remote database to investigate.  It might be harder to find the information on the previous runs for comparison, but they might be able to make some optimizations to the database that could improve things without any code changes on your end.  Or they might suggest a better query you could use.

 

 

Sajid01
Meteorite | Level 14

Hello @Egrodrigues2014 
As one can clearly see from the logs, the stats are identical except for real time.
The two probable causes would be (1) network congestion, or (2) something happening at the database level.
If this has happened only once, it is likely a transient issue in one of those two areas and is better ignored. It is like an accident causing a traffic delay.
If it is happening on a regular basis, then the SAS team should work with the network and DBA teams to identify the cause. Possibly a new process is running at the same time, for example a backup or ETL process.
The issue brings to the fore the often-reiterated statement: "processing should take place where the data resides."

 

Quentin
Super User

Agree with @ballardw .  

 

I would start by looking for any individual steps that have increased in time dramatically.

 

You didn't say much about what this job is doing.  Is it reading or writing data from a database?  Then you would want to look into potential database changes, or changes to the network between you and the database.

 

Is it reading / writing a bunch of files stored on a network far away from the SAS job? 

 

My understanding is if real time increased dramatically without CPU time increasing, it suggests SAS is waiting for something.  Meaning it's waiting for I/O, network, database, etc.

SASKiwi
PROC Star

Examine your complete SAS log, compare it with last week's, and locate the step(s) that are much slower. If, for example, you are reading an external database and that is the only step that is slower, then you have found your reason. If all steps are slower, then you may be getting much poorer I/O performance. Typically, the bigger the difference between real time and CPU time, the more I/O-constrained your job is.
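If the logs are saved to disk, the comparison can itself be scripted. A minimal sketch (the log path is a placeholder; run it once per log, or against both logs, and compare the output side by side):

```sas
/* Pull the FULLSTIMER timing lines out of a saved batch log so the
   slow step(s) stand out. Repeat with last week's log to compare. */
data timings;
  infile "D:\logs\job_thisweek.log" truncover;   /* hypothetical path */
  input line $char200.;
  if index(line, 'real time') or index(line, 'cpu time');
run;

proc print data=timings noobs;
run;
```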

Sajid01
Meteorite | Level 14

Hello @Egrodrigues2014 
Disk I/O would impact every run. The difference in real time between the two runs indicates that it is not the likely cause.
Often, when a program pulls data from an external database, network congestion or some other change at the database level can cause such a difference in real times between runs. I have observed this happening in practice.
In the present case, I suggest reviewing the entire log and identifying the 'rate-limiting step(s)' (the steps requiring the largest run times). This will give an idea of what is causing the issue.

 

Egrodrigues2014
Fluorite | Level 6

Hi @Sajid01 ,

 

We are aware that it could be a network congestion issue. We asked the customer about any change in network traffic, and they said it is still the same.

As it executes as a batch process in production, which runs at night (approx. 3:00 am), we cannot monitor it at that moment using OS tools (Windows Server 2012 R2). Is there a way to save historical information about disk I/O and network usage during batch execution? It is a customer's server and we do not have permission to install any monitoring software.

Quentin
Super User

Looks like that one step is your main culprit.

 

I'm not a DBA, but usually my first step would be to run the query via SSMS (ideally with SSMS running somewhere close to the database) and see how long it takes.  This should help you get a sense of whether the problem is on the database side or in the network transfer.  You can play with running the query from SAS (via explicit pass-through), from SSMS in the same place where your SAS session is running, and then from SSMS close to your server.  Of course, if you have a helpful DBA, they should be able to look into logging on the SQL Server side to see if this query really is taking four times as long as it took last week.

 

But my money is on network stuff, and I guess for that you would need to have a network engineer involved.

 

Often I will make a little job out of just the one problematic query.  Then I can run it during the day for testing, schedule it at different times at night, etc., just to see if there are any differences.
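Such a test job can be as small as this sketch: just the slow query plus timestamps in the log, so repeated runs at different times are easy to compare. The schema and table names are placeholders.

```sas
/* Standalone test job: only the problematic query, bracketed by
   log timestamps, for ad-hoc or scheduled comparison runs. */
options fullstimer;
%put NOTE: test started at %sysfunc(datetime(), datetime20.);

proc sql;
  create table work.test_result as
  select * from vault.some_table;   /* hypothetical: the slow query goes here */
quit;

%put NOTE: test finished at %sysfunc(datetime(), datetime20.);
```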

Kurt_Bremser
Super User

When real time exceeds CPU time, there are two common reasons:

  1. I/O does not keep up with your needs, causing wait states
  2. Multiple processes competing for resources, so your SAS gets only a slice of the available resources

A combination of both is of course possible. Multiple SAS jobs making heavy use of WORK can cause latencies there, especially if you still use spinning metal.

 

So you need to use system monitoring tools to see which processes are running concurrently. I used to run topas on our AIX SAS server all the time to watch its performance.
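For the Windows Server case mentioned above, one option that needs no extra software is typeperf, which ships with Windows and can log performance counters to a file. A sketch of launching it from the SAS batch job itself (the counter names, sampling interval, and output path are examples to adjust):

```sas
/* Start Windows' built-in typeperf in the background at job start,
   logging disk and network counters every 30 seconds (120 samples)
   to a CSV for later review. Paths and counters are placeholders. */
options noxwait noxsync;
x 'start /b typeperf "\PhysicalDisk(_Total)\Avg. Disk Queue Length" "\Network Interface(*)\Bytes Total/sec" -si 30 -sc 120 -f CSV -o D:\logs\perf_batch.csv';
```

Windows' Performance Monitor (perfmon) data collector sets can achieve the same thing on a schedule, also without installing anything.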

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Egrodrigues2014 

 

I have encountered a similar problem before. A batch job was delayed so it collided with the daily backup of the database, and that dramatically increased download time. There could be many other reasons for your problem, but it might be a good idea to check if your long-running job was started at the same time as usual, and contact the database admin and ask when the daily backup is running, so you can schedule your job to avoid conflict.


Discussion stats
  • 12 replies
  • 2153 views
  • 6 likes
  • 8 in conversation