07-27-2017 08:23 PM
My server specs:
CPU: 16-core Xeon
Memory: 128 GB
Disk: 5 TB 7200 rpm HDD, plus SAN
I have a new question. I wrote a program that runs a query to pull some data from a table with 5 million observations.
I run the program from the CLI:
<my SASHome>/SASFoundation/9.x/bin/sas_en program.sas
A single process takes 4 seconds, but if I launch 300 processes randomly over a period of 15-20 minutes, they take anywhere from 5 seconds to 15 minutes. The launch times are spread out roughly like a Gaussian; the processes do not all start at the same moment.
I don't know why some processes take 15 minutes to finish. While the processes run, I use approximately 20 GB of memory and 10-20% of each core of the server. Any ideas?
Thanks in advance.
07-27-2017 08:57 PM
You wrote: "all process launching are distributed in a normal gaussian not run all at same time."
Are there still processes that overlap in time, i.e. that run at the same moment?
Could it be that some resource is held by process 1, making process 2 wait?
Running 300 processes takes overhead time to manage them.
The server may also be busy with other jobs.
Can you log the start time and end time of each process, to check for overlap with the long-running ones?
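One way to do that check: wrap each launch in a small shell function that logs start and end timestamps, then sort the log and see which time windows overlap. This is only a sketch, not the poster's actual setup: `sleep 1` stands in for the real `<my SASHome>/SASFoundation/9.x/bin/sas_en program.sas` invocation, and only 3 jobs are launched to keep it small.

```shell
# Sketch: log start/end times of every background job, then inspect overlaps.
LOG=job_times.log
: > "$LOG"                        # truncate the log before the run

run_job() {
    start=$(date +%s)
    sleep 1                       # stand-in for: sas_en program.sas
    end=$(date +%s)
    # one short line per job; appends this size are written intact
    echo "job=$1 start=$start end=$end elapsed=$((end - start))" >> "$LOG"
}

for i in 1 2 3; do
    run_job "$i" &                # launch in the background, like the real runs
done
wait                              # let every job finish
sort "$LOG"                      # jobs whose [start,end] windows overlap ran together
```

Jobs whose `[start,end]` windows overlap were competing for the same resources; if the 15-minute outliers always coincide with many overlaps, that supports the contention theory.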
07-28-2017 12:08 AM
You still have 7200 rpm disks in a server? That means a lot of latency, and it shows up as soon as there is parallel usage.
I bet your problems come from the I/O subsystem.
Replace the disks with SSDs.
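You can test that bet before buying hardware. A Linux-only sketch (it assumes `/proc/stat` is available): sample the kernel's iowait counter twice, one second apart, while the 300 jobs are running. The fifth value after the `cpu` label is iowait, in clock ticks.

```shell
# Sketch (Linux): estimate the share of CPU time spent waiting on I/O
# by sampling the aggregate "cpu" line of /proc/stat one second apart.
# (This ignores irq/softirq/steal ticks, so it is only an estimate.)
read -r cpu u1 n1 s1 i1 w1 rest < /proc/stat
sleep 1
read -r cpu u2 n2 s2 i2 w2 rest < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
wait_ticks=$(( w2 - w1 ))
echo "iowait over the last second: $(( 100 * wait_ticks / total ))%"
```

A persistently high percentage while the jobs run means the CPUs are mostly idle waiting for the disks; a value near 0% suggests the bottleneck is elsewhere.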
07-28-2017 12:08 AM - edited 07-28-2017 12:09 AM
It is hard to understand what is happening on a system without more information. Systems normally schedule work in a round-robin sort of flow, so some things may be waiting for a resource to free up. With that many processes, SAS jobs (in general) are I/O bound.
Normally SAS is I/O bound. The trick is to separate the I/O channels, but without more info I can't say that is your issue.
Can you split the jobs so that they read from different I/O channels? You indicate no pressure on RAM or CPU, but those are normally not what slows SAS down anyway.
Too many jobs can actually hurt a system, since it will constantly buffer data in and out. You are normally better off finding the optimal level rather than picking a random number, like 300, and firing them all off. Try 10, then 50, then 100, and so on, until you find the balance.
In general, the issue is 'normally' not a system issue but the code logic for what you are trying to do.
Hip shooting here, just what I have seen.
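The ramp-up idea above (10, then 50, then 100) can be scripted with `xargs -P`, which caps how many processes run at once. A hedged sketch: `sleep 0.2` stands in for the real `sas_en program.sas` job, and the job count is scaled down from 300 to 30.

```shell
# Sketch: cap concurrency instead of firing everything at once.
# Raise MAX_PARALLEL between test runs (10, 50, 100, ...) and compare
# the total wall time to find the sweet spot for this server.
MAX_PARALLEL=10
done_count=$(seq 1 30 \
    | xargs -P "$MAX_PARALLEL" -I{} sh -c 'sleep 0.2; echo done' \
    | wc -l)
echo "completed $done_count jobs with at most $MAX_PARALLEL in parallel"
```

Time each full run (e.g. with `time`); throughput usually rises with concurrency up to a point and then falls off as the disks start thrashing.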
07-28-2017 01:44 AM
+1 for an I/O bottleneck.
It's extremely likely that each of your jobs is trying to read a different part of the disk, and the disk spends (a lot) more time seeking than reading.
Create a RAM disk and load the data there; all those random accesses will be much faster.
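A minimal sketch of the RAM disk idea on Linux. `/dev/shm` is usually already a tmpfs mount, so no root is needed to try it; the mount command and the `-work` option in the comments are illustrative, so check them against your distro and SAS version, and make sure the size fits in your 128 GB of RAM.

```shell
# Sketch: use a RAM-backed directory for the hot data.
# A dedicated RAM disk would be created like this (as root):
#   mount -t tmpfs -o size=32g tmpfs /mnt/ramdisk
# SAS could then be pointed at it, e.g. relocating the WORK library:
#   sas_en program.sas -work /mnt/ramdisk/saswork
RAMDIR=/dev/shm/sas_ramdisk_demo       # hypothetical demo path
mkdir -p "$RAMDIR"
dd if=/dev/zero of="$RAMDIR/testfile" bs=1M count=8 2>/dev/null
size=$(wc -c < "$RAMDIR/testfile")
echo "wrote $size bytes to RAM-backed storage"
rm -rf "$RAMDIR"                       # clean up the demo directory
```

Copy the 5-million-row table into the RAM directory once before launching the jobs, so all 300 readers hit memory instead of seeking on the 7200 rpm disks.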