04-12-2016 05:23 AM
Hi fellow admins,
We are in the process of rolling out multiple grid environments for a large population of data scientists. We use LSF for grid management. One of the key components is the grid-launched workspace server. I am now struggling to bring down the time it takes to start a workspace server, which is currently at least 20 seconds. This is a big stumbling block for user acceptance, and I understand why. When using DI Studio, workspace servers are started all over the place. In EG one experiences an agonizing half minute of hourglass watching.
I have already tweaked a few parameters in lsb.params according to a blog post from Edoardo Riva but I am now out of ideas. That's why I turn to you.
Many thanks in advance,
Begin Parameters
MAX_JOB_NUM=10000
NEWJOB_REFRESH=Y
DEFAULT_QUEUE=normal
ABS_RUNLIMIT=Y
MIN_SWITCH_PERIOD=3600
JOB_SCHEDULING_INTERVAL=1
JOB_ACCEPT_INTERVAL=1
JOB_DEP_LAST_SUB=1
ENABLE_EVENT_STREAM=n
MAX_CONCURRENT_QUERY=100
ENABLE_HOST_INTERSECTION=Y
MBD_REFRESH_TIME=10
#MBD_SLEEP_TIME=10
MBD_SLEEP_TIME=1
#SBD_SLEEP_TIME=5
SBD_SLEEP_TIME=1
End Parameters
04-12-2016 06:01 AM
Which versions of SAS & LSF are you using and on which platform? When you look through the logs can you see where most of the delay occurs?
Have you seen the following note?: SAS Problem Note 57577: You encounter delays when you start grid-launched workspace servers or when ... Does it apply to your situation?
04-12-2016 07:01 AM
This is SAS 9.4M3 and LSF 9.1.3.
I have seen the note. It does not apply:
Job <1154>, Job Name <SAS Enterprise Guide_SASApp - Workspace Server node 01_F752E162-0AE0-0345-842F-EA85270DCC20>, User <klavj10>, Project <default>, Command </srv/SASConfig/Lev1/SASApp/WorkspaceServer/WorkspaceServer.sh -noterminal -netencryptalgorithm AES -encryptfips -metaserver osasigmdl03.ont.belastingdienst.nl -metaport 8561 -metarepository Foundation -locale en_US -objectserver -objectserverparms "delayconn sph=osasigndl01.ont.belastingdienst.nl protocol=bridge spawned spp=36720 cid=18 pb classfactory=440196D4-90F0-11D0-9F41-00A024BB830C server=OMSOBJ:SERVERCOMPONENT/A52BHKER.AY00000Q cel=everything lb grid" -METAUSER '"klavj10@!*(generatedpassworddomain)*!"' -METAPASS 7720093ab3A185107f65931940859c71>
Tue Apr 12 11:30:41: Submitted from host <osasigndl01.ont.belastingdienst.nl>, to Queue <eguide>, CWD <$HOME>, Specified Hosts <osasigcll01.ont.belastingdienst.nl>, <osasigndl01.ont.belastingdienst.nl>;
Tue Apr 12 11:30:41: Dispatched 1 Task(s) on Host(s) <osasigndl01.ont.belastingdienst.nl>, Allocated 1 Slot(s) on Host(s) <osasigndl01.ont.belastingdienst.nl>, Effective RES_REQ <select[type == any] order[r15s:pg] >;
Tue Apr 12 11:30:41: Starting (Pid 4890);
Tue Apr 12 11:30:42: Running with execution home </home/ONT/klavj10>, Execution CWD </home/ONT/klavj10>, Execution Pid <4890>;
This shows what the note calls a "healthy grid", with a one-second delay between submission and running. I will continue investigating the log files to see where the delay happens. We have an additional app server for SASEM that is not grid launched. There we see approx. 5 seconds, so that's what we're aiming at, minus of course some overhead.
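For comparing several jobs, the event timestamps from the job history can be diffed with a small script rather than by eye. What follows is only a minimal sketch in Python, assuming each event line starts with a timestamp such as "Tue Apr 12 11:30:41:" and that wrapped lines have been rejoined. The names parse_events and delays are made up for illustration, and the year is an assumption because the output does not print it.

```python
import re
from datetime import datetime

# Event lines taken from the job history in this post, shortened after the first field.
LOG = """\
Tue Apr 12 11:30:41: Submitted from host <osasigndl01.ont.belastingdienst.nl>, to Queue <eguide>
Tue Apr 12 11:30:41: Dispatched 1 Task(s) on Host(s) <osasigndl01.ont.belastingdienst.nl>
Tue Apr 12 11:30:41: Starting (Pid 4890);
Tue Apr 12 11:30:42: Running with execution home </home/ONT/klavj10>
"""

# Timestamp, a colon, then the first word of the event description.
EVENT = re.compile(r"^(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2}): (\w+)")


def parse_events(text, year=2016):
    """Return (event_name, datetime) pairs; the year is assumed, not parsed."""
    events = []
    for line in text.splitlines():
        match = EVENT.match(line.strip())
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %a %b %d %H:%M:%S"
            )
            events.append((match.group(2), stamp))
    return events


def delays(events):
    """Seconds elapsed between the first event and each event."""
    first = events[0][1]
    return {name: (stamp - first).total_seconds() for name, stamp in events}


if __name__ == "__main__":
    for name, seconds in delays(parse_events(LOG)).items():
        print(f"{name:<12}{seconds:>5.0f} s")
```

For the job above this confirms the one-second gap between Submitted and Running, which suggests the remaining delay sits outside the LSF dispatch itself.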