With 9.4 M6, SAS Grid Manager is now offered with a new default grid workload manager: the SAS Workload Orchestrator. The SAS Workload Orchestrator is a fresh approach to managing workload distribution in SAS Grid Manager. Don't worry: SAS continues to offer variations of SAS Grid Manager for use with IBM's Platform software or with Apache Hadoop's YARN, and those products will continue to be supported for the foreseeable future.
SAS Workload Orchestrator works differently from those other grid workload managers. And when deploying the SAS Grid Manager solution with SAS Workload Orchestrator, there are some new considerations to address after the SAS Deployment Wizard has finished its tasks.
The SAS Workload Orchestrator service can be operated individually on each grid host using the [SASCONFIGDIR]/Lev1/Grid/sgmg.sh utility script to start, restart, stop, or query status. But for a grid with many compute hosts, having to invoke the Grid/sgmg.sh script on each and every machine can get tiresome.
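For example (the configuration path here is a placeholder for your site's actual SASCONFIGDIR, and the exact name of the status parameter is my assumption):

```shell
# Query the status of the SAS Workload Orchestrator service on this host.
[SASCONFIGDIR]/Lev1/Grid/sgmg.sh status

# Start, restart, or stop it the same way:
[SASCONFIGDIR]/Lev1/Grid/sgmg.sh start
[SASCONFIGDIR]/Lev1/Grid/sgmg.sh restart
[SASCONFIGDIR]/Lev1/Grid/sgmg.sh stop
```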
So the SAS Workload Orchestrator also offers two Python scripts in that same Lev1/Grid directory: gridStart.py and gridStop.py. As it turns out, most standard Linux deployments probably have the base Python components installed and ready to go. For gridStart.py, that's great: it works right out of the box. When executed, it automatically establishes SSH connections to each of the grid hosts and executes the Grid/sgmg.sh script with the start parameter.
However, gridStop.py needs a little more attention at initial deployment. When it runs, it doesn't use SSH to contact the grid hosts to execute Grid/sgmg.sh with the stop parameter. Instead, it uses HTTP to contact the SAS Workload Orchestrator process's RESTful API on the master host and directs it to shut down the grid.
In order for gridStop.py to make those HTTP RESTful API calls, it relies on a Python library named "requests". The Python requests library may not be installed already - and if it's not, you're responsible for installing it. There are four commands you can use to install the infrastructure needed to get the requests library:
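The exact commands depend on your distribution. As one hedged example, on a RHEL-family system the sequence might look like the following (package names here are typical, not prescriptive):

```shell
# Hypothetical install sequence for a RHEL-family host; adjust for your distribution.
sudo yum install -y epel-release     # enable the EPEL repository (provides pip)
sudo yum install -y python-pip       # install the pip package manager
sudo pip install requests            # install the requests library itself
python -c "import requests; print(requests.__version__)"   # verify the install
```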
This install procedure is only needed on the host machine(s) where you plan to invoke execution of the gridStop.py script - not on every grid host.
With the gridStart.py and gridStop.py scripts fully functional, then it's easy to operate the SAS Workload Orchestrator process across many machines at once.
SAS Workload Orchestrator automatically tracks many aspects of resource utilization on the grid host machines. It monitors the number of processes running, how much CPU and RAM they’re consuming, network availability, disk activity, and much more. All of that information is helpful in determining which host is best suited to run the next grid job.
The SAS Workload Orchestrator can also monitor the activity and resource utilization of individual grid jobs as they're running. Some of it is very specific and may be of limited value except in certain circumstances. In order to monitor a few specific job statistics, the SAS Workload Orchestrator needs escalated privileges. That's because grid jobs run as the userid that requested them - whereas the SAS Workload Orchestrator process is usually running as the userid of the SAS Installer account.
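To see why escalated privileges come into play for some of those statistics: on Linux, per-process disk I/O counters are exposed in /proc/&lt;pid&gt;/io, and that file is readable only by the process owner or a suitably privileged process. A quick illustration:

```shell
# Read this shell's own I/O counters - always permitted for your own process.
cat /proc/self/io

# Reading a process owned by another user typically fails with
# "Permission denied" unless you hold the necessary privilege:
# cat /proc/1/io
```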
To grant the SAS Workload Orchestrator the specific privileges it needs, the Grid Manager deployment guide directs us to grant file capabilities on the bin/sgmg executable:
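The deployment guide lists the exact setcap invocation and capability set for your release. As a hedged sketch only (the capability name shown here is my assumption, not the documented set), granting file capabilities looks like:

```shell
# Run as root from the directory containing the sgmg binary. The specific
# capabilities to grant come from the SAS Grid Manager deployment guide;
# cap_sys_ptrace is shown purely as an illustration.
sudo setcap "cap_sys_ptrace+eip" bin/sgmg

# Confirm what was set:
getcap bin/sgmg
```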
Setting file capabilities in this way is a great approach in support of the Principle of Least Privilege. The idea is that we're only granting the privileges which are needed - and nothing more. This security practice is an important design consideration which the SAS Workload Orchestrator can employ in environments which support it.
As someone familiar with SAS grid deployments, you already know that a single deployment of the SAS Compute Tier on a Linux (or UNIX) host can be shared across multiple machines: one install for many hosts. Choosing the correct shared file system for this purpose is an important task - usually involving third-party software and dedicated hardware to achieve the necessary level of service.
However, for grid deployments which are not especially sensitive to performance or are otherwise not really mission-critical, like dev/test environments or proof-of-concept implementations, plain old NFS has historically been sufficient for the task… until now.
Unfortunately, the currently active implementations of the NFS protocol do not convey file capabilities to remote hosts. So if your site is relying on NFS as the shared file system technology to mount a single installation of SAS Compute Tier software across multiple grid hosts, then we must use something other than file capabilities.
If you want SAS Workload Orchestrator to monitor grid jobs' disk I/O statistics - and if file capabilities are not working - then there's an alternative approach: enable setuid on the bin/sgmg executable instead.
You've already seen setuid in action for SAS executables like elssrv, objspawn, and sasauth. Those processes run with the ability to use the full set of root privileges… but they use only a fraction of that power. We can do the same with bin/sgmg so that it can see those disk I/O stats:
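Here is a sketch of enabling setuid-root on the binary, run from the directory containing bin/sgmg (paths at your site will differ):

```shell
# Make root the owner of the binary, then turn on the setuid bit.
sudo chown root bin/sgmg
sudo chmod u+s bin/sgmg

# Verify: the owner's execute position should now show 's' (e.g. -rwsr-xr-x).
ls -l bin/sgmg
```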
Do not implement both file capabilities and setuid on bin/sgmg. That's derpy. Choose the right one for your environment.
Establishing the appropriate level of privilege isn't enough. When root-level privileges are enabled on a file (either using file capabilities or setuid), the Linux dynamic linker runs that newly capable executable in secure-execution mode, which changes the way shared library files are located. Specifically, the LD_LIBRARY_PATH environment variable is ignored, so it's no longer the correct way for SAS to find the library files it needs - especially those for encryption.
Instead we must define the path to the required SAS library files using a different approach:
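One way to do that is through the system's ldconfig mechanism: add the SAS library directories to the dynamic linker's trusted search path, which is honored even in secure-execution mode. The directory shown here is illustrative; use the library paths from your own deployment:

```shell
# Hypothetical SAS library directory - substitute the paths from your deployment.
echo "/opt/sas/sashome/SASFoundation/9.4/sasexe" | sudo tee /etc/ld.so.conf.d/sas.conf

# Rebuild the dynamic linker cache; -v prints a verbose listing of what was found.
sudo ldconfig -v
```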
The -v option directs the ldconfig utility to provide you with a verbose listing. Near the top of the output you should be able to confirm that *.so files are available from:
Now that bin/sgmg and the operating system are configured to work well together, then SAS Workload Orchestrator has the additional privileges it needs to monitor disk I/O for grid jobs.
Let's confirm that the SAS Workload Orchestrator can actually monitor the disk I/O of grid jobs:
At this point, any new jobs submitted to the Default queue will be dispatched with a limit of 9,999 MB of local disk I/O. If that limit is exceeded, then SAS Workload Orchestrator will kill the job.
The other statistic we enabled SAS Workload Orchestrator to monitor with its escalated privileges, which can also be selected as a Limit criterion, is MaxIoRate. Keep in mind that MaxIoTotal and MaxIoRate monitor local disk I/O only. If a grid job accesses all of its files over NFS, then that's a network measurement, not local disk I/O.
And finally, if MaxIoTotal and MaxIoRate are not limits that you want to measure in this grid environment, then there's no need to enable escalated privileges (and all it entails) for SAS Workload Orchestrator.
SAS Workload Orchestrator runs on every grid host machine. At startup, one host acts as the grid master. If the grid master fails for some reason - machine crashes, process killed, network interrupted, etc. - then the SAS Workload Orchestrator process on another host will take on the role of master.
In the section above describing how to monitor disk I/O for grid jobs, did you notice that the URL to view the SAS Workload Orchestrator's web interface references the host machine of the current master: http://[SWO-MASTER.site.com]:8901/sasgrid/index.html?
If that master host goes offline, then the grid is still functional - but the web interface won't be available at that same URL. We need to configure the environment so that one static URL will automatically redirect to whichever SAS Workload Orchestrator master candidate host currently holds the master role:
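The SAS Web Server is based on Apache httpd, so one hedged sketch of such a configuration uses mod_proxy_balancer with each master candidate listed as a member. The host names and the /SASGridManager location shown here are illustrative assumptions, not documented values:

```apache
# Illustrative reverse-proxy fragment for the SAS Web Server's httpd.conf.
# Requires mod_proxy and mod_proxy_balancer to be loaded.
# List every SAS Workload Orchestrator master candidate as a member.
<Proxy "balancer://swo">
    BalancerMember "http://grid1.site.com:8901"
    BalancerMember "http://grid2.site.com:8901"
    BalancerMember "http://grid3.site.com:8901"
</Proxy>

ProxyPass        "/SASGridManager" "balancer://swo/sasgrid"
ProxyPassReverse "/SASGridManager" "balancer://swo/sasgrid"
```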
With these changes, then the SAS Web Server is configured to act as a reverse proxy to automatically redirect requests for the SAS Workload Orchestrator web interface to any of the master candidate hosts. Further, the SAS software is configured to reference the new reverse proxy instead of the first-deployed grid host.
So instead of this original URL to access the SAS Workload Orchestrator web interface:
You will use the reverse-proxy configuration in the SAS Web Server to access the SWO interface:
If you would like to learn more about the new SAS Workload Orchestrator as well as SAS Job Flow Scheduler and other new capabilities in SAS Grid Manager at 9.4 M6, refer to the Grid Computing in SAS® 9.4, Fifth Edition documentation.
When working with a multi-tiered (or multi-machine) deployment of SAS software, coordination of SAS software services as they run across hosts is important. SAS Technical Support provides the SAS_lsm utility to help manage the operations of SAS software on multiple machines.
A special word of thanks to Darwin Driggers for his long-suffering assistance in working through the nuanced aspects of this topic with me; to Doug Haigh for deep insights; and to Scott Parrish and the rest of the grid team for their cogent input as well.