Did you know that at SAS we work with three different kinds of grid technology? Sometimes it's crazy-town banana pants around here just figuring out which grid(s) we're actually talking about. Discussions around system requirements, deployment, operation, and administration of these grids go sideways all too often because the people involved are really talking about different underlying grid technologies but are simply referring to "the grid". Let's try to straighten it all out.
In general terms, a grid is simply a collection of computers which are grouped together to perform a common goal. A grid is conceptually different from a cluster in that each node of a grid is usually directed to perform independent jobs, whereas a cluster will automatically divide a single task across multiple hosts. If you think that "grid" and "cluster" are interchangeable terms, in some regards they are - it really depends on one's perspective, such as the meaning of "common goal" and "independent job". No doubt you understand why this is then a companion discussion to my SGF 2017 paper Twelve Cluster Technologies Available in SAS 9.4. ;)
At SAS, we're exposed to nearly the entire spectrum of the grid-cluster space. So buckle up and let's dive deeper. The kinds of grids we see at SAS are:

- SAS Grid Manager
- The SAS High-Performance Analytics Environment (tkgrid)
- SAS Cloud Analytic Services (CAS) in SAS Viya
- Generic grids
Let's very briefly look at what makes these grid technologies different from each other.
The grid technology provided by SAS Grid Manager allows us to engage many different compute servers of different sizes and resource capacities. SAS Grid Manager is designed to work with a third-party load balancer product, either IBM Platform LSF or Apache Hadoop YARN. LSF offers a far richer set of capabilities than relative newcomer YARN - but a SAS user doesn't necessarily need to know or care. They can write and submit their code exactly the same and the grid technology will get the job done.
Intelligent load balancing is a major feature of this kind of grid technology. For example, in a grid where half the servers have 8 CPUs each and the other half have 12 CPUs each, the load balancing software is aware of the difference and will, on average, send more jobs to run on the 12-CPU machines. But that's not all: depending on the load balancer employed, grid administrators can direct load balancing based on factors such as priority and pre-emption, isolate resources for certain tasks or teams, and much more.
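To give a feel for how priority and pre-emption are expressed with Platform LSF, here is a minimal sketch of a queue definition in the lsb.queues configuration file. The queue names are hypothetical - treat this as illustrative, not as a drop-in configuration:

```
Begin Queue
QUEUE_NAME = priority_jobs
PRIORITY   = 50
# Jobs in this queue may preempt jobs running in the "normal" queue
PREEMPTION = PREEMPTIVE[normal]
End Queue
```

An administrator defines queues like this once, and LSF then enforces the policy automatically for every job submitted to the grid.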
For more information about SAS Grid Manager software, see:
The SAS High-Performance Analytics Environment (a.k.a. tkgrid) provides the software framework used for SAS High-Performance Analytics Server solutions as well as SAS LASR Analytic Server. The idea with tkgrid is that it allows us to start a single logical software service which is distributed to run across multiple host machines at the same time. In the introduction of this blog post, we made a distinction between a grid and a cluster. In this regard, the tkgrid software acts more like a cluster in that it automatically breaks a single job up to run across multiple hosts simultaneously. While this is an important detail in understanding how tkgrid-based software works, what matters more to this discussion is that the naming of the tkgrid software certainly leads to confusion with other grid technologies.
For more information about tkgrid software, see:
SAS Viya is our new flagship brand and the banner under which our newest technologies and capabilities will be delivered in the years to come. The engine driving Viya is SAS Cloud Analytic Services. In many ways, CAS is similar in concept to the next generation of tkgrid, but it's directed by a broader charter to act as the central analytic engine for all of SAS Viya. In other words, it represents a massive engineering effort to completely refactor how SAS software will process and analyze data. It offers more than improved scalability and availability over tkgrid - it is intended to be fully elastic, capable of running on one machine or many, directly on physical hosts, in virtualized environments, or automatically managed by cloud vendors.
For more information about CAS software, see:
A generic grid is used here as a placeholder referring to any collection of server machines which are expected to work together and need administrative alignment. There is one piece of SAS-branded software which works specifically with a generic grid: the SAS High-Performance Computing Management Console.
SAS HPCMC is a rebranded OEM copy of the well-known sysadmin utility Webmin. The primary job of HPCMC is simply to create and manage operating system users, groups, and their associated SSH keys across a collection of machines (referred to as "gridhosts"). The use of HPCMC is completely optional in support of SAS high performance deployments. If your customer prefers some other user administration toolset which can create OS users, groups, and SSH keys as required by SAS software, then they can use that instead.
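To illustrate what HPCMC is saving you from, here is a hypothetical sketch of the per-host chores it automates for each entry in /etc/gridhosts: creating a consistent OS account and installing its SSH public key. The host names, account name, and UID below are made up; the script prints the commands an admin would otherwise run by hand, rather than executing them:

```shell
#!/bin/sh
# Sketch only: emit the provisioning commands for one user across a
# list of grid hosts. Nothing here is executed against real machines.
provision_user() {
    user="$1"; uid="$2"; shift 2
    for host in "$@"; do
        # Same UID on every host keeps file ownership consistent
        # across shared storage.
        echo "ssh root@$host \"useradd -u $uid -m $user\""
        echo "ssh root@$host \"mkdir -p /home/$user/.ssh\""
        # Install the shared public key for passwordless SSH,
        # which tkgrid-style deployments require.
        echo "scp $user.pub root@$host:/home/$user/.ssh/authorized_keys"
    done
}

provision_user sasdemo 2001 grid01 grid02 grid03
```

Multiply those three commands by every user and every machine in the environment and the appeal of a central console (HPCMC, Webmin, or whatever the customer already uses) becomes obvious.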
You'll find that HPCMC is included with the SAS High-Performance Analytics Infrastructure bundle along with the HPAE software as well as the SAS High-Performance Distribution of Hadoop.
For more information about the HPCMC software, see:
It is entirely possible to combine some or even all of these different grid technologies into a single customer deployment. The challenge is that these grid technologies are not really aware of each other and each has its own unique set of requirements and administration.
Let's look at a couple of common entanglements:
The tkgrid software is delivered alongside HPCMC in a bundle known as the SAS High-Performance Analytics Infrastructure. Each piece of HPAI has its own installation instructions provided in our documentation - see the HPAICG doc. The first piece to install is HPCMC, and a requirement is to create a file named /etc/gridhosts with a list of all machines in the environment which need a common set of users, groups, and SSH keys. This list will include the SAS Metadata Server host(s), SAS Compute Tier host(s), and all of the tkgrid hosts. When HPCMC is started, it consults the list of machines in /etc/gridhosts so it knows where to act.
Moving on to the installation of tkgrid (HPAE), there is an installation step directing you to provide a list of machines to host tkgrid in a file which is also named /etc/gridhosts. This file is conceptually different from the one used by HPCMC - it should only list the tkgrid host machines. Furthermore, this /etc/gridhosts file is only referenced at installation time - it will be copied to a file named grid.hosts in the designated tkgrid directory. For clarity, my personal preference is to create this installation-time file as /tmp/tkgridhosts instead.
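To make the distinction concrete, here is a sketch of the two lists with hypothetical host names. The HPCMC file covers every machine needing common users, groups, and SSH keys, while the tkgrid installation list (written here to /tmp/tkgridhosts to avoid the name collision) covers only the tkgrid machines:

```
# /etc/gridhosts (read by HPCMC at startup):
# all machines needing common users, groups, and SSH keys
sasmeta01
sascompute01
tkgrid01
tkgrid02
tkgrid03

# /tmp/tkgridhosts (used only during tkgrid installation,
# then copied to grid.hosts): the tkgrid machines only
tkgrid01
tkgrid02
tkgrid03
```

Notice that the tkgrid list is a strict subset of the HPCMC list - a useful sanity check during deployment planning.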
Keeping the concept of the different /etc/gridhosts files straight is important to a well-planned and executed deployment.
It is increasingly common to discuss architectures which combine SAS Grid Manager alongside tkgrid. One popular scenario is to attempt to share a common set of machines with both grid technologies. It is possible but there are certain considerations and tradeoffs to weigh since the two grids don't directly coordinate their activities with each other. One idea is to configure SAS Grid Manager to only utilize resources when they are otherwise idle (for example, only at night) which allows tkgrid to perform best since its design defaults to assuming all machine resources are equally available.
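One way the "only at night" idea can be expressed with Platform LSF is a queue-level run window in lsb.queues, so grid jobs are dispatched and allowed to run only during off-hours. This is a hedged sketch - the queue name and times are illustrative, and a real deployment would weigh run windows against other controls such as dispatch windows or host-level limits:

```
Begin Queue
QUEUE_NAME = night_batch
# Jobs run only between 7pm and 6am; outside this window,
# running jobs are suspended and new ones wait.
RUN_WINDOW = 19:00-6:00
End Queue
```

With SAS Grid Manager jobs confined to a window like this, tkgrid gets the machines mostly to itself during business hours, matching its assumption that all resources are equally available.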
Notice that the only load-balancing performed by tkgrid is when it attempts to equally distribute data to all worker nodes. This arrangement is based on the simple assumption that all machines which act as hosts to tkgrid services are pretty much identical in available resources and workload. Contrast that to SAS Grid Manager paired with Platform LSF which carefully weighs all resource metrics in determining how to distribute workload. These two technologies take very different approaches to load balancing and it's up to us as architects and administrators to keep them playing together nicely.
There's another approach - using Apache Hadoop YARN for load balancing. Both tkgrid and SAS Grid Manager can integrate with YARN. This assumes, of course, that Hadoop is *also* deployed on the same set of machines. It's getting crowded in here! Furthermore, YARN is still early in its development and currently acts very much like a lazy and vindictive maître d'. It takes "reservations" for resources which may or may not actually be used. If a reservation is made but never used, the lazy YARN maître d' sits idle waiting for a task which never appears while blocking any new tasks from starting. On the other hand, if a job exceeds the pre-defined limit of its "reservation", the vindictive YARN maître d' simply kills it (instead of suspending it or running it at lower priority).
As you can see, there are numerous aspects to the grid technologies used by SAS. It is easy to imagine where we might inadvertently use terms like "grid" and "gridhosts" in discussing different software offerings in a way that can be confusing within our own teams as well as with customers. While we've only scratched the surface here, the goal of this post is to convey that we all need to be careful in terms of documentation, communication, planning, and administration when working with any of the grid (and grid-in-name) technologies at SAS.
Rob Collum is a Principal Technical Architect with the Global Architecture & Technology Enablement team in SAS Consulting. When he's not making another trench run after turning off his targeting computer, he enjoys sampling coffee and m&m’s from SAS campuses around the world.