SAS Viya has fully embraced a Continuous Delivery strategy, and new functionality gets released monthly. Sometimes this includes new components and changes or additions to the architecture; it’s important for SAS Administrators to keep up to date with the software.
The SAS Viya 2020.1.3 stable release includes an open-source distribution of Elasticsearch providing search capabilities to SAS Viya applications. This post describes its architecture and deployment in 7 points.
Those of you who have worked with SAS Visual Investigator, which is included in offerings such as SAS Intelligence and Investigation Management, may already be familiar with Elasticsearch. The same holds if you are already using the SAS Viya Monitoring for Kubernetes framework. Starting with the 2020.1.3 stable release, Elasticsearch also powers the search capabilities of SAS Information Catalog.
Starting with the 2020.1.3 stable release, SAS Viya includes a framework built around Open Distro for Elasticsearch, or ODFE for short.
Elasticsearch itself is a distributed search engine.
Why this nesting? It all boils down to licensing and legal terms. Although core Elasticsearch is open source, most of its enterprise-level plug-ins (for example, the ones providing secure access) are covered by a commercial license. On top of that, Elastic (the company owning the trademark and the copyrights) has recently changed the license of Elasticsearch (the product) to SSPL, in practice prohibiting any commercial offering of managed hosting services, to counter Amazon Elasticsearch Service.
ODFE, sponsored by Amazon, avoids these legal issues by providing both the core and select enterprise-level plug-ins under the Apache license.
SAS Viya provides Open Distro for Elasticsearch 1.7.0 as a base, upgraded to work with Elasticsearch 7.6.2, along with various ODFE plug-ins and a SAS-built Kubernetes operator to handle their installation and management.
Included ODFE plugins: opendistro_security, opendistro_index_management, opendistro-job-scheduler.
We also include additional Elasticsearch plugins, without modification: analysis-icu, analysis-kuromoji, analysis-nori, analysis-phonetic, analysis-smartcn, analysis-stempel, mapper-murmur3.
A final note about the release. SAS provides a version compatible with Elasticsearch 7.6.2 because this is the latest version currently supported by JanusGraph (a graph database, discussed below). As soon as JanusGraph allows it, SAS will probably upgrade to Elasticsearch 7.10 (the latest version fully covered by the Apache license).
For the current release, Open Distro for Elasticsearch is included in SAS Information Catalog, which is included in all SAS Viya offerings except SAS Data Science Programming.
SAS Information Catalog stores information inventory for SAS Viya data assets in the SAS Infrastructure Data Server (PostgreSQL), then uses ODFE to index pointers to that content. Indexed information is then used to search content. SAS Information Catalog indexes various information about the objects that are getting pointed to, like name, descriptive text, keywords, etc. In future releases of SAS Viya, by upgrading the license to SAS Information Governance, you will have the option to use a specialized graph database (such as JanusGraph) instead of PostgreSQL.
In the future, additional SAS Viya applications may be able to leverage the same ODFE provided with SAS Information Catalog.
SAS Information Catalog includes the following container images:
(SAS Information Governance will add sas-janusgraph in a subsequent release)
Of these, Open Distro for Elasticsearch is implemented by pods containing the sas-opendistro container, while the operator pod uses the sas-opendistro-operator container. The operator also uses a Kubernetes custom resource, called OpenDistroCluster, to describe the desired state of the cluster to instantiate.
Each SAS Viya deployment supports a single Open Distro for Elasticsearch cluster. Each ODFE cluster can be composed of one or more ODFE nodes, controlled by a Kubernetes StatefulSet: each ODFE node runs in a pod named sas-opendistro-default-0, sas-opendistro-default-1, etc. These pods have a SAS workload class = stateful.
The sas-opendistro OpenDistroCluster is the resource used to define the cluster; the operator then creates the StatefulSet objects and manages them. The operator runs in a single pod with a SAS workload class = stateless, and is implemented using the Operator SDK.
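To make the pod naming concrete, here is a minimal Python sketch of how a Kubernetes StatefulSet names its pods (the StatefulSet name followed by an ordinal starting at 0). The function name odfe_pod_names is hypothetical, and the StatefulSet name sas-opendistro-default is an assumption inferred from the pod names quoted above.

```python
from typing import List


def odfe_pod_names(replicas: int, statefulset: str = "sas-opendistro-default") -> List[str]:
    """Return the pod names a StatefulSet with `replicas` pods would create.

    Kubernetes names StatefulSet pods <statefulset-name>-<ordinal>, with
    ordinals starting at 0. The default StatefulSet name is an assumption
    inferred from the pod names mentioned in this post.
    """
    if replicas < 1:
        raise ValueError("an ODFE cluster needs at least one node")
    return [f"{statefulset}-{ordinal}" for ordinal in range(replicas)]


if __name__ == "__main__":
    # Print the expected pod names for a three-node cluster.
    for name in odfe_pod_names(3):
        print(name)
```

A list like this can be handy when checking that every expected ODFE pod is present in the namespace.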
Supported topologies and constraints are documented in the included Readme file available at <sas-bases/overlays/internal-elasticsearch/README.md>:
Elasticsearch documentation recommends dedicated master nodes and data nodes for production systems. Availability considerations (see below) suggest 3 master nodes and 3 data nodes. The number of data nodes can be increased to support searching more data.
For the current release of SAS Viya (2020.1.3), the topology (number and role of ODFE nodes) must be chosen at deployment time and cannot be
Open Distro for Elasticsearch requires persistent storage. By default, each ODFE node (whether with master or data role) creates one PVC with a default size of 128Gi and accessMode RWO, using the default Kubernetes StorageClass.
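Since every ODFE node gets its own PVC, the storage requested from the cluster grows linearly with the number of nodes. The following Python sketch does that arithmetic using the default size quoted above; the function name total_pvc_capacity_gi is hypothetical, and the result reflects only what is requested, not what the index actually uses.

```python
DEFAULT_PVC_SIZE_GI = 128  # default PVC size per ODFE node, as stated above


def total_pvc_capacity_gi(node_count: int, pvc_size_gi: int = DEFAULT_PVC_SIZE_GI) -> int:
    """Total storage (in Gi) requested by an ODFE cluster.

    Assumes one PVC per node, regardless of whether the node has the
    master or the data role.
    """
    if node_count < 1:
        raise ValueError("an ODFE cluster needs at least one node")
    return node_count * pvc_size_gi


if __name__ == "__main__":
    # Default single-node deployment.
    print(total_pvc_capacity_gi(1))
    # Topology with 3 master nodes and 3 data nodes.
    print(total_pvc_capacity_gi(3 + 3))
```

With the defaults, a single-node deployment requests 128Gi, while a topology with 3 master nodes and 3 data nodes requests 768Gi.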
For production environments it’s recommended to use local, fast block storage (aka disks), not a remote filesystem such as NFS. An example kustomize transformer to use Azure Premium disks is described in the included Readme file available at <sas-bases/examples/configure-elasticsearch/internal/storage/README.md>.
Internal testing in SAS labs has shown some approximate sizing:
(1M objects is roughly what you get by scanning 10K tables with 100 columns each; each column is one item in the search index.)
Open Distro for Elasticsearch is seamlessly deployed with SAS Viya 2020.1.3 and later. A sample kustomization.yaml file can be found in the deployment instructions.
When upgrading from a previous release, a few lines specific to the new software should be added to the existing kustomization.yaml, as explained in the deployment notes for Version Stable 2020.1.3.
In practice, the instructions are the same in both cases: add a few new lines to kustomization.yaml. They include one note worth calling out:
Note: The sysctl-transformers.yaml transformer uses a privileged container to set vm.max_map_count. If privileged containers are not allowed in your deployment, do not add this line. Instead, the Kubernetes administrator must set the vm.max_map_count property for stateful objects manually.
That property should be set on all the Kubernetes nodes that can host a SAS workload class = stateful, by logging into the node and setting the property at the OS level as described in these instructions.
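After setting the property manually, it can be useful to verify the result on each node. The Python sketch below reads the current value from /proc/sys/vm/max_map_count (the standard Linux location for this kernel setting) and compares it with 262144, the minimum that the Elasticsearch documentation asks for. Treat that threshold as an assumption here, since the value required by the SAS instructions may differ; the function name max_map_count_ok is hypothetical.

```python
from pathlib import Path

# Minimum value recommended by the Elasticsearch documentation.
# Assumption: the SAS deployment instructions ask for the same value.
REQUIRED_MAX_MAP_COUNT = 262144

# Standard Linux location of the vm.max_map_count kernel setting.
SYSCTL_PATH = Path("/proc/sys/vm/max_map_count")


def max_map_count_ok(raw_value: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if the raw sysctl value meets the required minimum."""
    return int(raw_value.strip()) >= required


if __name__ == "__main__":
    if SYSCTL_PATH.exists():
        current = SYSCTL_PATH.read_text()
        status = "OK" if max_map_count_ok(current) else "TOO LOW"
        print(f"vm.max_map_count = {current.strip()} ({status})")
    else:
        print(f"{SYSCTL_PATH} not found; is this a Linux node?")
```

Run it on every Kubernetes node that can host the stateful workload class, since an ODFE pod may be scheduled on any of them.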
The default deployment with one single Open Distro for Elasticsearch node does not provide high availability.
For the current release, where ODFE is only used by SAS Information Catalog, this should not be an issue. In fact, if the search index is lost, SAS Information Catalog rebuilds it from the information catalog stored in PostgreSQL. Even in the case of disaster recovery, the search index is neither backed up nor restored; SAS Information Catalog uses the information catalog service to rebuild the index when needed.
It is possible to configure ODFE for High Availability by following the instructions in the README file available at <sas-bases/examples/configure-elasticsearch/internal/topology/README.md>. When designing a High Availability environment, consider the following points:
Open Distro for Elasticsearch offers exciting new capabilities to SAS Viya. A correct understanding of its architecture will empower SAS administrators to build the best environment to suit end users’ requirements.
Many thanks to Terry Quigley, Iain Jackson, and Nancy Rausch for the great information provided while writing this post.