SAS Viya Architecture Design in GCP: the basics

7 Likes

As Google Cloud Platform (aka “GCP”) is getting more and more interest from our customers, I thought it could be good to share some basic recommendations I recently provided to one of our customers in France for running SAS Viya in the Google public Cloud.

Many recommendations are based on the excellent SAS paper by Margaret Crevar covering the 3 major Public Cloud vendors: SAS Paper 1866-2018: Important Performance Considerations When Moving SAS® to a Public Cloud (for both SAS 9 and SAS Viya).

The Google Cloud is nicely documented so I’m not planning to duplicate the official documentation here, but I reference it, so you can deepen your research if you’d like.

The goal is to highlight some of the GCP specificities, raise awareness of the key recommendations from the SGF paper and share some learnings, information of interest that could influence your design decisions, so you can start the architecture discussions with your customer.

Speaking of your customer ... make sure that your customer has the right skills for the Cloud they choose to use. While it's great to have the freedom to "do what you want," it's much better to have an active discussion with their teams. You would not want to propose something that would cost them more than they thought, or even worse, violates their IT policies. If IT is not at the negotiation table, you'll never know about those policies until it has become a very awkward conversation.

Since the content of this post has grown far beyond what was originally intended, here is a table of content below that you can use to jump directly to the topic of your interest. 🙂

Google Compute Engine (GCE)
Network considerations
Storage in GCP
- Available solutions
- Throughput
Storage recommendations for SAS Viya
Conclusion

Google Compute Engine (GCE)

According to Google documentation “Compute Engine instances can run the public images for Linux and Windows Server that Google provides as well as private custom images that you can create or import from your existing systems.”

GCE is the Google Cloud equivalent of the EC2 service in AWS – we are talking about IaaS (Infrastructure as a Service). What we get is a virtual server with an OS and we can connect to it and perform operations at the Operating System level (like installing binaries, edit files, run commands and system services, etc.…).

Google also offers many other services or mechanisms that could be potentially used in a Viya deployment such as HTTP load-balancers, Managed Instance groups (with auto-healing and auto-scaling features), ODBC compatible Databases, and something called GKE (for Google Kubernetes Engine) where you can provision containers instead of complete machines – but that is out of the scope of this article.

Machine Types

We can choose the machine properties for our GCE instances. The machine types are Virtual Machines with predefined configuration, basically a fixed number of “vCPU” and a given amount of RAM.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

In GCE, there are currently 5 pre-defined families of machines types: standard, high-memory, high-cpu, shared-core and memory optimized.

The main difference between the machine types families is the CPU-RAM ratio (from 0.9 to 24 GB per vCPU) and of course the prices vary depending on the machine type.

It’s worth noting that unlike AWS, GCP also allows you to “customize” and tweak this CPU/RAM ratio and add GPU processors.

Predefined machine types have preset virtualized hardware properties and a set price, while custom machine types are priced according to the number of vCPUs and memory that the virtual machine instance uses.

You can find the details of the machine types here but for Viya we already have some recommendations depending on the machine role in a Viya deployment:

SAS Programming Run-time (SAS PRE and associated services): if most of the analytics heavy processing is done in CAS, the “standard” series machine will usually be appropriate. Otherwise, you will need an instance with more resources : the SAS PRE can and will demand high CPU and memory resources depending on the type of workload.
SAS Viya Services: the “high-memory” series is a good choice as the Viya services require higher amount of memory to run the numerous microservices and web applications. Those machines have 6.5 GB of system memory per vCPU. If all the SAS Viya services are planned to be on the same host, at least a n1-highmem-16 machine type is required (with 104 GB of RAM).
CAS nodes: We know that these servers require fast CPUs for processing data and enough physical RAM to hold all the data files to be analyzed by all the concurrent SAS Viya users. CAS is an In-Memory analytics engine. So, the “high-memory” (from n1-highmem-16) or even the new “memory optimized” series (up to 48 GB of RAM per physical core) will be a good fit*

(*) Note: be careful with “ultramem” and “megamem” instances as they are not available in all the Google Cloud zones.

A core is not a core

Something important to also understand is that GCP express the processing power as vCPU (short for virtual CPU). GCP machines are configured with Hyper-Threading (two threads per physical CPU core) which means that vCPU listed in the above tables are threads and not physical cores.

To summarize, 1 vCPU in GCP = 0,5 physical CPU core.

So, the GCP instances vCPU count must be divided by 2 to match the number of cores used in the SAS licensing model. As the HT information is visible from the host, the CAS license control algorithm can apply properly.

The “lscpu” command gives a lot of information. See below the output for a n1-standard-8 (8 vCPUs, 30 GB memory) instance.

With the “thread(s) per core” value of 2, we see that Hyper-threading is enabled.

One interesting learning is that (unlike AWS), GCP does not expose the exact CPU chipset that is used in the underlying host (we only see that the VM host is running 2.3 GHz Intel Xeon processors).

It is also worth pointing out that the NUMA information being provided by GCP may not be wholly accurate in certain situations. In a simple test where a 64vCPU instance in GCP was set up, the number of physical cores was reported as 32 but the number of sockets was reported as 1. However, in searching the Intel pages the GEL team could not find any current Xeon E5/E7 processors with more than 28 physical cores per socket. See this Intel page for more details.

Sizing

Finally, two things generally drive the GCE Machine type choice:

EEC Sizing: CPU/RAM ratio and amount of memory from the EEC sizing.
- In the CAS world, the amount of memory is directly linked to the amount of data that you want to work with.
Number of licensed cores
- In an ideal world, it corresponds to the number provided in the EEC sizing. However, we know that sometimes it is not so simple ??

Your role will be to map the sizing recommendation and licensed number of cores to the GCE instance types.

If other applications are co-located with machines of your Viya deployment, you also need to take them into account.

For example, if your CAS nodes are a co-located with a Hadoop cluster, you need to provision more resources RAM, Disks and more CPU cores than what appears in the CAS license (CAS is able to restrict its own utilization to the number licensed cores).

Network considerations

Egress and Ingress traffic

Google Cloud network documentation uses the concepts or “ingress” and “egress” traffic:

ingress: traffic entering or uploaded into Google Cloud Platform
egress: traffic exiting or downloaded from Google Cloud Platform

While ingress traffic is free, egress traffic is charged based on the source and destination of such traffic. If you’d like to understand why I’d recommend this interesting article.

In GCP, the outbound traffic bandwidth (egress) of a machine is limited by its number of vCPU. Each vCPU is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.

So, for Google Cloud machines with less than 8 vCPU, the maximum network bandwidth is limited compared to a bigger instance type.

As explained here, there is no such quota in GCP for the inbound traffic (ingress). The amount of traffic a VM can handle depends on its machine type and operating system. Ingress data rates are not affected by the number of network interfaces a VM has or any alias IP addresses it uses.

A crucial parameter for CAS MPP environment performance is obviously the network speed. Installing and using tools such as “iperf” or “qperf”, before installing Viya, to gauge the network throughput and latency of the CAS infrastructure is always a good idea.

Basic recommendations

So here are some general recommendations regarding the network:

Always use the internal IP for communications
- inside Viya (between all Viya machines – especially between CAS nodes)
- and between CAS and the Data source machines (if they are also in GCP)
Put the instances in the right zone
- If there is a long geographical distance between the GCP zone hosting the Viya platform and the Data lake it impacts the performances.
- The same would apply, to a lesser extent to the users: if they are far away from the Viya environment, their access to the web app may be sluggish, due to the added latency (for a small group of users using Virtual Desktop infrastructure in GCP could be an interesting option to consider).
Choose the right core-count
- For CAS nodes, don’t use instances with less than 8 vCPU (otherwise you limit the network bandwidth between them).

When using a physical machine, you can check the NIC (Network Card Interface) speed running command such as “ethtool”. However, in GCP (as in any other virtualized/cloud environment), the instances are using virtual Network Interface Cards (NICs). Physical NICs are not directly connected to the VM but are providing an uplink to the virtual switch running on the KVM hypervisor host.

Finally, although it is not possible to choose to attach a specific physical NIC (Network Card interface), you still can attach multiple NICs to a single GCE instance and configure an application to use a specific network interface for its communications if its technology allows it.

However, in GCP remember that the network speed limitation of 16 Gbps is set at the instance level so having multiple NICs might not help with the performance.

Security

By default, all GCP instances have one internal IP address and one ephemeral external IP. Of course, for a Viya deployment, you will create an IP address for the machines hosting public facing services (such as the HTTP Apache server) and add firewall rules to open the required ports.

Although it is not your primary responsibility as a SAS Consultant, you should be aware that site-to-site VPNs are the gold standard in terms of access:

from corporate environment, up to the cloud,
but even more important, for you cloud assets to be able to reach back into your on-prem environment

Otherwise, a random machine on the internet will be allowed to talk to AD or to a database.

Finally, every VPC comes with implied firewalls rules that are applied to both outgoing (egress) and incoming (ingress) traffic in the network by default.

By default, only SSH and RDP access are allowed from the outside while internal communications between instances on the same subnet are allowed.

With the default implied rule, all egress traffic is allowed, and you cannot change it. However, you can override it with your own restrictions.

In addition, GCP always blocks some traffic, regardless of firewall rules.

Storage in GCP

Available solutions

Another key area for CAS performance is obviously the I/O speed. The SAS runtime has always been an I/O disk intensive application and it stills true with the Viya SPRE and CAS. The faster SAS can read and write the data, the shorter the analytics processing will take to run.

Today in GCP, there are four available storage solutions for your Compute Engine instances:

Zonal standard persistent disk and Zonal SSD persistent disk: Efficient, reliable block storage.
Regional persistent disk and regional SSD persistent disk: Regional block storage replicated in two zones.
Local SSD: High performance transient local block-storage.
Cloud storage buckets: Affordable object storage.

Source: https://cloud.google.com/compute/docs/disks/

For a SAS deployment we are interested in the first and the third item: “Standard Persistent disk”, “SSD persistent disk” and “local SSD”. Regional persistent disks are designed to answer specific DR requirements and the Cloud storage bucket (AWS S3 equivalent) are not supported with SAS or Viya, yet.

Throughput???

The I/O pattern for SAS is “large reads and writes”, the limiting factor is the throughput.

Persistent disk operations are subject to the egress network traffic limitation (which depends on the number of vCPU). This means that persistent disk write operations are capped by the network egress cap for your instance. SSD persistent disks can achieve greater IOPS and throughput performance on instances with greater numbers of vCPUs.

Throughput performance also depends on the disks size (especially for large reads and writes). For example, to have the same performance as a 7200 RPM SATA drive (which typically achieves 120 MB/s), you need to have at least a 1TB standard persistent disk or a 250GB for SSD persistent disk.

The table below comes from the GCP documentation and gives an idea of the maximum sustained throughput that can be achieved (with large volume sizes and highest instance types).

While CAS is less sensible to the disk I/O (it mostly works in memory – except for the CAS Disk Cache interactions during data loading phase, intermediate tables creation or paging when memory is over-committed), those values could be a limiting factor for SAS 9 I/O intensive solutions as it is likely that the general requirement of 100 or even 150 MB/s per core will NOT be met.

Local SSD Drives are physically attached to the server that hosts your virtual machine instance. They have a higher throughput and lower latency than standard persistent disks or SSD persistent disks.

Each local SSD is 375 GB in size, but you can attach up to eight local SSD devices for 3 TB of total local SSD storage space per instance. The disk interface can be either: SCSI (default) or NVMe (new). NVMe is usually faster but not all the images have the optimized drivers to take advantage of it.

So local SSD drives are a good fit for temporary storage location which requires high I/O throughput such as SASWORK/SASUTIL, and potentially, CAS_DISK_CACHE (although you will surely need to write a script whose role is to format, stripe, and mount those transient drives every time the machine reboots).

However, there is a big drawback currently in GCP when using the local SSD Drives. As clearly stated in the documentation, "YOU CANNOT STOP AND RESTART AN INSTANCE THAT HAS A LOCAL SSD.So it means that if you had plans to stop your machines during the night or downtime to save some Cloud money it is not possible...all you can do is to delete the machine.

When you do so, you lose not only the SSD drives content but also your machine settings, IP adresses and boot disk content.

It is not only ephemeral storage, it makes your whole instance ephemeral. It could be a big showstopper for your customer.

A way to use the fast SSD Drives (while keeping the flexibility to stop the costs when not used) would be through the automation of a complete rebuild/redeployment. Which is very doable for things like CAS Workers (thanks to our Ansible deployment playbook) but not so easy for things as SPRE or SAS 9 servers.

(Thanks to Ekaitz Goienola for the heads-up on this constraint)

Storage recommendations for SAS Viya

Generally, the choice of the storage to use, depends on how much space is needed and what are the performance characteristics required by the application.

Here are some recommendations for a SAS Viya deployment, based on what we know in terms of I/O profile for the various storage areas of SAS Viya.

Storage area	Machines	GCP Storage solution	Comment
SAS binaries and configuration(/opt/sas/viya)	All	Zonal Standard persistent disk	No need for fast storage but needs for persistence.
SASDATA location	SAS Programming Runtime (SPRE)	Zonal SSD persistent disk	Larger disks and instance with more vCPU will increase the maximum throughput.
CAS Data directory	CAS Controller	Zonal SSD persistent disk
SASWORK, SASUTIL	SAS Programming Runtime (SPRE)	Local SSD / Zonal SSD persistent disk	For local SSD, stripe at least 4 preferably 8 together to get the IO throughput for each instance*. When placed on striped SSDs, SAS WORK and SAS UTILLOC should share a single file system.
CAS Disk Cache	CAS Nodes	Local SSD

(*) Use RAID0 as using parity or mirroring RAID levels across local SSD devices does not provide any actual reliability or redundancy benefit. See this page for guidance.

Finally, it is always a good idea to ensure that your infrastructure meets the SAS I/O throughput requirements running the SAS IO test tool or even better an “ansibilized” version to really benchmark CAS workers parallel IO read/write activities.

Conclusion

Hopefully this article gave you some basics to engage the architecture design discussions if your customer is planning to deploy SAS Viya in the Google cloud.

One of the nice things in the Cloud world is that you can make a mistake.

For example, if you picked the wrong machine type it is not a big deal, you don’t have to decommission a machine, ask the customer to order a new one, wait for it to be delivered and cabled in the Data center…No! …All you need to do is to stop your Viya services (if they are already installed) then VMs, change the instance types and restart.

Another cool thing that you can do quite easily (especially in GCP) is to automate the whole deployment process. From the machines provisioning (using “gcloud” commands, Cloud Deployment Manager or the ansible “gce” module), the deployment (with Viya ARK for example for the pre-reqs and ansible for the Viya installation), up to the data loading and the “content” delivery steps.

If you take the time to implement such automated process, you can drastically improve the flexibility of your infrastructure and really benefits from the cloud model promise, such as:

Benchmark different sizing options. For example, depending on your products and use-case, it might be interesting to have less but more powerful CAS workers –it could reduce the amount of data that needs to transit on the network during the CAS shuffle operations.
Reduce the infrastructure costs by wiping out the whole infrastructure when you don’t need it and rebuild it automatically only when required (for example to run a monthly forecast across many workers).
Easily take advantage of the new instance type when they are made available and match the SAS Viya requirements
Etc…

Thank you for reading and my special thanks to Erwan Granger, Mike Goddard, Simon Williams, Margaret Crevar, Mark Schneider for their reviews and contributions.