CI/CD: Keeping Up with the SAS Viya Infrastructure-as-Code Project


Recently I was running yet another deployment of SAS Viya to the AWS cloud. In particular, I was stepping through my established process to confirm that it still worked with the stable-2021.1.5 release. But early on, I hit a snag. It occurred before I even got to download my SAS Viya order, when I tried to provision the cloud infrastructure that SAS Viya runs on. Of course, I was using the Infrastructure as Code (IaC) project, viya4-iac-aws, to do that.

I thought it would be helpful to share the experience with you - not to solve just this one problem, but as an example of what might happen. Now that software is delivered through continuous integration and continuous delivery (CI/CD), this kind of problem can occur at any time. We all need to practice troubleshooting the problems that will arise as changes are continuously introduced to the software, the deployment toolchain, the infrastructure, and so on.

viya4-iac-aws

The viya4-iac-aws project (and its variations for -azure and -gcp) makes quick work of standing up the cloud infrastructure that SAS Viya needs. As a guiding concept, remember that the sample files are provided to quickly provision a pre-determined set of machines, but they also act as a guide to follow when crafting infrastructure specifically suited to your customer's needs. Expect to make changes - the IaC sample provisioning files are likely not what your customer will actually use for a real working environment.

What I Saw

I already have an established process consisting of files and scripts that I've successfully used to deploy SAS Viya many times over. The last time that I ran it was a couple of weeks ago. I was ready to give it another go when the stable-2021.1.5 release came out.

So keep in mind from this point that the challenge is not with SAS Viya - because I haven't even gotten to download the order yet.

When I got to the step where I run the terraform plan command - which takes as input a .tfvars file that I built based on the IaC samples to describe the Kubernetes cluster I want - it failed saying:

│ Error: Operation failed
│
│   on security.tf line 23, in resource "aws_security_group_rule" "vms":
│   23:   count             =  ( ( (var.storage_type == "standard" && var.create_nfs_public_ip) || var.create_jump_vm )
│     ├────────────────
│     │ var.create_nfs_public_ip is null
│     │ var.storage_type is "standard"
│
│ Error during operation: argument must not be null.

Basically, Terraform is complaining that it cannot build the plan because one variable in particular - var.create_nfs_public_ip - is null. The whole point of the .tfvars file is to define variables - and sure enough, when I look in mine, there's no reference to create_nfs_public_ip.

Determining the Root Cause

I download and build my local copy of the viya4-iac-aws project each time I stand up a new environment. And my procedure relies on using the latest version of viya4-iac-aws. As discussed in my previous post about Contemplating Version Pinning for CI/CD, this could lead to unexpected errors over time as the viya4-iac-aws project is updated with new features and improvements.

When I see an error message like the one above - where Terraform is stumbling on an undefined variable - it prompts me to ask what new functionality has been added to the viya4-iac-aws project. For this particular variable, I went to the samples to see when they were last updated and what changes were made.

That's when I saw that viya4-iac-aws/examples/sample-input-minimal.tfvars was changed recently with some new lines added at the end:

#Jump Server
create_jump_vm= true
jump_vm_admin= "jumpuser"
jump_vm_type= "t3.medium"

#NFS Server
#required ONLY when storage_type is "standard" to create NFS Server VM
create_nfs_public_ip= false
nfs_vm_admin= "nfsuser"
nfs_vm_type= "m5.xlarge"

Whereas my current project's .tfvars file simply ends with:

#Jump Server
create_jump_vm= true

In short, the IaC team has modified the project slightly so that creation of the NFS server is parameter-driven through these new variables (with a similar change to the jump server's creation process).
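That comparison between the sample and your own file can also be scripted. The sketch below is one way to do it and is not part of the viya4-iac-aws project: the function names tfvars_keys and missing_tfvars_keys are made up for illustration, and it assumes top-level variables are written as name = value starting in the first column, as in the excerpts above.

```shell
# List the variable names assigned at the top level of a .tfvars file.
# Assumption: each assignment starts in column 1 as `name = value`;
# comment lines (#...) and indented nested keys are ignored.
tfvars_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_-]*[[:space:]]*=' "$1" \
    | sed -E 's/^([A-Za-z_][A-Za-z0-9_-]*)[[:space:]]*=.*/\1/' \
    | sort -u
}

# Print every variable set in the sample file ($1) that is absent
# from your own file ($2), one name per line.
missing_tfvars_keys() {
  sample_keys=$(tfvars_keys "$1")
  own_keys=$(tfvars_keys "$2")
  for key in $sample_keys; do
    printf '%s\n' "$own_keys" | grep -qxF -- "$key" || printf '%s\n' "$key"
  done
}
```

Given only the two excerpts above, this would report create_nfs_public_ip, jump_vm_admin, jump_vm_type, nfs_vm_admin, and nfs_vm_type. A name on that list is a lead to investigate rather than proof of a problem, since a variable may have a usable default in the project.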

How I Fixed It

The fix here is pretty easy - I added the new lines to my local .tfvars file and re-ran the terraform plan step. After that completed successfully, I continued on with the rest of my deployment process.

Once I was happy with the outcome, I updated my saved .tfvars file so future iterations of the process will pick up those new variables as well.
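To catch this class of failure before Terraform does, a small guard could run ahead of the plan step. This is a sketch under assumptions, not the procedure described above: plan_with_required_vars is a hypothetical helper, the variable names it checks are whatever you pass in, and the file paths are yours to choose. The -var-file and -out options are standard terraform plan options.

```shell
# Hypothetical guard: confirm that each named variable is assigned in the
# .tfvars file, then hand off to terraform plan.
# Usage: plan_with_required_vars <tfvars-file> <plan-file> <var-name>...
plan_with_required_vars() {
  tfvars_file=$1
  plan_file=$2
  shift 2
  if [ ! -f "$tfvars_file" ]; then
    echo "no such file: $tfvars_file" >&2
    return 1
  fi
  for name in "$@"; do
    # Assumption: assignments start in column 1 as `name = value`.
    if ! grep -Eq "^${name}[[:space:]]*=" "$tfvars_file"; then
      echo "not set in $tfvars_file: $name" >&2
      return 1
    fi
  done
  terraform plan -var-file="$tfvars_file" -out="$plan_file"
}
```

A call such as plan_with_required_vars my.tfvars my.plan create_jump_vm create_nfs_public_ip (both file names are placeholders) would stop with a one-line message naming the missing variable instead of reaching the null-argument error.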

This is the right approach for me because I want to keep up with the latest changes in the IaC and I'm comfortable troubleshooting challenges on short notice. But this might not be the right approach for you or your customer depending on your objectives.

The alternative would be to use an older version of the viya4-iac-aws project from before these new variables were introduced. The Tags page for viya4-iac-aws on GitHub lists the available release versions.

Therefore, another approach I could've used to fix this problem would be to select a slightly older release of the viya4-iac-aws project and use that instead. That way, the IaC would use its older method for provisioning the NFS server (and Jump server) that doesn't need the new variables.

Switching to a different tag is pretty easy. From the host machine where you're running the viya4-iac-aws project, get a listing of available tags:

[cloud-user | viya4-iac-aws]$ git tag -l
0.6.0
0.7.0
1.0.0
1.1.0
1.1.1
2.0.0
2.1.0
3.0.0

Then use the checkout command to switch over to a different tag (choosing 2.1.0 here):

[cloud-user | viya4-iac-aws]$ git checkout tags/2.1.0
Note: switching to 'tags/2.1.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 882ea0f Auto Scaling (#72)

Since we only want to use the files in here - not make changes to push back to the remote git repo - we can ignore the suggestion about branching.

Pinning to a particular version in this way helps maintain a repeatable process that's likely to break less frequently. The tradeoff is that you shouldn't allow yourself to get too complacent and ignore the march of time. Eventually, even a pinned known-good version is likely to fail after enough time has passed because of changes to the cloud provider, security enhancements, etc. And if you've successfully ignored a large number of changes over a long period, then bringing this step of the process forward to the current day might entail a lot of work on your part.
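For a scripted process, the tag checkout can be wrapped so that the pin is explicit and verified before anything else runs. The following is a minimal sketch, assuming an existing local clone; pin_to_tag is a hypothetical helper name, and the advice.detachedHead setting is passed per command only to silence the detached HEAD notice.

```shell
# Check out a release tag in an existing clone and confirm that HEAD
# ended up on the commit the tag points to.
# Usage: pin_to_tag <repo-dir> <tag>
pin_to_tag() {
  repo_dir=$1
  tag=$2
  git -C "$repo_dir" -c advice.detachedHead=false \
    checkout --quiet "tags/$tag" || return 1
  head_commit=$(git -C "$repo_dir" rev-parse HEAD) || return 1
  tag_commit=$(git -C "$repo_dir" rev-parse "refs/tags/$tag^{commit}") || return 1
  [ "$head_commit" = "$tag_commit" ]
}
```

Calling pin_to_tag viya4-iac-aws 2.1.0 at the top of a deployment script would make the pinned version visible in one place, which also makes it easy to find and raise when the time comes to move forward.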

Takeaway

The interesting thing here is how much control we now have over what we're working with - including the ability to roll back to earlier releases as needed.

Even so, this is a pretty simple example of the kinds of challenges you might face working with SAS and other vendors' tools that rely on CI/CD concepts for delivery and operation. Troubleshooting these challenges with automated scripting tools like the viya4-iac-aws project (and the viya4-deployment project) is becoming a normal part of the job for SAS personnel working with CI/CD pipelines that range from informal to highly structured. We're expected to identify the primary problem, sleuth around to find the answer, and then craft a solution that fits the long-term objectives. These are baseline skills we all need to practice and master.

Going further, if the problem turns out to be a bug in the code, then it's good practice to report the issue to the project at a minimum. Even better, if you're able to fix the problem, consider branching the project, coding the fix, and then sending a pull request so that other users of the project can benefit from your solution.

SAS Communities Library