
Manually destroying AWS infrastructure for SAS Viya


With SAS Viya, there's always something new to learn. Of immediate interest is the upcoming support for deploying SAS Viya in Google Cloud Platform (GCP) and Amazon Web Services (AWS). In particular, I've been digging into the collateral that SAS is building to deploy SAS Viya into AWS and what that means for the field.

 

There are SAS teams working diligently on streamlining the entire deployment process, and for this post, I'm looking directly at provisioning the infrastructure that SAS Viya needs in AWS. One effort that's pretty slick is the SAS Viya 4 Infrastructure as Code for Amazon Web Services (viya4-iac-aws) project.

 

Take a look at viya4-iac-aws and you'll see it provides guidance and examples using Terraform to stand up hosts, network rules, an Elastic Kubernetes Service (EKS) cluster, IP addresses, and much more. It takes you from zero to hero to provide infrastructure that's ready for your Viya deployment.

 

But, of course, it didn't take long for me to inadvertently cause a problem. Through my own error, I lost the ability to direct Terraform to destroy all of the items it created in AWS. Whoops. And that meant that I was now going to be responsible for finding and destroying everything manually. Throw in the myriad of AWS services where those items are defined, as well as a pretty strict chain of dependencies between them, and it can be an exercise in frustration. If you don't delete everything, then the next time you try to run terraform apply, it'll complain about resources already existing with the same name. Plus, a lot of those resources cost money… by the minute (or even second).

 

So follow along with me, a novice, stumbling through this task to the finish line. The end goal is some prescriptive guidance to help you if you face a similar circumstance and need to manually delete AWS resources in support of a SAS Viya deployment.

The problem

When Terraform applies its plan, it keeps track of the resources it creates in AWS in its tfstate file. For example, in your Terraform tfvars file (used to build the plan), you might describe the host(s) for CAS as having these attributes:

cas = {
  "vm_type"      = "m5.2xlarge"
  "os_disk_type" = "gp2"
  "os_disk_size" = 200
  "os_disk_iops" = 0
  "min_nodes"    = 1
  "max_nodes"    = 5
  "node_taints"  = ["workload.sas.com/class=cas:NoSchedule"]
  "node_labels" = {
    "workload.sas.com/class" = "cas"
  }
}

So, it's great that we'll end up with up to 5 nodes of type "m5.2xlarge" for CAS, but which specific machine instances do they correspond to in our environment? If we open up the EC2 dashboard and look, we'll find something similar to:

 

dar1.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
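If you'd rather skip the console, the AWS CLI can pull a similar list. This is just a minimal sketch, assuming the AWS CLI is configured for your account and your instances carry the resourceowner tag that shows up later in this post (substitute your own userid for rocoll):

$ aws ec2 describe-instances \
    --filters "Name=tag:resourceowner,Values=rocoll" \
    --query "Reservations[].Instances[].[InstanceId,InstanceType,State.Name]" \
    --output table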

 

That's what Terraform relies on its tfstate file for: to keep track of the specific instances of the resources we asked it to make.
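In fact, as long as the tfstate file exists, terraform state list will print one line per resource being tracked, which is about the closest thing to a master inventory you'll get. A quick sketch (the path to the tfstate file is just a placeholder):

$ terraform state list -state=/workspace/terraform.tfstate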

 

When I ran the plan step in Terraform, it informed me what it would build, including this summary:

Plan: 91 to add, 0 to change, 0 to destroy.

And when I applied that plan, Terraform built those 91 items in AWS.

 

And when I went to destroy that infrastructure later, Terraform informed me:

Plan: 0 to add, 0 to change, 0 to destroy.

Yikes.

 

The problem I created for myself is that I was running Terraform inside the viya4-iac-aws:latest Docker container and neglected to map an internal container directory to my host to save the tfstate file permanently. In other words, Terraform created the AWS infrastructure I wanted, and then promptly forgot all about those 91 resources. And without the tfstate file, I couldn't tell Terraform to destroy those 91 items with a single, simple command. Oh boy.

 

Remember that container images are immutable. When you direct Docker to run a command inside a container, the work is done, but nothing written inside the container survives once that container is removed (and with the --rm switch, removal happens as soon as the command finishes). So, for example, if you try to save a file inside the container, it'll be gone the next time you run the image. And that's what I did to my tfstate file. Docker provides the option to map a volume (a file or directory) on your host machine into the container at runtime. Anything written there is written to the host, not the container itself. That's what I neglected to include. Here's a sample command:

$ docker run --rm -it -v /PATH_ON_HOST/FILES_SAFE_HERE:/workspace \
  --entrypoint=terraform viya4-iac-aws:latest \
  apply -state /workspace/terraform.tfstate "/workspace/sasviya4iac.plan"

Because of the -v switch, anything that is saved to /workspace inside the container is written safely to /PATH_ON_HOST/FILES_SAFE_HERE in my host operating system... where it can be re-used as needed. In the hilariously long and convoluted command above, you can see I have two files saved for posterity: the plan file (already created) and the tfstate file (being created when the plan is applied).
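For completeness, the earlier plan step would follow the same pattern so that both the plan file and the tfstate land on the host. This is a hedged sketch, not the official project syntax, and the tfvars file name (sasviya4.tfvars) is my own placeholder:

$ docker run --rm -it -v /PATH_ON_HOST/FILES_SAFE_HERE:/workspace \
  --entrypoint=terraform viya4-iac-aws:latest \
  plan -var-file=/workspace/sasviya4.tfvars \
       -state=/workspace/terraform.tfstate \
       -out=/workspace/sasviya4iac.plan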

The Scope of the Problem

Alright, so there are 91 items I want to destroy in AWS. But guess what: there's no single place that lists them all together. Terraform had that list in its tfstate file, but it evaporated before I could use it.

 

Now, I know many of you have experienced deploying SAS Viya in Azure recently. And you're probably wondering why I don't just log on to the AWS Console (AWS's counterpart to the Azure Portal), find the resource group that contains all of my infrastructure, and delete it. Well, I wish I could. But AWS doesn't work that way.

 

So yes, AWS has a concept known as the resource group. But it's not like Azure's. In AWS, a resource group is really just an application of tags. And it's not sufficient to track all 91 items we're trying to find. Here, look:

 

dar2.png

 

There's just one resource group in my environment called "rocoll-rg". And as you can see, it's only tracking 31 resources. 60 are still in the wind.

 

Since we're talking about tags, AWS has another useful tool: the Tag Editor. You can use that to get a list of items with specific tag=value pairs. And I found that resourceowner=rocoll (my SAS userid) returns a list of 39 items.

 

dar3.png

 

Still not 91, but it's very helpful in providing pointers as to the AWS service and type of resources we're looking for.
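The same search can also be run from the command line through the Resource Groups Tagging API. Here's a rough equivalent of the Tag Editor query (the region and tag value reflect my environment, so adjust both):

$ aws resourcegroupstaggingapi get-resources \
    --region us-east-1 \
    --tag-filters Key=resourceowner,Values=rocoll \
    --query "ResourceTagMappingList[].ResourceARN"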

 

dar4.png

 

Oh yeah, AWS services. I think there are more of those than actual items I'm trying to delete. Here's a screencap of the AWS services list with my "favorites" marked with stars:

 

dar5.png

 

Fortunately, we don't need to visit all of them. But 5 in particular are home to the infrastructure provisioned by viya4-iac-aws:

  • EC2: the virtual machines and their components
  • EFS: Elastic File System
  • EKS: Elastic Kubernetes Service
  • RDS: Relational Database Service
  • VPC: Virtual Private Cloud

Of course, your customer environment could be far more complex, involving other services and features of AWS. But as far as what viya4-iac-aws stands up using its example files, this is where you should expect to find the core set of resources for SAS Viya.

Sometimes you gotta run before you can walk

So with an incomplete view of the items in my AWS environment, let's get to work tearing it all down. Many items in the environment cannot be deleted while they're in active use, and there are dependencies between items that must be honored to completely remove everything. With that said, I'll skip over my initial trial-and-error approach and provide a straightforward list of how I was able to completely destroy all resources in AWS as provisioned by the viya4-iac-aws project. (If you prefer the command line, a rough AWS CLI sketch of the same sequence follows this list.)

  1. AWS > EC2 > Auto Scaling Groups

     

    dar6.png

     

    Delete all 6 auto-scaling groups associated with your name. This prevents the machine instances we intend to terminate from being automatically replaced. It takes a few minutes for deletion to complete, and it will also terminate 8 of the 10 machines that were running.

     

  2. AWS > EC2 > Instances

     

    dar7.png

     

    Terminate (not just stop) the remaining 2 machine instances.

     

  3. AWS > EC2 > Volumes

     

    dar8.png

     

    Note the State column of this table shows these volumes as "available", not "in-use". If they're in-use, you won't be able to delete them. You might need to wait a few moments as the EC2 instances fully terminate.

     

    Also, assuming you chose storage_type="standard" in your viya4-iac-aws Terraform vars file, we can get rid of these NFS server volumes here. Alternatively, if you chose storage_type="ha", then look to delete resources in AWS > EFS instead.

     

  4. AWS > EC2 > Key Pairs

     

    dar9.png

     

    Delete the key pairs.

     

  5. AWS > Elastic Kubernetes Service > Clusters

     

    dar10.png

     

    Delete the EKS cluster associated with your name. This might take a few minutes, too.

     

  6. AWS > VPC > NAT Gateways

     

    dar11.png

     

    Delete the NAT gateway from the VPC.

     

  7. AWS > VPC > Your VPCs

     

    dar12.png

     

    Delete the VPC itself. If it complains that it cannot delete, you're probably just waiting on the NAT gateway to finish deleting. Note also in the resulting confirmation dialog that this deletes a number of other resources automatically for you.

     

  8. AWS > EC2 > Security Groups

     

    dar13.png

     

    Check here just in case. It should be empty, but if not, then delete the Security Groups associated with your name. Non-obvious interface tip: you may need to scroll down within the Actions menu to reach the delete option at the bottom of the list.

     

  9. AWS > EC2 > Elastic IPs

     

    dar14.png

     

    Release the EIPs. Disassociating them isn't enough, since an unattached EIP will hang around not being used but still costing SAS money.

     

  10. AWS > RDS > Subnet Groups

     

    dar15.png

     

    Delete the subnet group from the Relational Database Service.

     

  11. AWS > Resource Groups > Tag Editor

     

    dar16.png

     

    Perform a search specifying Region=us-east-1, Resource Types=All Resource Types, Tag Key=resourceowner, and Tag Value=<your userid>.

     

    The results are time-sensitive: as we delete items in the steps above, this list shrinks. However, some items linger in the list even after they're deleted, mostly Instances and Volumes in EC2. The point here is to confirm that's all that remains, and since we know we've already deleted all instances and volumes in EC2, we should be clear to proceed.

     

  12. AWS > IAM > Policies

     

    dar17.png

     

    Delete the IAM policies… one… at… a… time.

     

    Last time I checked, the Terraform destroy was not deleting the rocoll-eks-elb-sl-role-creation<…> policy automatically. You might have several in there. Delete any you find that appear associated with your Viya deployment.

     

  13. AWS > Resource Groups > Saved Resource Groups > <your userid>-rg

     

    dar18.png

     

    We don't need the resource group anymore. To delete it, you must open your resource group directly to find the delete button (not simply select it from a list table and use an Action button as shown in prior steps).
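For those who prefer the command line, here's a rough AWS CLI sketch of the same 13-step sequence. Treat it as illustrative only: it assumes a configured AWS CLI and the us-east-1 region, the names and IDs (anything in CAPITALS, the example IDs, and the rocoll* names) are placeholders for your own resources, and the same ordering and wait-for-dependencies caveats from the steps above still apply.

export AWS_DEFAULT_REGION=us-east-1

# 1. Auto Scaling groups (--force-delete also terminates their instances)
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name ASG_NAME --force-delete

# 2. Remaining EC2 instances
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

# 3. EBS volumes, once they show as "available"
aws ec2 delete-volume --volume-id vol-0123456789abcdef0

# 4. Key pairs
aws ec2 delete-key-pair --key-name rocoll-nfs-server-admin

# 5. The EKS cluster
aws eks delete-cluster --name EKS_CLUSTER_NAME

# 6. The NAT gateway
aws ec2 delete-nat-gateway --nat-gateway-id nat-0123456789abcdef0

# 7. The VPC itself (retries may be needed while the NAT gateway finishes deleting)
aws ec2 delete-vpc --vpc-id vpc-0123456789abcdef0

# 8. Any leftover security groups
aws ec2 delete-security-group --group-id sg-0123456789abcdef0

# 9. Release (don't just disassociate) the Elastic IPs
aws ec2 release-address --allocation-id eipalloc-0123456789abcdef0

# 10. The RDS subnet group
aws rds delete-db-subnet-group --db-subnet-group-name SUBNET_GROUP_NAME

# 11. Re-check what's still tagged with your userid
aws resourcegroupstaggingapi get-resources --tag-filters Key=resourceowner,Values=rocoll

# 12. Customer-managed IAM policies (list them, then delete by ARN)
aws iam list-policies --scope Local
aws iam delete-policy --policy-arn arn:aws:iam::123456789012:policy/rocoll_ebs_csi_policy

# 13. Finally, the resource group
aws resource-groups delete-group --group-name rocoll-rg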

Ready for more

With everything that viya4-iac-aws created finally deleted from AWS, we can try running the Terraform apply again. And this time, make sure to specify a persistent location for the tfstate file (i.e., not solely inside a container) so that we don't need to manually delete things all over again.

 

And here's a hint at the trial and error I went through when working on this post. If you neglect to delete something that viya4-iac-aws created, then when you run the Terraform apply, it'll eventually fail, complaining about something similar to:

Error: Error creating IAM policy rocoll_ebs_csi_policy: EntityAlreadyExists: A policy called rocoll_ebs_csi_policy already exists. Duplicate names are not allowed. 
status code: 409, request id: a80638a6-7815-45d9-a2b9-a3905f61bce4

Error: Error import KeyPair: InvalidKeyPair.Duplicate: The keypair 'rocoll-nfs-server-admin' already exists. 
status code: 400, request id: 2377695b-a9e3-4427-b6b6-a7487c3c54ce

As you can see, the problem is that Terraform cannot create AWS resources with names that already exist. That's your clue to find those leftover items and delete them. Take care to also run a Terraform destroy at this point, before attempting yet another Terraform apply.
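That destroy follows the same containerized pattern as the apply shown earlier. Here's a hedged sketch, again using my placeholder file names on the mapped host volume:

$ docker run --rm -it -v /PATH_ON_HOST/FILES_SAFE_HERE:/workspace \
  --entrypoint=terraform viya4-iac-aws:latest \
  destroy -var-file=/workspace/sasviya4.tfvars \
    -state=/workspace/terraform.tfstate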

Make it better

I've still got a lot to learn about AWS and optimizing SAS Viya's deployment there. And so if you've got a simpler, more complete, or just plain better approach to deleting items when Terraform is unable to, then please share it in the comments below, or email me. I'm happy to update this post with any improvements.

The next step

Creating and managing resources in AWS is just a preliminary step. The end objective is to deploy the SAS Viya software into that environment. Many of the same folks who brought us the viya4-iac-aws project are also working on an effort to help make that goal easier to attain: see the SAS Viya 4 Deployment project under the sassoftware organization on GitHub.

 

