
Using Apache Airflow to automate SAS Viya administration tasks


In previous posts published in the early days of SAS Viya on Kubernetes, I introduced the benefits of running the SAS Viya CLI inside a Docker container and discussed how you can orchestrate Viya administration tasks using a CLI container and Argo Workflows. You can catch up on updated versions of these two posts here:

 

 

At SAS, recent developments around orchestration and scheduling have focused on using Apache Airflow to schedule and manage flows. The SAS Airflow Provider allows SAS Viya administrators and power users to orchestrate SAS jobs and SAS Studio flows using Apache Airflow. Argo Workflows allowed us to orchestrate common administration tasks performed with the SAS Viya CLI. With the focus now on Apache Airflow, I wondered if it could do the same. The short answer is yes; read on for the details.

 

Apache Airflow vs. Argo Workflows

 

[Image: 01_gn_aiflow_000.png]


 

First, let's compare the two applications. Apache Airflow and Argo Workflows are two of the most popular workflow engines available. A workflow engine is a platform for scheduling, starting, stopping, organizing, and monitoring flows that contain sets of related tasks. How do Apache Airflow and Argo Workflows compare? First, the similarities. Both use a Directed Acyclic Graph (DAG): a DAG models a multi-step workflow as a sequence of tasks and includes the dependencies between those tasks. Both can be used with Kubernetes; Argo is entirely Kubernetes-native, while Airflow is Kubernetes-friendly and can be run in a Kubernetes cluster to take advantage of the increased stability and autoscaling options that Kubernetes provides. The big difference between the two is how you define your DAG workflows: in Argo you define them in YAML, whereas in Airflow you define them in Python. (You have to admit that Argo does have the cooler logo.)

 

Apache Airflow and the SAS Viya CLI

 

Apache Airflow is modular and can interface with other software by installing additional packages, called providers or operators. The SAS Airflow Provider is a good example of this: it was created to allow Airflow to run SAS processing. Other commonly used operators include the Bash operator and the Python operator. The Bash operator can run shell scripts or a set of commands and would be one option for running the sas-viya CLI commands. You can see an example of this in Michael’s post here. However, the obvious choice for us is the Airflow KubernetesPodOperator, which executes a task defined as a Docker image in a Kubernetes pod. Let’s look at how we run our sas-viya administration workflow with Airflow and the KubernetesPodOperator.

 

In our environment, Apache Airflow is installed and configured in Kubernetes as described in Nicolas's post here. One of the key aspects of the integration is that the environment is configured so that Airflow DAGs (Python scripts) are automatically discovered by the Airflow framework when they are saved in a specific directory.

 

Define and Run an Administrative Workflow

 

Here is how we define and run a Viya administration workflow in Airflow. Our flow will automate the initial administration tasks for a new Viya environment. It will:

 

  1. Set up identities
  2. Create a folder structure
  3. Apply an authorization schema to the folder structure
  4. Create caslibs for data access
  5. Load data to CAS
  6. Apply CAS authorization
  7. Load Viya content
  8. Validate that the previous steps were successful

 

The first thing we need is a container image that includes the CLI and any dependent software. How to build this image was covered in this post. Next, we define the Python script that creates the flow. Let’s walk through the script.

 

At the start, we import the Python packages we need in the script. We then define the volumes and volume mounts in Python lists, and the environment variables in a Python dictionary. These elements will be passed to the KubernetesPodOperator. The ConfigMap volumes make the CLI configuration and credentials files and the Viya environment certificates available to each pod. The environment variables define the CLI profile to use and the paths to the certificate files. When the pod for each task starts, the environment variables are set and the volumes mounted.

 

[Screenshot: 02_gn_airflow_001.png — volume, volume mount, and environment variable definitions]
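For reference, here is a minimal sketch of what these definitions might look like. The ConfigMap names, mount paths, and environment variable values are illustrative assumptions, and the import paths can vary with your Airflow and provider versions.

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Volumes backed by ConfigMaps holding the CLI configuration/credentials
# and the Viya environment CA certificates (hypothetical ConfigMap names).
volumes = [
    k8s.V1Volume(
        name="cli-config",
        config_map=k8s.V1ConfigMapVolumeSource(name="sas-viya-cli-config"),
    ),
    k8s.V1Volume(
        name="viya-certs",
        config_map=k8s.V1ConfigMapVolumeSource(name="sas-viya-ca-certificates"),
    ),
]

# Where those volumes are mounted inside each task pod.
volume_mounts = [
    k8s.V1VolumeMount(name="cli-config", mount_path="/root/.sas"),
    k8s.V1VolumeMount(name="viya-certs", mount_path="/security"),
]

# CLI profile to use and paths to the certificate files (illustrative values).
env_vars = {
    "SAS_CLI_PROFILE": "default",
    "SSL_CERT_FILE": "/security/trustedcerts.pem",
    "REQUESTS_CA_BUNDLE": "/security/trustedcerts.pem",
}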

 

Next, we create the DAG, giving it a name (01-load-content-flow) and setting default attributes including the schedule, start date, and so on. In our example, we create a flow that is not scheduled; we will trigger it manually in the UI. If you want to schedule the flow, you provide the schedule in cron format.

 

[Screenshot: 03_gn_airflow_002.png — DAG definition]
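Continuing the sketch above, the DAG definition might look like the following. The owner, start date, and catchup settings are assumptions, and older Airflow releases use schedule_interval rather than schedule.

default_args = {
    "owner": "viya-admin",  # illustrative owner
    "retries": 0,
}

dag = DAG(
    dag_id="01-load-content-flow",
    default_args=default_args,
    description="Initial administration tasks for a new Viya environment",
    start_date=datetime(2023, 9, 1),  # illustrative start date
    schedule=None,                    # supply a cron string here to schedule, e.g. "0 6 * * *"
    catchup=False,
)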

 

Following the DAG, we define the tasks. In our case, each task uses the KubernetesPodOperator and runs a script that contains SAS Viya CLI commands. For each task, we provide the container image, give the task a name, and reference the volumes, volume mounts, and environment variables defined earlier. We pass the bash -c command with a single argument: the script to run. Each script contains the CLI commands that accomplish the task.

 

[Screenshot: 04_gn_airflow_003-1.png — task definitions using the KubernetesPodOperator]
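Continuing the sketch, one such task might look like this; the namespace, image name, and script path are assumptions.

setup_identities = KubernetesPodOperator(
    dag=dag,
    task_id="setup-identities",
    name="setup-identities",
    namespace="airflow",
    image="registry.example.com/viya4-cli:latest",  # containerized sas-viya CLI image
    cmds=["/bin/bash", "-c"],
    arguments=["/scripts/setup_identities.sh"],     # script containing the CLI commands for this task
    volumes=volumes,
    volume_mounts=volume_mounts,
    env_vars=env_vars,
    is_delete_operator_pod=False,  # keep completed pods so they remain visible to kubectl
    get_logs=True,                 # stream the pod log back to the Airflow task log
)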

 

The final part of the Python program defines the dependencies for the tasks in the DAG.

 

[Screenshot: 05_gn_airflow_004.png — task dependencies]
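Completing the sketch, and assuming one task object per step listed earlier (the task names are illustrative), the dependencies can be expressed with Airflow's bit-shift syntax:

(
    setup_identities
    >> create_folders
    >> apply_folder_authorization
    >> create_caslibs
    >> load_data
    >> apply_cas_authorization
    >> load_viya_content
    >> validate_environment
)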

 

When we save the Python program to the directory that Airflow is configured to monitor, the DAG is imported into Apache Airflow and we can view it in the application. Notice that the dependencies set up in the DAG definition are honored. As mentioned before, the flow can be scheduled or triggered from the UI.

 

[Screenshot: Edited_06_gn_airflow_009.png — the DAG in the Airflow UI]

 

The application also allows you to monitor flows. The screenshot below shows a flow in progress: the color coding shows that five tasks have completed successfully, one task is running, and the final two tasks are waiting to start.

 

[Screenshot: Edited_07_gn_airflow_006.png — monitoring a flow in progress]

 

When tasks are running or complete, you can view their logs. In the screenshot below, from the data load task, we can see messages indicating that the data is being loaded to CAS.

 

[Screenshot: Edited_08_gn_airflow_10-1.png — task log showing data being loaded to CAS]

 

You can also use tools like kubectl or Lens to monitor the process. A kubectl get pods shows a completed pod for each task in the flow (pods can also be deleted automatically on completion).

 

kubectl get pods -n airflow | grep task

 

[Screenshot: 09_gn_airflow_005.png — kubectl get pods output]

 

Wrap-Up

 

In this post, we have looked at two automation and workflow management tools: Apache Airflow and Argo Workflows. Both applications provide similar functionality; the main difference between them is the language used to define flows. Apache Airflow uses Python and Argo Workflows uses YAML. Apache Airflow can run in, and take advantage of, Kubernetes. In a prior post, we automated the execution of Viya administration tasks with Argo Workflows. In this post, we demonstrated that you can automate, schedule, execute, and monitor SAS Viya administration tasks using Apache Airflow, a containerized SAS Viya CLI, and the Airflow KubernetesPodOperator.

 

 

Find more articles from SAS Global Enablement and Learning here.
