
SAS Viya and Apache Airflow in Kubernetes: A Peaceful Coexistence


SAS released the SAS Airflow Provider earlier this year. This tool allows SAS administrators and power users to orchestrate SAS jobs and SAS Studio flows using Apache Airflow, an “open-source platform for developing, scheduling, and monitoring batch-oriented workflows”.

 

[Image: a sample DAG in the Airflow UI]


 

Apache Airflow can be deployed in many ways and might already exist at your site. If not, this might be the opportunity to think twice and make SAS Viya and Apache Airflow work hand in hand. In this blog, we will look at some aspects of the Apache Airflow deployment that enable a seamless integration between SAS Viya and Apache Airflow.

 

 

Generalities

 

Apache Airflow and Kubernetes

 

While Kubernetes is not mandatory for Airflow to work with SAS Viya, deploying Airflow in Kubernetes makes it flexible, cloud-native, and elastic. An official Helm chart for Apache Airflow is available and makes deploying it in Kubernetes very easy.

 

Airflow can be deployed in the same Kubernetes cluster as SAS Viya, leveraging the same infrastructure, but in a different namespace for software isolation.

 

If dedicated to SAS Viya, the Airflow workload should consist only of calls to SAS jobs and SAS Studio flows, which in turn run on the SAS Viya platform.

 

Finally, to be able to use the SAS operators in Airflow, the default container image used in the Airflow Helm chart needs to be extended to include the SAS Airflow Provider.

 

Apache Airflow and SAS

 

First, let’s recall that what we would probably call a process flow (a collection of tasks/programs/jobs organized in a specific sequence) is called a DAG (Directed Acyclic Graph) in Airflow and is defined in a Python script. So, just code. No authoring UI is available out of the box.

 

[Image: the Python code of a DAG]
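
 

To make this concrete, here is a minimal, self-contained DAG (not tied to SAS) that chains two shell tasks; it assumes Airflow 2.4 or later for the schedule parameter:

# hello_dag.py - a minimal two-task DAG, for illustration only
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,    # no schedule: trigger it manually
    catchup=False,
) as dag:
    step_1 = BashOperator(task_id="step_1", bash_command="echo 'step 1'")
    step_2 = BashOperator(task_id="step_2", bash_command="echo 'step 2'")
    step_1 >> step_2  # step_2 runs only after step_1 succeeds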

 

Also, Airflow DAGs (Python scripts) are automatically discovered by the Airflow framework when they are saved in a designated DAGs directory. Consequently, if Airflow and SAS Viya share the same DAGs directory, a DAG defined from within SAS Viya is automatically discovered in Airflow, which facilitates the integration.

 

[Image: the list of DAGs in the Airflow UI]

 

 

In detail

 

Step 1 – Extending the standard image

 

The first step is to extend the default Airflow container image to include the pieces that allow us to call SAS Viya jobs or flows. Airflow comes out of the box with several providers, which supply the operators that make integration with third-party tools possible (essentially, Airflow triggers jobs in various applications). Many more providers are not included by default and need to be installed.

 

Below is an example of how to extend the default container image. Here, we use a Dockerfile and a requirements.txt file to build a new image that includes the SAS Airflow Provider.

 

tee ./requirements.txt > /dev/null << EOF
sas-airflow-provider
EOF

tee ./Dockerfile > /dev/null << EOF
FROM apache/airflow:latest
RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install -r requirements.txt
EOF

docker build -t airflow-sas:1.0.0 .
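
Optionally, you can sanity-check that the provider made it into the new image; pip is called directly by overriding the image entrypoint, since the Airflow image would otherwise interpret the arguments as an airflow subcommand:

docker run --rm --entrypoint pip airflow-sas:1.0.0 show sas-airflow-provider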

 

Then you have to make this image available to your deployment (multiple options are described in point 4 here). One way is to push it to a remote registry:

 

docker tag airflow-sas:1.0.0 registry.example.com/project/airflow/airflow-sas:1.0.0
docker push registry.example.com/project/airflow/airflow-sas:1.0.0

 

 

Step 2 – Deploy Apache Airflow

 

We are ready to start the deployment process. The first thing to do is customize the Airflow deployment: there are several properties that we want to modify.

You can have a look at all the chart’s properties here or here, or you can dump them into a file to help you get started:

 

helm repo add apache-airflow https://airflow.apache.org
helm show values apache-airflow/airflow > values.yaml

 

We are going to focus on a few properties of interest, but I will provide the values.yaml file I use at the end:

 

defaultAirflowRepository: registry.example.com/project/airflow/airflow-sas
defaultAirflowTag: "1.0.0"

 

This is the repository address in your registry where you pushed your customized container image, along with the tag you gave it.

 

ingress.web.hosts: [airflow.example.com]
ingress.web.ingressClassName: "nginx"

 

You define the Ingress host (essentially the Airflow UI URL), and you can reuse the Ingress class name of your SAS Viya deployment.

 

extraEnv: |
  - name: AIRFLOW__CORE__LOAD_EXAMPLES
    value: 'True'
  - name: AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL
    value: '30'

 

You can add some environment variables; any variable named AIRFLOW__{SECTION}__{KEY} overrides the corresponding option in airflow.cfg. Here I want the sample DAGs to be loaded, and I want to reduce the scanning interval of the DAGs directory (how often, in seconds, Airflow scans it for new files) from the default 5 minutes to 30 seconds: a demo setting so that a file dropped in the DAGs directory is quickly discovered as a DAG in Airflow.

 

volumes:
  - name: dags
    nfs:
      server: nfs.example.com
      path: /shared/gelcontent/airflow/dags
volumeMounts:
  - mountPath: '/opt/airflow/dags'
    name: 'dags'

 

Finally, I want to change the default DAGs directory (the main directory that Airflow scans to detect new DAGs) to a location on an NFS server shared between Airflow and SAS Viya. That will be handy when I author a DAG from within SAS: I save it to the DAGs directory, and it automagically appears in Airflow.
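
Once Airflow is deployed (see below), you can confirm the mount from inside the scheduler pod; this assumes a Helm release named airflow in the airflow namespace, as used later in this step:

kubectl -n airflow exec deploy/airflow-scheduler -- ls /opt/airflow/dags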

 

Here is the complete values.yaml file I use:

 

# Default airflow repository -- overridden by all the specific images below
defaultAirflowRepository: registry.example.com/project/airflow/airflow-sas

# Default airflow tag to deploy
defaultAirflowTag: "1.0.0"

# Ingress configuration
ingress:
  # Enable all ingress resources (deprecated - use ingress.web.enabled and ingress.flower.enabled)
  enabled: ~

  # Configs for the Ingress of the web Service
  web:
    # Enable web ingress resource
    enabled: true

    # Annotations for the web Ingress
    annotations: {}

    # The path for the web Ingress
    path: "/"

    # The pathType for the above path (used only with Kubernetes v1.19 and above)
    pathType: "ImplementationSpecific"

    # The hostname for the web Ingress (Deprecated - renamed to ingress.web.hosts)
    host: ""

    # The hostnames or hosts configuration for the web Ingress
    hosts: [airflow.example.com]
    # - name: ""
    #   # configs for web Ingress TLS
    #   tls:
    #     # Enable TLS termination for the web Ingress
    #     enabled: false
    #     # the name of a pre-created Secret containing a TLS private key and certificate
    #     secretName: ""

    # The Ingress Class for the web Ingress (used only with Kubernetes v1.19 and above)
    ingressClassName: "nginx"

    # configs for web Ingress TLS (Deprecated - renamed to ingress.web.hosts[*].tls)
    tls:
      # Enable TLS termination for the web Ingress
      enabled: false
      # the name of a pre-created Secret containing a TLS private key and certificate
      secretName: ""

    # HTTP paths to add to the web Ingress before the default path
    precedingPaths: []

    # Http paths to add to the web Ingress after the default path
    succeedingPaths: []

  # Configs for the Ingress of the flower Service
  flower:
    # Enable web ingress resource
    enabled: false

    # Annotations for the flower Ingress
    annotations: {}

    # The path for the flower Ingress
    path: "/"

    # The pathType for the above path (used only with Kubernetes v1.19 and above)
    pathType: "ImplementationSpecific"

    # The hostname for the flower Ingress (Deprecated - renamed to ingress.flower.hosts)
    host: ""

    # The hostnames or hosts configuration for the flower Ingress
    hosts: []
    # - name: ""
    #   tls:
    #     # Enable TLS termination for the flower Ingress
    #     enabled: false
    #     # the name of a pre-created Secret containing a TLS private key and certificate
    #     secretName: ""

    # The Ingress Class for the flower Ingress (used only with Kubernetes v1.19 and above)
    ingressClassName: ""

    # configs for flower Ingress TLS (Deprecated - renamed to ingress.flower.hosts[*].tls)
    tls:
      # Enable TLS termination for the flower Ingress
      enabled: false
      # the name of a pre-created Secret containing a TLS private key and certificate
      secretName: ""

extraEnv: |
  - name: AIRFLOW__CORE__LOAD_EXAMPLES
    value: 'True'
  - name: AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL
    value: '30'

webserverSecretKey: d9e40acbe90806dd6fc30d67edd3bdd0

volumes:
  - name: dags
    nfs:
      server: nfs.example.com
      path: /shared/gelcontent/airflow/dags
  - name: plugins
    nfs:
      server: nfs.example.com
      path: /shared/gelcontent/airflow/plugins
  - name: scripts
    nfs:
      server: nfs.example.com
      path: /shared/gelcontent/airflow/scripts
volumeMounts:
  - mountPath: '/opt/airflow/dags'
    name: 'dags'
  - mountPath: '/opt/airflow/plugins'
    name: 'plugins'
  - mountPath: '/opt/airflow/scripts'
    name: 'scripts'

logs:
  persistence:
    # Enable persistent volume for storing logs
    enabled: true

 

Now it is time to deploy Airflow:

 

helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace -f ./values.yaml
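
You can watch the pods start before moving on:

kubectl get pods -n airflow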

 

Validate that it is working by opening the URL of the Ingress you specified earlier (http://airflow.example.com/) and logging in with the default user admin/admin:

 

[Image: the Airflow login page]

 

The first thing to check is the presence of the SAS Airflow provider:

 

[Image: the SAS Airflow provider in the Airflow providers list]
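
You can also confirm it from the command line; again, this assumes a Helm release named airflow in the airflow namespace:

kubectl -n airflow exec deploy/airflow-webserver -- airflow providers list | grep -i sas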

 

 

Demo

 

Now that we have Apache Airflow set up with the SAS Airflow provider, and Airflow and SAS Viya sharing the same DAGs directory, we can illustrate how SAS Viya and Airflow work together. Let’s do this in video:
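
If you cannot watch the video, here is a sketch of what such a DAG could look like, based on the examples published in the SAS Airflow Provider repository; the operator arguments and the /Public/Airflow/demo_flow.flw path are illustrative, so check the repository README for the exact signature:

# sas_demo_dag.py - sketch of a DAG that runs a SAS Studio flow
from datetime import datetime

from airflow import DAG
from sas_airflow_provider.operators.sas_studio import SASStudioOperator

with DAG(
    dag_id="sas_demo_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Path to a hypothetical flow stored in SAS Content;
    # replace it with one of your own SAS Studio flows
    run_flow = SASStudioOperator(
        task_id="run_demo_flow",
        path_type="content",
        path="/Public/Airflow/demo_flow.flw",
        exec_log=True,  # stream the SAS log into the Airflow task log
    )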

 

 


Find more articles from SAS Global Enablement and Learning here.

Comments

Thank you for bringing this tool to our attention. We will most probably use it because of its coexistence with SAS; without knowing about it, we were considering Dagster. The demo is great. One clarification question: the custom steps that you are showing, like "airflow-add task", where can they be found? Regards, Karolina

@touwen_k Thanks for your comments, Karolina. Let me check what I can do to share those custom steps.

@NicolasRobert Hi Nicolas, thank you for a great article. Did you have a chance to share those custom steps that Karolina was asking about? It would be really helpful if you could provide some kind of repository with them.

Hello.

 

There is a public repository for crowd-sourced custom steps (https://github.com/sassoftware/sas-studio-custom-steps) that I recommend you visit. Unfortunately, I haven't had time to publish the Airflow ones in it yet.

Feel free to leave me your email address through a private message and I will send them to you.

 

Regards,

Nicolas.

Hi Nicolas. Did you ever share the custom steps to create the DAGs? Very informative post, by the way. Thanks, Eoin.

The SAS Studio Custom Steps used in this demo are now available in the SAS Studio Custom Step GitHub repository:

https://github.com/sassoftware/sas-studio-custom-steps/tree/main/Airflow%20-%20Generate%20DAG

 

Does anyone have updated instructions that work on the latest SAS Viya 2024.02?
