This blog is the second installment of a series where we discuss various use cases around the benefits of using Google Workload Identity Federation (WIF) with SAS Viya. See here for the first installment of the blog, where we shared details on how WIF works and how it can be leveraged to provide access to a Google Cloud SQL for PostgreSQL database server. In this blog, we’ll discuss more advanced use cases like enabling access to Google Big Query and Google Cloud Storage. Let’s get started.
The SAS Viya platform uses the SAS/Access engine for Big Query to access data stored there. The engine allows you to use a Google service account to access data that resides in Big Query. This is accomplished by referring to a service account key in JSON format via the CREDFILE libname option. See the Access engine documentation for more information on this option. The SAS/Access engine uses the credentials contained in the service account key JSON file to authenticate against Google Big Query.
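For illustration, such a libname statement could look like the following sketch; the libref, project, schema, and key file path are placeholder values:
libname gbqsa bigquery project="my-gcp-project" schema="my_schema"
    credfile="/path/to/service-account-key.json";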
The above method of authentication comes with some disadvantages. It relies on long-lived credentials: anybody who gets hold of the key file gets access to the data. Another disadvantage is that this method doesn’t support user- or group-based access controls, since every session authenticates as the same service account regardless of the current user. It’s nice to have a paper trail of who has accessed which data and to be able to apply fine-grained access controls to data based on actual users or group memberships. Ideally, we want users who have logged on to the SAS Viya platform to be able to seamlessly access data in Google Big Query without providing additional credentials.
This is where Google Workload Identity Federation (WIF) comes into play. WIF is a mechanism that allows applications running outside of Google Cloud to authenticate and access Google Cloud resources without relying on long-lived credentials like service account keys. WIF establishes trust between an external identity provider, like MS Entra ID, and Google Cloud. This allows federated identities to exchange their token for a Google Cloud token and access the specific Google services that the identity is allowed to access.
The solution consists of three main building blocks, which will be discussed in more detail during the remainder of this blog. These building blocks are:
- Single Sign-on between the SAS Viya platform and MS Entra ID
- The Workload Identity Federation components (a pool and a provider) on the Google Cloud side
- A credential configuration file that the SAS/Access engine uses to exchange the user’s token for a Google Cloud token
In a simplistic manner, the resulting “triangle” looks like this:
To enable Single Sign-on, you need to create an MS Entra ID application registration and make some configuration changes to SAS Viya. This is outside the scope of this blog post. You can find more information on how to set this up in our official documentation here.
The important thing is that this is an essential building block that should be in place before proceeding with the next steps in this blog.
Two things to check in the configuration before moving on to implement the example discussed in this blog:
To check if email is part of the claim, you can simply go to the Azure Portal. Then select Microsoft Entra ID.
On the left side of your screen, you will find the menu shown in the image. Expand the item Manage. Under Manage you will find several sub-items. Select App registrations.
This is the place where your MS Entra ID application registration resides. Select the registration that is used in the SAS Viya platform to provide SSO.
Once the registration has been selected, you can confirm whether email is part of the claim by selecting Manage --> Token configuration.
This will show the current setup of the claim on the screen.
Confirm that email is part of the claim. If not, please revisit the instructions on setting up SSO for SAS Viya to add it. This is also where you can modify the claim and add additional attributes that can be used to secure your data in Google Big Query through WIF.
Later in this blog this will be discussed in more detail.
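For reference, a decoded ID token that includes the email claim could look roughly like this (the values are made up for illustration):
{
  "iss": "https://login.microsoftonline.com/<TENANT-ID>/v2.0",
  "aud": "<ENTRA-APP-ID>",
  "sub": "AbCdEf123456",
  "email": "sasdemo@example.com"
}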
Two components need to be created to set up Workload Identity Federation on the Google side. Make sure you have the proper permissions to create these components. You can use either the Google Cloud console or the Google CLI, gcloud, for these steps. We’ll show you the CLI commands in this blog.
Workload Identity Federation pool
To create the Workload Identity Federation pool execute the following statements. Modify accordingly to match your situation.
export pool=<PREFIX>-entraid-pool
gcloud iam workload-identity-pools create $pool \
--location="global" \
--description="test with identity pool" \
--display-name="$pool"
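To verify the pool was created as expected, you can describe it; this check is optional and not part of the required setup:
gcloud iam workload-identity-pools describe $pool --location="global"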
Workload Identity Federation provider
Once the Workload Identity pool has been created, we can create the provider within the pool. Modify accordingly to make sure all placeholders match your situation.
export provider=<PREFIX>-entraid-provider
export entra_appID=<ENTRA-APP-ID>
gcloud iam workload-identity-pools providers create-oidc $provider \
--location="global" \
--workload-identity-pool="$pool" \
--issuer-uri="https://login.microsoftonline.com/<TENANT-ID>/v2.0" \
--allowed-audiences="$entra_appID" \
--attribute-mapping="google.subject=assertion.email"
There are a couple of things to point out here:
- The issuer URI points to your MS Entra ID tenant; replace <TENANT-ID> with your own tenant ID.
- The allowed audience is the application ID of the MS Entra ID application registration that provides SSO for SAS Viya.
- The attribute mapping maps the email claim from the MS Entra ID token to google.subject; this is the value Google Cloud uses to identify the federated user.
Set up permissions on Google Big Query
Now that all required components for Workload Identity Federation are in place, permissions can be assigned to either a database or a table within Big Query through the use of principals. A principal is basically an identifier for a federated identity; by binding it to a Cloud IAM role on an object within Google Cloud (a database or a table in Big Query, for instance), you control what that identity can access.
Let’s break down the below example of a principal and explain some of its components:
principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/SUBJECT
- PROJECT_NUMBER is the number of the Google Cloud project that hosts the Workload Identity Federation pool.
- POOL_ID is the ID of the pool we created earlier.
- SUBJECT is the value that was mapped to google.subject in the provider; in our setup that is the email address from the MS Entra ID token.
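For instance, with the email-based attribute mapping configured earlier, a principal for a hypothetical user could look like this (the project number, pool name, and email address are made-up values):
principal://iam.googleapis.com/projects/123456789012/locations/global/workloadIdentityPools/myprefix-entraid-pool/subject/sasdemo@example.com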
In the next paragraph you will find an example of what that looks like when you assign a federated identity the Big Query Viewer Cloud IAM role through a principal on a Big Query table.
An example of assigning a role to a principal for a Big Query table
To assign permissions to a table in Big Query, open the interface through the browser, select a database and then select a table. Click on the Share button; this will open the window shown below.
Select add principal and then provide the principal.
Then assign a role to that principal.
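If you prefer the command line over the console, the same grant can be sketched with the bq tool; the member string, role, and table reference below are placeholders to adapt to your setup:
bq add-iam-policy-binding \
--member="principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/SUBJECT" \
--role="roles/bigquery.dataViewer" \
PROJECT_ID:DATASET.TABLE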
Creating the credential configuration file
For the SAS/Access engine for Big Query to use WIF, a file needs to be created that contains information about the Workload Identity Federation pool. The information in the file also tells WIF where it can find the token and which Google endpoint to use to swap it for a Google Cloud token.
This file is called a credential configuration file and is in JSON format. It will be referenced in the libname statement for Big Query using the credfile option. That’s why this file needs to be stored somewhere that is accessible to compute sessions in the SAS Viya platform.
Use the code below to generate the file:
project_number=$(gcloud projects describe $(gcloud config get-value core/project) --format=value\(projectNumber\))
pool=<ID OF POOL>
provider=<ID OF PROVIDER>
cat <<EOF> /tmp/credentials-config.json
{
  "universe_domain": "googleapis.com",
  "type": "external_account",
  "audience": "//iam.googleapis.com/projects/$project_number/locations/global/workloadIdentityPools/$pool/providers/$provider",
  "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
  "token_url": "https://sts.googleapis.com/v1/token",
  "credential_source": {
    "file": "/tmp/token.json"
  }
}
EOF
Testing the connection
Here’s a sample SAS program to test the connection to Big Query using WIF.
/* Macro that writes out the ID token to the compute session temp directory. The ID token is later exchanged for a Google Cloud token */
%get_identity_token;
/* The credfile option points to the credential configuration JSON file, which tells the SAS/Access engine how it can retrieve the Google token and which token to use */
options sastrace='d,d,d,d' sastraceloc=saslog nostsuffix msglevel=i ;
libname gbqlib2 bigquery project="bq_data" schema="HMEQ" credfile="/homes/&_CLIENTUSERID/credentials-config.json";
data test;
set gbqlib2.ID_HMEQ(obs=1);
run;
Let’s look at the code in more detail:
- The %get_identity_token macro writes the user’s ID token to the location that the credential configuration file references in credential_source.
- The sastrace and msglevel options are optional; they add detailed tracing to the SAS log, which is helpful when debugging the connection.
- The credfile option points to the credential configuration file created earlier; the SAS/Access engine uses it to exchange the ID token for a Google Cloud token before connecting to Big Query.
To explain the concept of Workload Identity Federation and accessing data in Big Query from SAS Viya, we used an example in which the data is secured using an email address. In real-life situations, using groups to secure data is preferred over using email addresses. The security model requires less maintenance that way!
Can this be done using WIF? The short answer is yes!
To use groups from an external Identity provider with Workload Identity Federation three things need to be modified.
Adding the groups claim to the MS Entra ID application
Adding the groups claim to the MS Entra ID application can be done through the Azure portal. You need Owner permissions on the application registration to be able to make these modifications.
The groups claim can be added by selecting Manage --> Token configuration and then clicking on Add groups claim.
Now you are prompted to select which type of group is included in the claim. Please consult an MS Entra ID admin to determine which type is the proper type for your organization as this might differ from the above screenshot.
Regarding the token properties, sAMAccountName is used, as this provides the names of the groups that are used in the principal. But there are more options. Again, discuss this with an MS Entra ID admin to determine the correct attribute for your organization’s setup.
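Once the groups claim is added, a decoded ID token would also carry the group names, conceptually like this (the group names are made up):
{
  "email": "sasdemo@example.com",
  "groups": ["bq-analysts", "sas-viya-users"]
}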
Modifying the attribute mapping of the WIF provider
Modifying the attribute mapping is straightforward. Assuming that all the components are still in place after getting the example in this blog working, you can simply modify the provider through the Google Cloud console as shown below.
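If you would rather script this change, it can also be sketched with gcloud; this assumes the groups arrive in the token as assertion.groups:
gcloud iam workload-identity-pools providers update-oidc $provider \
--location="global" \
--workload-identity-pool="$pool" \
--attribute-mapping="google.subject=assertion.email,google.groups=assertion.groups"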
Assigning permissions for a group to a table in Big Query
Once the previous modification is in place, you can start assigning permissions to a table in Big Query. Follow the steps as described previously in this blog, with one exception in the definition of the principal, shown below.
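For groups, the identifier uses principalSet rather than principal and ends in a group instead of a subject. The general form looks like this, with GROUP_NAME being a group value from the mapped groups claim:
principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/group/GROUP_NAME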
Other than that, the string is similar to what we discussed earlier in the blog. Assign a role and apply the principalSet to the table.
The code that was used earlier to test our setup doesn’t require any modifications. Use that code to validate that you can access the data that resides in Big Query.
Let’s move on to the final use case we want to talk about in this blog: accessing data stored on Google’s object storage solution, Google Cloud Storage (GCS). The requirements for this scenario are the same as for Big Query: we want to avoid using passwords in our code, and we want to be able to use user- or group-based access controls to protect the data.
While SAS can generally access data on GCS by using PROC S3 or FILENAME statements, the use of WIF complicates things: it is a proprietary IAM framework of Google Cloud and is not directly supported by these approaches. Instead, we need to look at the REST API offered by GCS, which offers the usual HTTP verbs like GET, PUT, DELETE and HEAD for manipulating the contents of your buckets.
If you have followed our blog so far, you already know that WIF uses Google’s STS service to swap out the OIDC ID token we can offer for an access token that is understood by other Google services. Using it as our Bearer token, we can then make a call to the GCS REST API to perform our desired action, e.g. download a file from a bucket. Both steps can be performed using PROC HTTP.
Here’s a sample SAS program to show how this works.
/* Macro that writes out the ID token to the compute session temp directory. The ID token is later exchanged for a Google Cloud token */
%get_identity_token;
/* Ask Google STS to swap the ID token for an access_token */
filename resp2 temp;
/* Replace <PROJECT-NUMBER>, <POOL-ID> and <PROVIDER-ID> with your own values */
proc http
  method="POST"
  out=resp2
  url="https://sts.googleapis.com/v1/token"
  query=(
    "audience"="//iam.googleapis.com/projects/<PROJECT-NUMBER>/locations/global/workloadIdentityPools/<POOL-ID>/providers/<PROVIDER-ID>"
    "grant_type"="urn:ietf:params:oauth:grant-type:token-exchange"
    "requested_token_type"="urn:ietf:params:oauth:token-type:access_token"
    "scope"="https://www.googleapis.com/auth/cloud-platform"
    "subject_token_type"="urn:ietf:params:oauth:token-type:jwt"
    "subject_token"="&my_azureToken")
  ;
run;
libname output2 json fileref=resp2;
data _null_;
set output2.root;
call symputx('my_gcpToken',access_token); /* symputx strips trailing blanks from the token value */
run;
/* REST call to download from the bucket using the access_token */
filename out TEMP;
%let BUCKET_NAME=my-bucket;
%let OBJECT_NAME=data.csv;
proc http
  url="https://storage.googleapis.com/storage/v1/b/&BUCKET_NAME./o/&OBJECT_NAME.?alt=media"
  method="GET"
  out=out
  oauth_bearer="&my_gcpToken.";
run;
proc import datafile=out out=mydata dbms=csv replace;
getnames=yes;
run;
proc print data=mydata; run;
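The same pattern covers the other HTTP verbs mentioned earlier. As a sketch, uploading an object could look like the code below; it reuses the same access token and placeholder names, the local file path is hypothetical, and the endpoint is the GCS JSON API media upload:
/* REST call to upload a file to the bucket using the access_token */
filename upload "/tmp/data.csv"; /* hypothetical local file to upload */
proc http
  url="https://storage.googleapis.com/upload/storage/v1/b/&BUCKET_NAME./o"
  method="POST"
  in=upload
  ct="text/csv"
  query=("uploadType"="media" "name"="&OBJECT_NAME.")
  oauth_bearer="&my_gcpToken.";
run;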
Keep in mind that you need to have completed the WIF and MS Entra ID configuration as described before.
Workload Identity Federation allows you to use a federated identity to access and secure data that is stored within your Big Query database or a Google Cloud Storage bucket. It removes the need to use long-lived credentials, like a Google service account key, to access your data, which eliminates a big security risk. Another welcome feature of WIF is that it enables you to apply access controls to data using either user or group information from an external identity provider. That eliminates the need to manage a separate security model in Google Cloud IAM.
One final question remains: are you going to start using WIF to access data in Big Query and Google Cloud Storage Buckets from SAS Viya?
Let us know in the comments!