
Accessing Google Cloud Storage (GCS) with SAS Viya 3.5 – An overview

Started 07-01-2020

With the rise of cloud data lakes, born from the convergence of data lakes and cloud computing, being able to access a wide variety of data in the major cloud providers' object storage has become essential, if not mandatory.

 

Object storage provides a reliable, affordable and flexible way of storing, managing and using data on cloud platforms. Cloud providers such as AWS, Microsoft and Google offer plenty of data services that build on their own object storage implementations, which makes those services very efficient. Ingesting data has never been so easy.

 

SAS started working with cloud object storage in 2016, releasing PROC S3 to access files located in AWS S3 from SAS. Since then, many more capabilities have been added.

But what about Google Cloud Storage (GCS), Google Cloud Platform (GCP)'s object storage?

 

Is SAS or CAS able to access GCS contents? If yes, how?

 

Well, as of SAS Viya 3.5, you have probably noticed that there is no direct access to GCS through a CASLIB, a LIBNAME or a FILENAME statement. That will certainly come in the future.

 

In this article, I will briefly describe some of the current options for accessing data in Google Cloud Storage, before covering them in more detail in upcoming blog articles.

gsutil - the "indirect" way

For SAS and CAS to be able to access files in GCS, you can certainly choose to make those files available beforehand on a file system accessible from SAS and/or CAS using the Google Cloud SDK.

 

gsutil is the CLI for Google Cloud Storage. It's a very simple utility to query, download, upload and synchronize files from/to GCS.

 

So, if you have files in GCS that you would like to access from SAS and/or CAS, you can run a gsutil command to synchronize them to a local directory. You can do this outside of SAS, or from SAS if you like running X commands.

 

Example:

 

# Synchronize a bucket (source) with a local directory (destination)
gsutil -m rsync -d -r gs://gcpdm-test/data /tmp/gcpdm_data_rsync
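
The other gsutil operations mentioned above (query, download, upload) follow the same pattern. The sketch below is only illustrative: it reuses the gcpdm-test bucket from the example, while the local directory /tmp/gcpdm_data and the results.csv file name are assumptions. The gsutil calls are skipped when the Cloud SDK is not installed.

```shell
#!/bin/sh
# Bucket prefix from the example above; the local directory is an assumed path.
SRC="gs://gcpdm-test/data"
DEST="/tmp/gcpdm_data"

# Make sure the local destination exists before copying into it.
mkdir -p "$DEST"

if command -v gsutil >/dev/null 2>&1; then
    # Query: list the objects under the prefix, with size and timestamp.
    gsutil ls -l "$SRC"

    # Download a single object into the local directory.
    gsutil cp "$SRC/contact_list.csv" "$DEST/"

    # Upload a local file back to the bucket (results.csv is a hypothetical name).
    if [ -f "$DEST/results.csv" ]; then
        gsutil cp "$DEST/results.csv" "$SRC/"
    fi

    # Preview what rsync would change without touching anything (-n = dry run).
    gsutil -m rsync -n -r "$SRC" "$DEST"
else
    echo "gsutil not found: install the Google Cloud SDK first" >&2
fi
```

Running the rsync with -n first is a cheap safety net before adding -d, since -d deletes files in the destination that are no longer in the source.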

 

Pros: easy, robust, fast (parallel)

Cons: not a direct access from SAS (2-step process), requires Cloud SDK installation, requires some administration, requires some disk space

REST API - the "web" way

Google Cloud provides tons of APIs, including one of course for Google Cloud Storage. From SAS, it's possible to call this GCS API using PROC HTTP and thus to download files from GCS into SAS or upload files from SAS to GCS. FILENAME URL can also be used with signed URLs.

 

Example:

 

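/* Temporary fileref that receives the downloaded file */
filename outcsv temp ;

/* Download the file using the GCS REST API.                        */
/* The macro variable GCSTOKEN must hold a valid OAuth access token. */
proc http
    url="https://www.googleapis.com/storage/v1/b/gcpdm-test/o/data%2Fcontact_list.csv?alt=media"
    oauth_bearer="&GCSTOKEN"
    out=outcsv ;
    debug level=1 ;
run ;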
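
To check the request outside of SAS, the same call can be sketched with curl. This is a minimal sketch under assumptions: the token is obtained with gcloud auth print-access-token (which requires an installed and authenticated Cloud SDK; the article does not say how GCSTOKEN was produced), the output path /tmp/contact_list.csv is arbitrary, and the encoding step only handles the "/" characters in the object name.

```shell
#!/bin/sh
# Bucket and object taken from the PROC HTTP example.
BUCKET="gcpdm-test"
OBJECT="data/contact_list.csv"

# The object name is a single URL path segment, so each "/" must become %2F.
# Simplification: other special characters are not encoded here.
ENCODED_OBJECT=$(printf '%s' "$OBJECT" | sed 's|/|%2F|g')

# alt=media asks for the object contents instead of its JSON metadata.
URL="https://www.googleapis.com/storage/v1/b/${BUCKET}/o/${ENCODED_OBJECT}?alt=media"
echo "$URL"

# Only attempt the download when both tools are available.
if command -v gcloud >/dev/null 2>&1 && command -v curl >/dev/null 2>&1; then
    # Assumption: gcloud is already authenticated for an account with access to the bucket.
    TOKEN=$(gcloud auth print-access-token)
    curl -sS -H "Authorization: Bearer ${TOKEN}" -o /tmp/contact_list.csv "$URL"
fi
```

The token printed by gcloud is short-lived, which is the "management of OAuth tokens" drawback listed in the cons.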

 

Pros: direct access

Cons: direct access is limited to SAS (not CAS), management of OAuth tokens

gcsfuse - the "dark" way

"gcsfuse is a user-space file system for interacting with Google Cloud Storage". In other words, gcsfuse is a command-line utility allowing you to mount a GCS bucket to a local directory so that the bucket's contents are visible and accessible locally like any other file.

 

Access to the bucket is totally transparent. Any new file in the bucket will be immediately visible in the mount point directory. Any new file in the mount point directory will be immediately visible in the bucket.

 

From a SAS and CAS perspective, it is nothing more than accessing OS directories.
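
Mounting and unmounting a bucket could look like the sketch below. It assumes gcsfuse is installed and that credentials with access to the bucket are already configured. The mount point /tmp/gcs_gcpdm_test is a hypothetical path, and the data directory comes from the earlier examples.

```shell
#!/bin/sh
# Bucket from the earlier examples; the mount point is an assumed local path.
BUCKET="gcpdm-test"
MOUNT_POINT="/tmp/gcs_gcpdm_test"

# The mount point must exist before gcsfuse can use it.
mkdir -p "$MOUNT_POINT"

if command -v gcsfuse >/dev/null 2>&1; then
    # --implicit-dirs exposes "folders" that exist only as prefixes in object names.
    if gcsfuse --implicit-dirs "$BUCKET" "$MOUNT_POINT"; then
        # The bucket contents now look like ordinary files and directories.
        ls -l "$MOUNT_POINT/data"

        # Unmount when done.
        fusermount -u "$MOUNT_POINT"
    fi
else
    echo "gcsfuse not found: nothing mounted" >&2
fi
```

While the bucket is mounted, a path-based CASLIB, a LIBNAME or a FILENAME statement pointing to the mount point is all SAS or CAS needs.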

 

Pros: direct, transparent, data available for SAS and CAS

Cons: some limitations with regard to writing files to GCS, gcsfuse sustainability, permissions management, Linux only, requires gcsfuse installation

BigQuery - the "smart" way

Google BigQuery is a "fully managed, serverless enterprise data warehouse that supports analytics over petabyte-scale data". Besides its querying capabilities, Google BigQuery provides simple ways to import or reference data files located in Google Cloud Storage. For instance, you can create an External Table from a GCS file (basically it's a view on that file) and you will be able to use it from SAS and/or CAS. If the original file is updated, you will get the latest updates in SAS/CAS through the External Table.

 

And the cherry on the cake: BigQuery supports additional file formats such as Avro, so you can load an Avro file from GCS to CAS through Google BigQuery very easily.
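
On the BigQuery side, an External Table over a GCS file can be declared with a DDL statement submitted through the bq CLI. The sketch below is an illustration only: the dataset gcpdm_ds and the table contact_list_ext are hypothetical names, the dataset is assumed to exist already, and the file is the one used in the earlier examples. For an Avro file, the format option would be 'AVRO' instead of 'CSV'.

```shell
#!/bin/sh
# Hypothetical dataset and table names; the URI reuses the earlier example file.
DATASET="gcpdm_ds"
TABLE="contact_list_ext"
URI="gs://gcpdm-test/data/contact_list.csv"

# DDL for an external table: the data stays in GCS and BigQuery reads it at query time.
# No column list is given, so BigQuery is left to detect the schema.
SQL="CREATE EXTERNAL TABLE ${DATASET}.${TABLE}
OPTIONS (
  format = 'CSV',
  uris = ['${URI}']
);"

echo "$SQL"

# Submit the statement only when the bq CLI (part of the Cloud SDK) is available.
if command -v bq >/dev/null 2>&1; then
    bq query --use_legacy_sql=false "$SQL"
fi
```

Once the table exists, SAS or CAS reads it like any other BigQuery table through SAS/ACCESS.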

 

Pros: direct (with BigQuery External Tables), openness to file formats not currently supported by SAS/CAS

Cons: additional SAS/ACCESS license, BigQuery transaction costs, Linux only, one-way only (GCS -> BigQuery -> SAS/CAS)

 

If you have used any other way to access GCS from SAS that you want to share, feel free to comment.

 

Stay tuned for more details on each option.

 

Thanks for reading.
