SAS Communities Library

We’re smarter together. Learn from this collection of community knowledge and add your expertise.
BookmarkSubscribeRSS Feed

Accessing files on Google Cloud Storage (GCS) through Google BigQuery

Started ‎10-14-2020 by
Modified ‎10-14-2020 by
Views 5,479

Last stop on the road to Google Cloud Storage. In previous articles, I’ve been exploring various ways to access Google Cloud Storage from SAS and CAS:

Today, I’m going to focus on using Google BigQuery as a middleware between SAS and Google Cloud Storage.

 

Indeed, Google BigQuery naturally interacts with Google Cloud Storage. It provides very simple ways to inject various file formats from GCS into BigQuery in one command line. The most popular big data file formats are supported (Avro, Parquet, ORC) as well as commodity file formats like CSV and JSON.

 

And since SAS can access Google BigQuery, SAS can access those GCS resources under the covers.

 

You might think at first look that it’s a little bit cumbersome, not direct and requires data duplication. Not really. Google BigQuery provides features that are similar to what Athena offers in the AWS world: interactive queries on data files stored in an object storage infrastructure. This is really useful with the adoption of object storage as data lake support.

 

In Google BigQuery, data files residing in GCS can be imported and thus data is duplicated in BigQuery as TABLES. Or they can simply be referenced as EXTERNAL TABLES, without any duplication. The concept is similar to a view.

 

nir_post_59_01_flow-1024x476.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

The advantage is that if some data files are updated in Google Cloud Storage, querying the EXTERNAL TABLE in BigQuery will take into account the last updates of the files.

 

The bq CLI utility (part of the GCP’s CLI) is the tool to create EXTERNAL TABLES based on GCS files:

 

nir_post_59_02_bqmk.png

 

EXTERNAL TABLES are usable like any other object in the BigQuery query editor:

 

nir_post_59_03_bigquery.png

 

As well as in SAS:

 

libname sprebq bigquery cred_path="/opt/gcs/gcpdm_gcp_key.json"
    project="sas-gelsandbox" schema="gcpdmtest" max_char_len=80 ;

/* Process an External Table */
proc freq data=sprebq.ext_userdata_from_avro ;
    tables gender ;
run ;

cas mysession ;

caslib casbq datasource=(srctype="bigquery",credfile="/opt/gcs/gcpdm_gcp_key.json",
    project="sas-gelsandbox",schema="gcpdmtest",max_char_len=80,readbuff=32767) libref=casbq ;

/* Load an External Table in CAS */
proc casutil incaslib="casbq" outcaslib="casbq" ;
    load casdata="ext_userdata_from_avro" casout="ext_userdata_from_avro" replace ;
quit ;

cas mysession terminate ;

 

In conclusion, SAS can (also) access data files in Google Cloud Storage using SAS/ACCESS to BigQuery. This way enables SAS to support major big data file formats (Avro, Parquet, ORC) including formats that SAS does not support directly (Avro).

 

Thanks for reading.

Version history
Last update:
‎10-14-2020 05:33 PM
Updated by:
Contributors

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

SAS AI and Machine Learning Courses

The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.

Get started

Article Labels