BookmarkSubscribeRSS Feed

Accessing files on Google Cloud Storage (GCS) through Google BigQuery

Started ‎10-14-2020 by
Modified ‎10-14-2020 by
Views 4,351

Last stop on the road to Google Cloud Storage. In previous articles, I’ve been exploring various ways to access Google Cloud Storage from SAS and CAS:

Today, I’m going to focus on using Google BigQuery as a middleware between SAS and Google Cloud Storage.

 

Indeed, Google BigQuery naturally interacts with Google Cloud Storage. It provides very simple ways to inject various file formats from GCS into BigQuery in one command line. The most popular big data file formats are supported (Avro, Parquet, ORC) as well as commodity file formats like CSV and JSON.

 

And since SAS can access Google BigQuery, SAS can access those GCS resources under the covers.

 

You might think at first look that it’s a little bit cumbersome, not direct and requires data duplication. Not really. Google BigQuery provides features that are similar to what Athena offers in the AWS world: interactive queries on data files stored in an object storage infrastructure. This is really useful with the adoption of object storage as data lake support.

 

In Google BigQuery, data files residing in GCS can be imported and thus data is duplicated in BigQuery as TABLES. Or they can simply be referenced as EXTERNAL TABLES, without any duplication. The concept is similar to a view.

 

nir_post_59_01_flow-1024x476.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

The advantage is that if some data files are updated in Google Cloud Storage, querying the EXTERNAL TABLE in BigQuery will take into account the last updates of the files.

 

The bq CLI utility (part of the GCP’s CLI) is the tool to create EXTERNAL TABLES based on GCS files:

 

nir_post_59_02_bqmk.png

 

EXTERNAL TABLES are usable like any other object in the BigQuery query editor:

 

nir_post_59_03_bigquery.png

 

As well as in SAS:

 

libname sprebq bigquery cred_path="/opt/gcs/gcpdm_gcp_key.json"
    project="sas-gelsandbox" schema="gcpdmtest" max_char_len=80 ;

/* Process an External Table */
proc freq data=sprebq.ext_userdata_from_avro ;
    tables gender ;
run ;

cas mysession ;

caslib casbq datasource=(srctype="bigquery",credfile="/opt/gcs/gcpdm_gcp_key.json",
    project="sas-gelsandbox",schema="gcpdmtest",max_char_len=80,readbuff=32767) libref=casbq ;

/* Load an External Table in CAS */
proc casutil incaslib="casbq" outcaslib="casbq" ;
    load casdata="ext_userdata_from_avro" casout="ext_userdata_from_avro" replace ;
quit ;

cas mysession terminate ;

 

In conclusion, SAS can (also) access data files in Google Cloud Storage using SAS/ACCESS to BigQuery. This way enables SAS to support major big data file formats (Avro, Parquet, ORC) including formats that SAS does not support directly (Avro).

 

Thanks for reading.

Version history
Last update:
‎10-14-2020 05:33 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels