Accessing files on Google Cloud Storage (GCS) using SAS Viya 3.5 and Cloud Storage FUSE (gcsfuse)

Started 09-04-2020

In a previous article, I wrote about the current possibilities available to access files stored in Google’s object storage implementation: Google Cloud Storage (GCS).

 

Let’s take a deep dive and see how we can access files in GCS using Cloud Storage FUSE.

 

“Cloud Storage FUSE is an open source FUSE adapter that allows you to mount Cloud Storage buckets as file systems on Linux or macOS systems.”

 

Essentially, Cloud Storage FUSE provides a command-line utility, named “gcsfuse”, which helps you mount a GCS bucket to a local directory so that the bucket’s contents are visible and accessible locally like any other file.

 

Access to the bucket is totally transparent. Any new file in the bucket will be immediately visible in the mount point directory. Any new file in the mount point directory will be immediately visible in the bucket.

 

From a SAS and CAS perspective, it’s nothing more than accessing OS directories.

 

There’s a warning though (see the “Caution” section in the documentation):

 

“Cloud Storage FUSE is a Google-developed and community-supported open-source tool, written in Go and hosted on GitHub. It is distributed as-is, without warranties of any kind.”

 

So, it’s probably good for experimentation, performance testing, some migration use cases and for getting acquainted with Google Cloud Storage. It might not be a good fit for a real Viya 3.5 production environment. It certainly has some limitations, which I describe at the end.

 

SAS Viya 4 will bring support for a Google Cloud Storage CASLIB (similar to what we already have with AWS S3 and Azure Data Lake Storage Gen2). In the meantime, gcsfuse provides an opportunity to step into the Google Cloud Storage world and see what benefits it could bring.

 

In terms of pricing, the utility itself is free of charge. However, any data operation on the mount point’s contents (which ultimately reside in GCS) will be charged accordingly.

How to mount a GCS bucket as a local directory?

gcsfuse provides several options to mount your GCS bucket to a local directory. Here is an example:

 

[Image: example gcsfuse mount command (nir_post_57_01_gcsfuse_command.png)]


Basically, I will see locally in /opt/gcs/mount all the files available in my gcpdm-test GCS bucket. For authentication, this utility relies on a service account credentials file that can be obtained easily with a gcloud command (“gcloud iam service-accounts keys create”). If you want to share the mount point with other users, the user who runs gcsfuse must add the allow_other option.
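As a rough sketch of these steps (not necessarily the exact command from the screenshot), the script below prints the commands involved. The bucket name and mount point are the ones used in this article; the key file path and the service account e-mail are hypothetical placeholders. By default it is a dry run that only echoes the commands; setting RUN=1 would execute them.

```shell
#!/usr/bin/env bash
# Sketch only. BUCKET and MOUNT_POINT come from the article; KEY_FILE and
# SA_EMAIL are hypothetical placeholders to replace with your own values.
BUCKET="gcpdm-test"
MOUNT_POINT="/opt/gcs/mount"
KEY_FILE="${KEY_FILE:-/opt/gcs/sa-key.json}"
SA_EMAIL="${SA_EMAIL:-my-sa@my-project.iam.gserviceaccount.com}"

# Print each command by default (dry run); set RUN=1 to execute it instead.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

gcs_mount_steps() {
  # 1. Create a JSON key for the service account.
  run gcloud iam service-accounts keys create "$KEY_FILE" --iam-account="$SA_EMAIL"
  # 2. Make sure the local mount point exists.
  run mkdir -p "$MOUNT_POINT"
  # 3. Mount the bucket; allow_other lets other OS users access the mount.
  run gcsfuse --key-file "$KEY_FILE" -o allow_other "$BUCKET" "$MOUNT_POINT"
}

gcs_mount_steps
```

When the bucket is no longer needed, `fusermount -u /opt/gcs/mount` unmounts it on Linux. Note that on many Linux systems a non-root user can only pass allow_other if user_allow_other is enabled in /etc/fuse.conf.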

Where to mount a GCS bucket as an OS directory?

[Image: where to mount the bucket (nir_post_57_02_where_to_mount.png)]

Well, it depends on what you want to achieve. You have multiple options:

  • On the Compute Server for accessing GCS bucket's files from SAS (libname)

  • On the CAS Controller for accessing GCS bucket's files from CAS (PATH CASLIB)

  • On the CAS Workers for accessing GCS bucket's files from CAS in parallel (DNFS CASLIB)
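For the DNFS case, the bucket has to be mounted on each CAS worker. One way to keep this consistent is to generate the same mount command for every host. The sketch below only prints the ssh commands, it does not run them. The host names and the key file path are hypothetical, and it assumes that the same mount point is used on every node.

```shell
#!/usr/bin/env bash
# Sketch only. BUCKET and MOUNT_POINT come from the article; KEY_FILE and the
# host names passed to plan_worker_mounts are hypothetical.
BUCKET="gcpdm-test"
MOUNT_POINT="/opt/gcs/mount"
KEY_FILE="/opt/gcs/sa-key.json"

# Print one ssh command per host given as argument (nothing is executed).
plan_worker_mounts() {
  local host
  for host in "$@"; do
    echo "ssh $host 'mkdir -p $MOUNT_POINT && gcsfuse --key-file $KEY_FILE -o allow_other $BUCKET $MOUNT_POINT'"
  done
}

plan_worker_mounts cas-worker-1 cas-worker-2
```

Reviewing the printed commands before running them makes it easier to spot a node where the path or key file differs.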

What does the code look like?

It looks like any traditional SAS code accessing local or network paths. Nothing in it is specific to GCS.

 

[Image: sample SAS code (nir_post_57_03_code.png)]

Limitations

From a SAS standpoint, there are a few limitations to using gcsfuse, especially around writing (saving) data from CAS to GCS:

  • Saving CAS data to GCS as a Parquet file does not work (PATH and DNFS)

    When CAS creates a Parquet file, it actually creates a directory: first under a temporary name, which it then renames to the target file name. Because GCS is an object store, renaming a directory is not supported, so the save operation fails (even though the Parquet partitions are created successfully).

  • Saving CAS data to GCS in parallel using DNFS CASLIBs does not work

    Concurrent updates of the same file from different machines are not handled correctly by gcsfuse (there is no concurrency control for multiple writers to a file). Therefore, saving data in SASHDAT or CSV format in parallel using a DNFS CASLIB is not an option.
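To check whether a given directory is affected by the directory-rename limitation, a small probe can help. The sketch below only mimics the "write to a temporary directory, then rename it" pattern described above; it is not what CAS does internally, and the function name is hypothetical. On a local file system it should report that the rename is supported; on a gcsfuse mount the result depends on the gcsfuse version and mount options.

```shell
#!/usr/bin/env bash
# probe_dir_rename DIR
# Creates a temporary directory with one file under DIR, then tries to rename
# that directory. Returns 0 if the rename works, 1 if it fails, and 2 if the
# probe could not even create its test files. Cleans up after itself.
probe_dir_rename() {
  local base="$1"
  local tmp="$base/.probe_tmp_$$"
  local final="$base/.probe_final_$$"

  mkdir -p "$tmp" || return 2
  echo "part" > "$tmp/part-0" || { rm -rf "$tmp"; return 2; }

  if mv "$tmp" "$final" 2>/dev/null; then
    echo "directory rename supported in $base"
    rm -rf "$final"
    return 0
  else
    echo "directory rename NOT supported in $base"
    rm -rf "$tmp"
    return 1
  fi
}
```

For example, `probe_dir_rename /opt/gcs/mount` would run the probe against the mount point used in this article.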

 

Thanks for reading.
