
SAS Viya 3.5 and Google Cloud Storage (GCS) Performance Feedback


In my previous article, I covered how to take advantage of Google Cloud Storage (GCS) from SAS Viya 3.5 using a FUSE adapter (Cloud Storage Fuse). While we don’t have official support for GCS from a SAS or CAS perspective yet (it should come soon with Viya 4), Cloud Storage Fuse offers an alternative for transparently accessing data files located in GCS from SAS or CAS.

 

I took this Cloud Storage Fuse evaluation as an opportunity to go further and collect some performance metrics on Google Cloud Storage. This should give us an idea of the performance we can expect from Google Cloud Storage in Viya 4, once GCS CASLIB support is available.

 

Let me explain the context.

The data

I created a data set based on my favorite sample table 🙂: PRDSALE. I added 50 numeric variables and 50 VARCHAR variables containing strings of varying length (up to 1,000 characters), and replicated the rows to inflate the table. I then exported it to the various file formats of interest.
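
If you want to build something similar, here is a minimal sketch of the idea; the variable names, replication factor, and random-value logic are illustrative, not the exact code I used. (In a CAS-side DATA step, the character columns could be declared as true VARCHARs; a fixed-width $1000 stands in here for simplicity.)

/* Illustrative sketch: inflate SASHELP.PRDSALE with extra variables */
data work.prdsale_big;
   set sashelp.prdsale;                    /* 1,440 original rows          */
   array nums {50} num1-num50;             /* 50 extra numeric variables   */
   array chars {50} $1000 char1-char50;    /* 50 extra long strings        */
   do rep = 1 to 400;                      /* 1,440 x 400 = 576,000 rows   */
      do i = 1 to dim(nums);
         nums{i}  = ranuni(0) * 1000;
         /* strings of varying length, up to ~1,000 characters */
         chars{i} = repeat(byte(65 + mod(i, 26)), int(ranuni(0) * 999));
      end;
      output;
   end;
   drop rep i;
run;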

 

Results:

 

# of variables        113
# of observations     576,000
SAS7BDAT file size    29GB
SASHDAT file size     15GB
CSV file size         14GB
PARQUET file size     2GB

The architecture

I set up a SAS Viya 3.5 environment on 11 Compute Engine instances, including 9 CAS workers, because I wanted to measure the impact of DNFS with a large CAS cluster.

 

[Architecture diagram: SAS Viya 3.5 on GCP with Filestore and Google Cloud Storage (nir_post_58_01_gcp_gcs_filestore_arch.png)]

 

In terms of storage, I wanted to compare Google Cloud Storage with something simple, common, similar to local disks, and reasonably acceptable from a cost and performance perspective.

 

Filestore is Google's fully managed Network Attached Storage (NAS) solution. It’s easy to set up and easy to mount on the different Compute Engine instances through NFS. It comes with 2 main performance classes, Standard and Premium. I chose Standard with SSD disks, which probably corresponds to medium-class performance.

 

For the test, I mounted either the Filestore instance (using NFS) or the GCS bucket (using gcsfuse) on either the CAS controller only (for the PATH tests) or on all 10 CAS nodes (for the DNFS tests).
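
As an illustration, the two access patterns differ only in the CASLIB source type. Here is a minimal sketch, assuming an existing CAS session and hypothetical mount points such as /mnt/gcs:

/* Serial access: the path is read/written by the CAS controller only */
caslib gcs_path path="/mnt/gcs" datasource=(srctype="path") global;

/* Parallel access: the same path must be mounted on every CAS node */
caslib gcs_dnfs path="/mnt/gcs" datasource=(srctype="dnfs") global;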

The test scenarios

The goal of the test is simple: measure read and write operations on the different file formats across the different storage combinations.

 

File               Size   PATH on Filestore   DNFS on Filestore (*or PATH+DTM="parallel")   PATH on GCS    DNFS on GCS (*or PATH+DTM="parallel")
prdsale.sashdat    15GB   READ, WRITE         READ, WRITE                                   READ, WRITE    READ, WRITE
prdsale.parquet    2GB    READ, WRITE         READ, WRITE                                   READ, WRITE    READ, WRITE
prdsale.csv        14GB   READ, WRITE         READ, WRITE                                   READ, WRITE    READ, WRITE
prdsale.sas7bdat   29GB   READ, WRITE         *READ                                         READ, WRITE    *READ

 

*SAS7BDAT files cannot be read through a DNFS CASLIB. However, they can be read in parallel through a PATH CASLIB with the dataTransferMode="parallel" option (hence the PATH+DTM label), provided the file is available on every CAS node; that is what we measured in those cells. They cannot be written in parallel at all.
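
For reference, here is what such a parallel load of a SAS7BDAT file could look like with PROC CASUTIL; the CASLIB and table names are illustrative:

proc casutil;
   /* Load a .sas7bdat file in parallel: every CAS worker reads its share,
      which requires the file to be visible at the same path on all nodes */
   load casdata="prdsale.sas7bdat" incaslib="gcs_path"
        casout="prdsale" outcaslib="casuser"
        importoptions=(filetype="basesas" dataTransferMode="parallel");
quit;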

The metrics

Times are expressed in seconds. They are the average of multiple runs.

 

READ

 

File               Size   PATH on Filestore   DNFS on Filestore (*or PATH+DTM="parallel")   PATH on GCS   DNFS on GCS (*or PATH+DTM="parallel")
prdsale.sashdat    15GB   26.66               22.24                                         70.03         24.55
prdsale.parquet    2GB    5.82                2.65                                          16.25         47.58
prdsale.csv        14GB   29.90               13.97                                         66.50         14.97
prdsale.sas7bdat   29GB   44.36               *18.83                                        142.64        *21.51

 

*SAS7BDAT files read in parallel using a PATH CASLIB (dataTransferMode="parallel" option).

 

Read performance comments:

  • Reading GCS files serially from a PATH CASLIB (from the CAS controller) is 2 to 3 times slower than reading the same files serially from Filestore.
  • However, reading GCS files in parallel from a DNFS CASLIB (from all CAS nodes) is only up to 10% slower (Parquet files excepted) than reading the same files in parallel from Filestore. Reading files in parallel tends to minimize the performance gap between Filestore and GCS.
  • Parquet is the exception to the rule (47.58 seconds): reading Parquet files in parallel from GCS is surprisingly bad.
  • Parallel reading (DNFS) with 9 workers from Filestore is generally 2 times faster than serial reading on the same file system. Parallel reading (DNFS) has an even bigger impact with GCS (2, 4, and 6 times faster in our cases).

WRITE

 

File               Size   PATH on Filestore   DNFS on Filestore   PATH on GCS   DNFS on GCS
prdsale.sashdat    15GB   35.61               18.54               111.16        *KO
prdsale.parquet    2GB    11.68               6.20                **24.98       **8.03
prdsale.csv        14GB   60.71               45.77               150.30        *KO
prdsale.sas7bdat   29GB   90.26               ***N/A              587.14        ***N/A

 

*Cloud Storage FUSE (gcsfuse) does not correctly handle concurrent updates, resulting in unpredictable run times and file corruption.

 

**Due to a GCS limitation, renaming a folder is not possible, and creating a Parquet file from CAS requires renaming a folder. The final operation fails, but the data files are correctly created. Times have nevertheless been recorded for the sake of completeness.

 

***SAS7BDAT files cannot be written in parallel.
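
For reference, the write tests come down to saving an in-memory CAS table back to the target CASLIB. A minimal sketch, with illustrative CASLIB and table names:

proc casutil;
   /* Save the in-memory table back to the CASLIB; the output format is
      inferred from the extension (.sashdat, .csv, .parquet, .sas7bdat) */
   save casdata="prdsale" incaslib="casuser"
        casout="prdsale.parquet" outcaslib="gcs_dnfs" replace;
quit;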

 

Write performance comments:

  • Similarly, writing files serially to GCS from a PATH CASLIB (from the CAS controller) is 2 to 3 times slower (SAS7BDAT excepted) than writing the same files serially to Filestore.
  • Writing SAS7BDAT files is more than 6 times slower on GCS than on Filestore.
  • We don’t have enough data points to assess the impact of DNFS on writes to GCS; gcsfuse is simply too limited in this regard.
  • DNFS’ parallel write capabilities (here, 9 CAS workers writing simultaneously) provide up to a 2x performance improvement when writing to Google Filestore.

Overall, I expected GCS to perform worse, both because GCS is remote object storage and because of the gcsfuse layer. The performance observed isn’t bad at all, especially in this “full GCP” architecture where the clients are GCP virtual machines (Compute Engines).

The cost

Let’s finally do a rough estimate of the cost with the following assumptions:

  • A 1-month bill
  • 2560GB of data (the minimum capacity for the Filestore Basic SSD tier)
  • For GCS, an arbitrary count of 1 million Class A operations and 1 million Class B operations (GCS operations on data are categorized depending on their nature: get, insert, copy, delete, etc.)
Google Filestore   Google Cloud Storage
$768               $57
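
For the record, here is roughly how that estimate breaks down, assuming the GCP list prices at the time (around $0.30/GB/month for Filestore Basic SSD; about $0.02/GB/month for GCS Standard storage, plus roughly $0.005 per 1,000 Class A and $0.0004 per 1,000 Class B operations):

Filestore:  2560GB x $0.30/GB/month                                        = $768/month
GCS:        2560GB x $0.02/GB/month + ~$5.00 (Class A) + ~$0.40 (Class B)  ≈ $57/month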

Conclusion

Generally speaking, cloud object storage has been an important focus for SAS over the past months, and customers, as they adopt the cloud, are moving their data (high volume, high variety) from Hadoop (HDFS) to that kind of storage. When accessed from cloud instances, cloud object storage provides decent performance at a very low cost. GCS will soon be better integrated with SAS, but we can already experiment with its performance using gcsfuse.

 

Thanks for reading.
