In my previous article, I covered how to take advantage of Google Cloud Storage (GCS) from SAS Viya 3.5 using a FUSE adapter (Cloud Storage Fuse). While there is no official support for GCS from a SAS or CAS perspective yet (it should come soon with Viya 4), Cloud Storage Fuse offers an alternative for transparently accessing data files located in GCS from SAS or CAS.
I took this Cloud Storage Fuse evaluation opportunity to go further and collect some performance metrics on Google Cloud Storage. This should give us an idea of the performance we can expect from Google Cloud Storage in Viya 4, once GCS CASLIB support is available.
Let me explain the context.
I created a data set based on my favorite sample table 🙂: PRDSALE. I added 50 numeric variables and 50 VARCHAR variables containing strings of varying length (up to 1,000 characters) and expanded it generously. I then exported it to various file formats of interest.
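For illustration, here is a minimal sketch of the kind of DATA step that can inflate PRDSALE this way. It is not the exact code used: the variable names, random values, and the 400x replication factor are my assumptions (and in the real test the character columns were VARCHARs rather than fixed-length characters).

```sas
/* Minimal sketch (not the exact code used): inflate SASHELP.PRDSALE with */
/* 50 extra numeric variables, 50 extra long character variables          */
/* (up to 1,000 characters), and row replication.                         */
data work.prdsale_big;
   set sashelp.prdsale;                      /* 1,440 original rows           */
   array nums  {50}       num1-num50;
   array longs {50} $1000 str1-str50;
   do rep = 1 to 400;                        /* 1,440 x 400 = 576,000 rows    */
      do i = 1 to 50;
         nums{i}  = ranuni(1) * 1e6;                    /* random numeric values     */
         longs{i} = repeat('A', int(ranuni(2) * 999));  /* strings of varying length */
      end;
      output;
   end;
   drop i rep;
run;
```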
Results:
Metric | Value |
---|---|
# of variables | 113 |
# of observations | 576,000 |
SAS7BDAT file size | 29GB |
SASHDAT file size | 15GB |
CSV file size | 14GB |
PARQUET file size | 2GB |
I set up a SAS Viya 3.5 environment on 11 Compute Engine instances, with 9 CAS workers, because I wanted to measure the impact of DNFS with a large CAS cluster.
In terms of storage, I wanted to compare Google Cloud Storage with something simple, common, similar to local disks, and reasonably acceptable from a cost and performance perspective.
Filestore is Google's fully managed Network Attached Storage (NAS) solution. It's easy to set up and easy to mount on the different Compute Engine instances through NFS. It comes with two main performance tiers, Standard and Premium. I chose Standard with SSD disks, which probably corresponds to medium-class performance.
For the test, I mounted either the Filestore instance (using NFS) or the GCS bucket (using gcsfuse) on either the CAS controller only (for the PATH tests) or on all 10 CAS nodes (for the DNFS tests).
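As an illustration, the CASLIB definitions look like the sketch below. The mount points /mnt/filestore (NFS) and /mnt/gcsbucket (gcsfuse) are hypothetical names; for the DNFS CASLIBs, the storage has to be mounted at the same location on every CAS node.

```sas
/* A minimal sketch of the CASLIBs used for the tests.                      */
/* /mnt/filestore (mounted through NFS) and /mnt/gcsbucket (mounted through */
/* gcsfuse) are hypothetical mount points.                                  */
cas mysess;

/* PATH CASLIBs: the storage only needs to be mounted on the CAS controller */
caslib path_fs  datasource=(srctype="path") path="/mnt/filestore/data" sessref=mysess;
caslib path_gcs datasource=(srctype="path") path="/mnt/gcsbucket/data" sessref=mysess;

/* DNFS CASLIBs: the storage must be mounted identically on all CAS nodes */
caslib dnfs_fs  datasource=(srctype="dnfs") path="/mnt/filestore/data" sessref=mysess;
caslib dnfs_gcs datasource=(srctype="dnfs") path="/mnt/gcsbucket/data" sessref=mysess;
```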
The goal of the test is simply to measure read and write operations for the different file formats across the different storage combinations.
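Concretely, each READ cell in the matrix below corresponds to a CAS load of the file, and each WRITE cell to a CAS save of the resulting table back to the storage, along the lines of this sketch (using the hypothetical CASLIB names defined above; the target format is driven by the file extension):

```sas
/* A sketch of the timed operations, here against the DNFS CASLIB on GCS. */
proc casutil;
   /* READ: load the file into a CAS table */
   load casdata="prdsale.sashdat" incaslib="dnfs_gcs"
        outcaslib="dnfs_gcs" casout="prdsale" replace;

   /* WRITE: save the CAS table back to the storage */
   save casdata="prdsale" incaslib="dnfs_gcs"
        outcaslib="dnfs_gcs" casout="prdsale_out.sashdat" replace;
quit;
```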
File | Size | PATH on Filestore | DNFS on Filestore (*or PATH+DTM="parallel") | PATH on GCS | DNFS on GCS (*or PATH+DTM="parallel") |
---|---|---|---|---|---|
prdsale.sashdat | 15GB | READ, WRITE | READ, WRITE | READ, WRITE | READ, WRITE |
prdsale.parquet | 2GB | READ, WRITE | READ, WRITE | READ, WRITE | READ, WRITE |
prdsale.csv | 14GB | READ, WRITE | READ, WRITE | READ, WRITE | READ, WRITE |
prdsale.sas7bdat | 29GB | READ, WRITE | *READ | READ, WRITE | *READ |
*SAS7BDAT files cannot be read using a DNFS CASLIB. However, they can be read in parallel using a PATH CASLIB with the dataTransferMode="parallel" option (hence the PATH+DTM label), provided the file is available on every CAS node. That is what we measured in those cells. SAS7BDAT files also cannot be written in parallel.
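For reference, here is a sketch of such a parallel SAS7BDAT load through a PATH CASLIB (again using the hypothetical CASLIB names from above):

```sas
/* A sketch of the "PATH+DTM" case: the SAS7BDAT file is read in parallel    */
/* through a PATH CASLIB, provided the file is reachable at the same path on */
/* every CAS node (which the shared NFS/gcsfuse mount guarantees here).      */
proc casutil;
   load casdata="prdsale.sas7bdat" incaslib="path_gcs"
        outcaslib="path_gcs" casout="prdsale" replace
        importoptions=(filetype="basesas" dataTransferMode="parallel");
quit;
```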
Times are expressed in seconds. They are the average of multiple runs.
READ
File | Size | PATH on Filestore | DNFS on Filestore (*or PATH+DTM="parallel") | PATH on GCS | DNFS on GCS (*or PATH+DTM="parallel") |
---|---|---|---|---|---|
prdsale.sashdat | 15GB | 26.66 | 22.24 | 70.03 | 24.55 |
prdsale.parquet | 2GB | 5.82 | 2.65 | 16.25 | 47.58 |
prdsale.csv | 14GB | 29.9 | 13.97 | 66.5 | 14.97 |
prdsale.sas7bdat | 29GB | 44.36 | *18.83 | 142.64 | *21.51 |
*SAS7BDAT files read in parallel using a PATH CASLIB (dataTransferMode="parallel" option).
Read performance comments: DNFS (or the parallel SAS7BDAT load) clearly pays off. It reduces the read times on Filestore and brings the GCS read times close to the Filestore ones for the SASHDAT, CSV, and SAS7BDAT files. The Parquet file is the exception: its DNFS read on GCS is noticeably slower than the serial PATH read.
WRITE
File | Size | PATH on Filestore | DNFS on Filestore | PATH on GCS | DNFS on GCS |
---|---|---|---|---|---|
prdsale.sashdat | 15GB | 35.61 | 18.54 | 111.16 | *KO |
prdsale.parquet | 2GB | 11.68 | 6.2 | **24.98 | **8.03 |
prdsale.csv | 14GB | 60.71 | 45.77 | 150.3 | *KO |
prdsale.sas7bdat | 29GB | 90.26 | ***N/A | 587.14 | ***N/A |
*Cloud Storage FUSE (gcsfuse) does not correctly handle concurrent updates, resulting in unpredictable run times and file corruption.
**Due to GCS limitations, renaming a folder is not possible. Creating a Parquet file from CAS requires renaming a folder, so the final operation fails even though the data files are correctly created. For the sake of completeness, the times have been recorded anyway.
***SAS7BDAT files cannot be written in parallel.
Write performance comments: DNFS also improves the write times on Filestore. Writing to GCS is much slower than writing to Filestore, the parallel SASHDAT and CSV writes to GCS fail because of the gcsfuse concurrent-update issue, and the serial SAS7BDAT write to GCS is by far the slowest operation of the whole test.
Overall, I was expecting GCS to perform worse, because GCS is remote object storage and because of the gcsfuse layer. The observed performance isn't too bad, especially in this "full GCP" architecture where I used GCP virtual machines (Compute Engine instances).
Finally, let's do a rough cost estimate for the two storage options:
Google Filestore | Google Cloud Storage |
---|---|
$768 | $57 |
Generally speaking, cloud object storage has been an important focus for SAS over the last few months. With cloud adoption, customers are moving their data (high volume, high variety) from Hadoop (HDFS) to that kind of storage. When accessed from cloud instances, cloud object storage provides decent performance at a very low cost. GCS will soon be better integrated with SAS, but we can already experiment with its performance using gcsfuse.
Thanks for reading.