You might have missed it, but if the SAS users within your organization have begun using Google Cloud, an important new offering released in 19w34 may help them: the SAS/ACCESS engine for Google BigQuery.
In this article, I'll cover some key points, briefly introduce Google BigQuery, show how to implement the connection from SAS and CAS to BigQuery, and try to answer some of the typical questions a technical architect/integration specialist would ask.
Key points
Here is the official SAS documentation.
BigQuery is an enterprise data warehouse provided as a Google Cloud Platform service. According to Google, BigQuery can process billions of rows in seconds. It is scalable and has an in-memory engine which provides various analytical capabilities.
Once the data has been moved into BigQuery, you can run queries using good old SQL, leveraging the processing and architecture features of Google's infrastructure.
It is very easy to load and query BigQuery, either using the SQL UI (available in the GCP console) or using the command line (if the Google Cloud SDK has been installed).
The BigQuery documentation provides quick-start guides for both methods: the Web UI and the command-line tools.
You first need to define a "Dataset", in which you can create your tables from scratch with some SQL code, or import data by uploading files from your PC or from the server where the "bq" command line is installed. You can also import from other GCP services such as Google Cloud Storage (the AWS S3 equivalent), Google Drive, or BigTable.
BigQuery also has a "Data Transfer Service" with various connectors and agents that can be used to migrate data from other SaaS providers, Cloud storage solutions such as AWS S3, or Cloud or on-premises data warehouses (such as Amazon Redshift or Teradata) into BigQuery.
For the purpose of this article, we loaded the names.csv file using the "upload" method, pointing directly at a web resource from this datasets catalog, and called our BQ table "babynames".
Once you have loaded your data, you can query it either from the BigQuery Web UI
or with the "bq" command line:
[centos@frarpo-jumphost ~]$ bq show sas-gel:sampleds.babynames
Table sas-gel:sampleds.babynames
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Labels
----------------- --------------------- ------------ ------------- ------------ ------------------- --------
27 Aug 12:11:42 |- BRTH_YR: integer 13962 789099
|- GNDR: string
|- ETHCTY: string
|- NM: string
|- CNT: integer
|- RNK: integer
The BigQuery engine exposes a REST API for easy programmatic access and application integration. Client libraries are available in Java, Python, Node.js, C#, Go, Ruby, and PHP. And finally, there are a variety of third-party tools that you can use to interact with BigQuery, such as visualizing the data or loading the data.
Before trying out the example below, it's worth taking a moment to discuss the costs of using BigQuery.
As of October 7th, 2019, BigQuery charges for data storage, streaming inserts, and querying data.
But loading and exporting data are free of charge.
Also note that the service is free for up to 1 TB of queried data per month and 10 GB of stored data.
Finally, Google provides a convenient cost calculator where you can estimate the costs specifically for BigQuery.
Hopefully, you now have a rough idea of what BigQuery is; for more details, the best information source is the official Google Cloud page.
Of course, Google is not the first to offer this type of Cloud service, sometimes called "DBaaS" for "Database as a Service" (you might have heard of Amazon Redshift or Azure SQL Data Warehouse).
Now let's see how SAS can work with it!
BigQuery is a "serverless" Cloud data warehouse. This means that when you want to connect to it, you don't have to provide a server name and port, or even a user ID.
You are simply contacting a specific database service that is made available for your Cloud tenant (which is called a “project” in GCP).
Now the question is: how do we authenticate?
While several authentication methods exist in the Google Cloud Platform to call the GCP APIs, Google usually recommends using a service account that is associated with the GCP project (as opposed to a specific user).
In addition, the current version of the SAS/ACCESS interface to Google BigQuery only supports authentication with a service account (OAuth authentication is on the roadmap for future releases of SAS Viya).
So, let's see how to do this authentication with the service account.
If you have the appropriate privileges, you can use the GCP console to create a private key for the service account attached to your Google project and download the related credentials as a JSON file.
A private key is stored inside the generated JSON file along with the service account details (email, ID, project ID, etc.). Several JSON files (each containing a different private key) can be generated for the same or different service accounts and used to gain access to the APIs.
As explained in the SAS documentation, to run a successful SAS LIBNAME statement for the Google BigQuery engine we need to provide the path to the credentials file, the GCP project, and the schema (the BigQuery dataset).
Assuming we have made the .JSON file (containing our service account credentials) available on the machine hosting the Compute Server, we should be able to run something like:
LIBNAME BQ bigquery
CRED_PATH='/home/viyademo01/bq/sas-gel-ae7285ebb812.json'
PROJECT='sas-gel'
SCHEMA='sampleds';
At first, it does not work...
But that is only because, the first time we access the service, we need to enable the BigQuery API.
So, if we open the provided URL (and make sure we are signed in with the Google account associated with our project), we see something like:
If we click on the "Enable" button and try again to run our SAS Libname statement:
Hurrah! It is working now!
Now, we can see the table's columns and open it, just like with any other SAS/ACCESS interface.
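As a quick sanity check, here is a minimal sketch that queries the table through the new libref. It assumes the BQ libref defined above and the babynames columns (NM, CNT) shown earlier in the "bq show" output:
/* Browse a few rows through the BigQuery libref */
proc print data=BQ.babynames(obs=5);
run;

/* Summarize the table with PROC SQL */
proc sql outobs=10;
   select NM, sum(CNT) as total_count
   from BQ.babynames
   group by NM
   order by 2 desc;
quit;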
To allow CAS to access the BigQuery data directly, we need to create a "bigquery" type caslib.
Here is the code to create a Google BigQuery caslib:
cas mysession;
caslib gbq desc='Google BigQuery Caslib'
   dataSource=(srctype='bigquery'
               credfile='/opt/sas/viya/bqdc/sas-gel-ae7285ebb812.json'
               project='sas-gel'
               schema='sampleds');
The key difference is that, this time, the credential file must be available on the CAS Controller as that's the place where access to BigQuery is done.
Also, if you are using the Viya visual interfaces or SAS Studio V, remember that (by default) the CAS actions run under the "cas" account identity. So, the credential file ("sas-gel-ae7285ebb812.json" in this example) must be located in a folder where the "cas" account can read it.
Otherwise (for example, if you place it in a user's personal folder), you will get a message like:
ERROR: The connection to the data source driver failed.
ERROR: Error: stat /home/viyademo01/bqdc/sas-gel-ae7285ebb812.json: permission denied
ERROR: Function failed.
Now we can list and load data from the Google BigQuery table:
proc casutil;
list files incaslib="gbq";
load casdata="babynames" incaslib="gbq" outcaslib="casuser" casout="names_from_gbqc";
list files incaslib="casuser";
contents casdata="%upcase(names_from_gbqc)" incaslib="casuser";
quit;
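Loading also works in the other direction. Here is a minimal sketch (not part of the original example) that saves a CAS table back to the BigQuery dataset through the same caslib; "babynames_copy" is a hypothetical target table name:
proc casutil;
   /* Write the in-memory CAS table back to BigQuery via the gbq caslib */
   /* "babynames_copy" is a hypothetical table name in the sampleds dataset */
   save casdata="names_from_gbqc" incaslib="casuser"
        outcaslib="gbq" casout="babynames_copy";
quit;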
So, can data be loaded from BigQuery into CAS in parallel? Strictly speaking: no.
That is because what is generally called "parallel loading" refers to the SAS Embedded Process load mechanism, where all the distributed database nodes send or receive data in parallel across multiple CAS workers.
Currently (October 2019), this is only possible with Teradata, Hadoop, or Spark, as they are the only data systems for which we offer the SAS In-Database Technologies product on SAS Viya.
However, in a CAS MPP deployment, what you can do with the SAS/ACCESS engine for Viya and the Data Connector is perform "multi-node loading," where each CAS worker accesses a slice of the data. Although this technique does not guarantee parallel transfer on either side, it can be a more flexible solution than serial loading, since the workload and communications are spread across all the CAS workers (instead of simply using the CAS controller as a single point of interaction).
For information about the loading options available to users of SAS Viya, please take a look at the excellent 2019 SGF paper written by Rob Collum of SAS.
So, let's see how to make it happen.
Here is the code:
cas mysession;
caslib gbqmn desc='Google BigQuery Multi-Node Caslib'
dataSource=(srctype='bigquery'
credfile='/opt/sas/viya/bqdc/sas-gel-ae7285ebb812.json'
project='sas-gel'
schema='sampleds'
NUMREADNODES=10
NUMWRITENODES=10);
proc casutil;
list files incaslib="gbqmn";
load casdata="babynames" incaslib="gbqmn" outcaslib="casuser" casout="names_from_gbqmn";
list files incaslib="casuser";
contents casdata="%upcase(names_from_gbqmn)" incaslib="casuser";
quit;
The only change is the addition of NUMREADNODES and NUMWRITENODES (please refer to Uttam Kumar's article if you want more details on CAS Multi-Node data loading).
However, if you only do that, then no matter how many CAS workers you have, you will see a message saying that only one worker was used (actually, it is the CAS controller doing the standard serial load):
WARNING: The value of numReadNodes(5) exceeds the number of available worker nodes(1). The load will proceed with numReadNodes=1.
The trick here is that we need to make the credential file available and readable by the "cas" account at the same path (/opt/sas/viya/bqdc/sas-gel-ae7285ebb812.json) on all the CAS workers.
So, you can either make the credential file available through a shared file system location, or copy it to the exact same location on every CAS worker host, making sure the "cas" account can read it.
When you do that, the log shows a message confirming the multi-node loading:
87 proc casutil;
NOTE: The UUID '92291f20-72e5-4344-8032-e0606a0d2191' is connected using session MYSESSION.
88 /*list files incaslib="gbqmn";*/
89 load casdata="babynames" incaslib="gbqmn" outcaslib="casuser"
90 casout="names_from_gbqmn";
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(2). The load will proceed with numReadNodes=2.
NOTE: Cloud Analytic Services made the external data from babynames available as table NAMES_FROM_GBQMN in caslib
CASUSER(viyademo01).
NOTE: The Cloud Analytic Services server processed the request in 32.891526 seconds.
91 /*list files incaslib="casuser";
92 contents casdata="%upcase(names_from_gbqmn)" incaslib="casuser";*/
93 quit;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 32.90 seconds
cpu time 1.02 seconds
Now we are sure that the data loading is done in multi-node mode.
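If you want to confirm how the rows were spread out, one option is to ask CAS for per-node details on the loaded table. Here is a sketch, assuming the table.tableDetails CAS action (with its level parameter) is available in your Viya release:
proc cas;
   /* Show per-node details (for example, rows per worker) for the loaded table */
   table.tableDetails / caslib="casuser" name="names_from_gbqmn" level="node";
run;
quit;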
What about encryption of the data in transit? Well... it is not really a question for us (SAS consultants), but more for Google, as the encryption has to happen between the BigQuery client and the BigQuery engine (there is no option to control that from the SAS/ACCESS interface perspective).
But with a bit of research, here is what we can learn from the official Google documentation:
"To protect your data as it travels over the Internet during read and write operations, GCP uses Transport Layer Security (HTTPS). For more information, see Encryption in transit in GCP.
Within Google data centers, your data is encrypted when it is transferred between machines."
See this whitepaper for more details.
While it is possible in BigQuery to set permissions and predefined or custom Cloud IAM roles, the current version of the SAS/ACCESS engine only supports service account authentication.
Therefore, there is no way to implement permissions at the end-user level when accessing with the SAS/ACCESS engine.
However, it is possible to define multiple service accounts within the same Google project with different predefined or even custom roles (associated with permissions such as dataset or table creation, deletion, and read/write privileges), as shown in the sketch below.
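For illustration, here is a sketch of what that could look like on the SAS side: two caslibs pointing at the same project, each using a different service account key file (the file names here are hypothetical), for example one read-only and one read/write:
/* Hypothetical credential files for two service accounts with different roles */
caslib gbq_ro desc='BigQuery - read-only service account'
   dataSource=(srctype='bigquery'
               credfile='/opt/sas/viya/bqdc/sa-readonly.json'
               project='sas-gel'
               schema='sampleds');

caslib gbq_rw desc='BigQuery - read/write service account'
   dataSource=(srctype='bigquery'
               credfile='/opt/sas/viya/bqdc/sa-readwrite.json'
               project='sas-gel'
               schema='sampleds');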
It is also interesting to note that, to date, with Google BigQuery, you cannot grant permissions on specific tables, views, columns, or rows. A dataset is the lowest-level resource that supports access controls in BigQuery.
As more companies and organizations look to reduce the total cost of ownership (TCO) of their existing data warehouse implementations, public Cloud database offerings (Amazon Redshift, Azure SQL Data Warehouse, BigQuery), Cloud object storage (AWS S3, Azure Data Lake Storage, Google Cloud Storage), and SaaS offerings (Snowflake, etc.) are becoming attractive and viable options.
It is good to see SAS developing new SAS/ACCESS engines, data connectors, and CAS native access (caslibs) to ensure that our customers can continue to leverage the power of SAS analytics when they decide to move their data to the Cloud.
Thanks for reading and good luck with using SAS/ACCESS to BigQuery.