🎄 Is your organization working with data in Google Cloud and wondering how to integrate your SAS data? Look no further – this jolly holiday tip will gift you some wonderful, cost-efficient solutions! 🎁
SAS offers 3 practical ways of accessing data in Google Cloud:
an optimized access engine for Google Cloud BigQuery
read and write files stored in Google Cloud Storage, including parquet files
a Spark access engine to work with data from Google DataProc
See below for an illustration of these access methods:
These access methods are supported on both SAS 9.4 and SAS Viya. The latest performance improvements for numerical data with BigQuery in SAS Viya have also been included in the SAS 9.4 M7 release. See below for a video showing you how to do this, or read on in this juletip.
With CAS on SAS Viya, you have even more options, benefiting from our in-memory parallel analytical engine for enhanced performance.
Getting access
🎄 First things first – getting access! The easiest way is to use a JSON-based key file, which you can request from your Google Cloud administrator. Generating one based on a service account is convenient, as we can reuse it across Google resources like BigQuery and Google Cloud Storage. See below for a screenshot of the IAM & Admin page in the Google Cloud Console showing how this may look:
This service account should only have the necessary permissions – in this case, reading and writing to a Google Cloud Storage bucket and accessing BigQuery (read and write). Make sure to store this key file in a safe, protected location within your SAS system, such as your home folder. 🎁
The JSON file typically looks like this:
{
"type": "service_account",
"project_id": "<your-google-project-id>",
"private_key_id": "<generated id>",
"private_key": "<-BEGIN PRIVATE KEY--a-quite-long-key--END PRIVATE KEY--\n>",
"client_email": "<an email address for the service account>",
"client_id": "<client-id>",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url":"<generated>",
"universe_domain": "googleapis.com"
}
Google BigQuery
Let's start with BigQuery. SAS has simplified integration with various data sources, including Google BigQuery, through the libname statement. With the key-file that grants access to BigQuery, setting up a libname for BigQuery is straightforward:
libname bqlib bigquery cred_path="<path to key file>"
project="<project name>"
schema="google dataset name"
bulkload=YES
scanstringcolumns=yes ;
The bulkload option is crucial for ensuring efficient write operations, while the scanstringcolumns option directs the access engine to scan string columns to determine their appropriate lengths.
Using this libname, SAS code can interact with data in BigQuery just as it would with a path-based libname on a local filesystem. This approach is highly practical for making SAS data available in BigQuery for use outside of SAS, and not least for accessing data in BigQuery that was produced outside of SAS.
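As a minimal sketch, assuming the bqlib libname above is assigned and your BigQuery dataset contains a hypothetical table called sales with columns region and amount, reading and writing could look like this:
/* A minimal sketch: bqlib is the libname assigned above; the table   */
/* sales and its columns region and amount are hypothetical examples. */
proc sql;
   create table work.sales_by_region as
   select region, sum(amount) as total_amount
   from bqlib.sales
   group by region;
quit;

/* Write a SAS table to BigQuery; bulkload=YES on the libname keeps this efficient */
data bqlib.cars_copy;
   set sashelp.cars;
run;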
For example, simply by changing the libname to point to BigQuery, the SAS Studio flows below will write to the three BigQuery tables on the right. SAS DI Studio on SAS 9.4 will look similar and offers a comparable approach to making data available in Google BigQuery.
You may notice that flows using BigQuery libnames might run somewhat slower compared to local path-based libnames. This is because BigQuery is accessed over the network from the SAS environment, and depending on the network configuration, some latency may be involved. The bulkload option helps mitigate this by storing data in a file close to BigQuery before bulk-inserting it into BigQuery.
With SAS Viya, you can leverage the in-memory parallel analytics CAS engine for enhanced performance. By using a CASLIB instead of a libname, you can load data into CAS and explore it directly from memory with SAS Visual Analytics. Since the data is already in memory, you won't need to send new queries to BigQuery, which helps reduce costs.
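As a sketch of how that could look, the snippet below sets up a CASLIB to BigQuery using the same key file, project, and dataset placeholders as above and loads a table into CAS. The table name sales is hypothetical, and the exact data connector options may vary with your SAS Viya version:
/* A sketch: same key file, project and dataset placeholders as above; */
/* the table name sales is hypothetical.                               */
cas mysession;

caslib bqcas datasource=(srctype="bigquery"
                         credfile="<path to key file>"
                         project="<project name>"
                         schema="<google dataset name>");

/* Load a BigQuery table into CAS memory, ready for SAS Visual Analytics */
proc casutil;
   load casdata="sales" incaslib="bqcas"
        casout="sales" outcaslib="bqcas" promote;
run;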
Data loading into CAS can be done in parallel across CAS worker nodes, and we've optimized the access engine for both loading data into CAS and writing from CAS to BigQuery. If you're curious about how this can be done, feel free to ask us in the comment field.
Google Cloud Storage
Google Cloud Storage (GCS) offers a practical solution for saving files for use outside of SAS or for loading data into SAS from external sources. You can save various types of files, such as CSV, Excel, and even sas7bdat files, providing a secure place for later use.
Here's how to save a sas7bdat file to a bucket in GCS using the same key mentioned above from your SAS system:
/* Writing a sas7bdat to GCS */
filename sasin "<path of sas7bdat file in your sas system to export to gcs>" recfm=n lrecl=256 ;
/* The file to be created will be accessible on GCS */
filename gcsout gcs "<path to file in your bucket>"
key="<path to key file> "
bucket="<name of bucket>" recfm=n lrecl=256 ;
options msglevel=i ;
/* Copy the SAS data set to GCS */
data _null_ ;
rc=fcopy('sasin','gcsout') ;
run ;
If you want to export a SAS dataset to a CSV file, use this code snippet:
/* Reference output file on GCS */
filename gcsout gcs "<path and file in the bucket>"
key="<your key file>"
bucket="<name of bucket>" ;
/* Export a SAS data set to a csv file on GCS */
proc export data=sashelp.cars file=gcsout dbms=csv replace ;
run ;
Importing data from a CSV file in GCS to a SAS dataset is similar:
filename gcsin gcs "<path and file in the bucket>"
key="<your key file>"
bucket="<name of bucket>" ;
/* Import a csv file on GCS into a SAS data set */
proc import datafile=gcsin out=work.<imported data set> dbms=csv ;
getnames=yes;
run ;
Google Cloud Storage is "bottomless," allowing you to easily add storage capacity as needed. This can be particularly useful if the storage mounted to SAS is reaching its limit.
Working with parquet files in GCS
With the ability to load and save to Google Cloud Storage, you may also want to convert SAS data to an open file format like Parquet. You may even want to save data from Google BigQuery as Parquet files in Google Cloud Storage to save on BigQuery queries. Compressed Parquet files save space and integrate conveniently with the world outside of SAS. SAS has a Parquet engine that lets you read Parquet files directly from Google Cloud Storage quite efficiently. You can set up a libname towards the bucket as follows:
libname pqfiles parquet "<path in your GCS bucket>"
storage_platform="gcs"
storage_bucket_name="<bucket name>"
storage_gcs_key_file="<path to key file>";
Reading from such a libname works fine for most data types, including float, integer, and strings. Dates typically come across as strings, so some reformatting may be needed based on how the date has been formatted in the string. For instance, if the data is formatted as YYYY-MM-DD in the column Date, you can reformat it as follows:
data work.data;
set pqfiles.data;
Datefield = input(Date, yymmdd10.);
format Datefield yymmdd10.;
drop Date;
rename Datefield = Date;
run;
Unfortunately, when writing to such a libname, not all data types get converted correctly; dates and floats may not be handled properly. With a little help from Python and a user-contributed sasdata2parquet function, we can sort this out. On SAS Viya with Python installed and enabled, you can use Python inside SAS Studio. Alternatively, you can use your own Python environment with saspy installed by running:
pip install saspy
You need a configuration file to connect and authenticate to your SAS environment. Save this file with a .py extension in a well-protected area, such as your home folder, as sascfg_personal.py:
SAS_config_names=['httpsviya']
SAS_config_options = {'lock_down': False,
'verbose' : True,
'prompt' : True
}
SAS_output_options = {'output' : 'html5'} # not required unless changing any of the default
httpsviya = {'ip' : '<server host name fully qualified>',
'context' : 'SAS Studio compute context',
'user' : '<your user id on sas>',
'pw' : '<your password on sas>',
'options' : ["fullstimer", "memsize=1G"]
}
For more details and other authentication options, check out the documentation of saspy at Getting started — saspy 5.101.0 documentation. With the configuration in place, you can convert your SAS dataset to a Parquet file with the following Python code:
import saspy
sas = saspy.SASsession(cfgname='httpsviya', cfgfile='<path to your configuration file>')
log = sas.saslib(libref='EXPORTLIB', path='<path to folder where you want to export the SAS data set from>')
sas.sd2pq(parquet_file_path='<local path where you want to save your parquet file>', table='TOY_SUMMARY', libref='EXPORTLIB')
Our newfound ability to save SAS data to Parquet files also lets us save data from BigQuery to Parquet using the same approach. This lands BigQuery data in cost-efficient, convenient storage, ready for exploration and analytical consumption, and it can save considerably compared to querying BigQuery repeatedly, since each load from BigQuery incurs cost in Google Cloud. We just need to set up a libref to BigQuery and use the same approach as above:
import saspy
sas = saspy.SASsession(cfgname='httpsviya', cfgfile='<path to your configuration file>')
log=sas.saslib(libref='BQLIBPY', engine='bigquery',
options = 'cred_path="/export/home/users/norlsk/sa_key.json" project="sas-gelsandbox" schema="sas_innovate" bulkload=YES scanstringcolumns=yes')
sas.sd2pq(parquet_file_path ='/export/home/users/norlsk/datafromBQ.parquet', table='TOY_SUMMARY', libref='BQLIBPY')
With the Parquet files in place, you can then upload them to GCS. If a file is located on your own laptop, you can simply upload it directly to your bucket using the Google Cloud Storage user interface, as shown below:
Within SAS, you can save the parquet file to GCS similarly to how you save your sas7bdat files:
/* Writing our local parquet files out to GCS */
filename sasdata "/export/home/users/norlsk/data.parquet" recfm=n lrecl=256 ;
filename sasbq "/export/home/users/norlsk/datafromBQ.parquet" recfm=n lrecl=256 ;
/* The files to be created will be accessible on GCS */
filename gcssas gcs "data/datafromsas.parquet"
key="/export/home/users/norlsk/sa_key.json"
bucket="norlsk-fans-demo" recfm=n lrecl=256 ;
filename gcsbq gcs "data/datafromBQ.parquet"
key="/export/home/users/norlsk/sa_key.json"
bucket="norlsk-fans-demo" recfm=n lrecl=256 ;
options msglevel=i ;
/* Copy SAS data and BigQuery data as Parquet files to GCS */
data _null_ ;
rc=fcopy('sasdata','gcssas') ;
rc=fcopy('sasbq','gcsbq') ;
run ;
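To close the loop, the files copied above can be read straight back with the Parquet libname shown earlier. Here is a sketch, assuming the Parquet engine resolves library members against the .parquet files under the given bucket path:
/* A sketch: read the Parquet file just copied to the bucket,    */
/* reusing the Parquet engine and GCS options introduced earlier */
libname gcspq parquet "data"
   storage_platform="gcs"
   storage_bucket_name="norlsk-fans-demo"
   storage_gcs_key_file="/export/home/users/norlsk/sa_key.json";

proc contents data=gcspq.datafromBQ;
run;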
If you are interested in data access through Dataproc and Spark, we have covered Spark-based access in this blog from earlier this year: Data to Databricks? No need to recode - get your existing SAS jobs to ... - SAS Support Communities.
Are you ready to work with SAS data in Google Cloud using BigQuery and Google Cloud Storage? Let us know in the comments what you plan to use it for.
Merry Christmas and a joyful New Year with SAS data on Google Cloud! 😊