Ande Stelk, Google
A technical discussion of how best to deploy SAS on GCP, touching on Google Compute Engine, storage, and other configuration tips and tricks. We will also cover integration with BigQuery using SAS/ACCESS Interface to BigQuery. The SAS Viya 3.x and SAS 9 platforms are referenced. This presentation is specific to GCP.
More and more businesses are moving to the cloud, and one of the many reasons they choose Google Cloud is its industry-leading data platforms such as BigQuery. The purpose of this paper is to provide a high-level overview of how to migrate SAS from on premises onto Google Cloud Platform (GCP). Because these efforts usually include data platform migrations, general best practices for connecting to BigQuery (BQ) are also discussed.
Typically, GCP is leveraged as Infrastructure as a Service (IaaS) for SAS (version 9.4 or Viya version 3.5) using Google Compute Engine (GCE) server instances. For SAS Viya 4.0, Google Kubernetes Engine (GKE) and Docker containers can be leveraged; however, those deployments are not in the scope of this paper.
Moving SAS into the cloud has many advantages, such as bringing the application closer to its data source(s), avoiding applicable cloud egress charges, and supporting the general IT strategy of moving away from proprietary data centers. That said, there are some considerations to be made.
The highly virtualized infrastructure of any public cloud can impact query performance. If SQL pass-through or SAS Accelerator solutions are currently leveraged on premises, they will no longer be available in GCP. I highly recommend defining specific Service Level Agreements (SLAs) between analyst teams and IT that capture the business reasons queries must complete in a specific timeframe, then conducting testing and building contingency plans to ensure they do. Don't wait until there are issues at the end of the migration; start these discussions proactively. Different storage configurations, use of in-memory technologies, or a blend of SAS and GCP services such as TensorFlow or BigQuery ML are all options to alleviate performance bottlenecks.
Ensure you have contacted both your SAS and Google Cloud representatives. They will be happy to partner with you to design the best architecture. Some recommended discussion points are noted below.
Notes:
SAS recommends the data it consumes (e.g., BigQuery, Cloud Storage, Dataproc) be located in the same region/zone(s) as the GCE instances SAS is installed upon (see the example after these notes).
SAS Grid is not generally deployed on GCP. SAS Grid is part of the 9.x platform and allows workloads to be distributed among multiple compute servers leveraging a shared file system. More information below.
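For example, a BigQuery dataset and a staging Cloud Storage bucket can be pinned to the same region as the SAS instances. This is a minimal sketch; the dataset name sas_data is a placeholder:
#!/usr/bin/env bash
REGION=us-central1
PROJECT_ID=<YOUR PROJECT ID>
# create a BigQuery dataset in the SAS region (--location is a bq global flag)
bq --location=${REGION} mk --dataset ${PROJECT_ID}:sas_data
# create a regional Cloud Storage bucket for staging and transfer files
gsutil mb -p ${PROJECT_ID} -l ${REGION} gs://<YOUR BUCKET>
The following sample script reserves an internal IP address and creates a GCE instance for SAS performance testing: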
#!/usr/bin/env bash
PROJECT_ID=<YOUR PROJECT ID>
CONF_SCRIPT=gs://<YOUR BUCKET>/install.sh
REGION=us-central1
ZONE=${REGION}-a
SUBNET=default
IP_ADDRESS=<my internal IP>
MACHINE_TYPE=n1-highmem-32
INSTANCE_NAME=sas-perf-testing-${MACHINE_TYPE}-local24
# reserve private/internal IP address
gcloud compute addresses create my-vm-ip-address \
--region ${REGION} --subnet ${SUBNET} --addresses ${IP_ADDRESS}
# build the repeated flag list for 24 NVMe local SSDs (375 GB each)
LOCAL_SSD_FLAGS=""
for i in $(seq 1 24); do
  LOCAL_SSD_FLAGS="${LOCAL_SSD_FLAGS} --local-ssd=interface=NVME"
done
gcloud beta compute --project=${PROJECT_ID} instances create ${INSTANCE_NAME} \
--zone=${ZONE} --machine-type=${MACHINE_TYPE} \
--scopes=storage-ro \
--metadata startup-script-url=${CONF_SCRIPT} \
--image=rhel-7-v20200403 --image-project=rhel-cloud \
--boot-disk-size=20GB --boot-disk-type=pd-ssd \
--boot-disk-device-name=${INSTANCE_NAME}-boot-disk \
--private-network-ip=my-vm-ip-address \
--min-cpu-platform="Intel Skylake" \
--reservation-affinity=any \
${LOCAL_SSD_FLAGS}
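Once the command returns, the instance can be verified while the startup script does its work; this check reuses the variables defined above:
gcloud compute instances describe ${INSTANCE_NAME} --zone=${ZONE} \
  --format="value(name,status,machineType)"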
Here is a sample startup script:
#!/usr/bin/env bash
CONF_BUCKET=gs://<YOUR BUCKET>
cd /tmp/
# get all required packages
sudo yum install bc time wget expect -y
sudo wget http://ftp.sas.com/techsup/download/ts-tools/external/SASTSST_UNIX_installation.sh -O /tmp/SASTSST_UNIX_installation.sh
sudo wget http://mirror.centos.org/centos/6/os/x86_64/Packages/xfsprogs-3.1.1-20.el6.x86_64.rpm -O /tmp/xfsprogs-3.1.1-20.el6.x86_64.rpm
sudo chmod +x /tmp/SASTSST_UNIX_installation.sh
sudo yum localinstall /tmp/xfsprogs-3.1.1-20.el6.x86_64.rpm -y
# fetch and unpack the SAS tools bundle from Cloud Storage
sudo gsutil cp ${CONF_BUCKET}/sas_tools.tar /tmp/
sudo tar xvf /tmp/sas_tools.tar
echo "Check for nvme"
lsblk|grep nvme
if [[ $? -eq 0 ]]
then
echo "Setting up nvm array"
nvm_arr_size=`lsblk|grep nvme|wc -l`
nvm_arr=`for drive in $(seq 1 $nvm_arr_size); do printf "/dev/nvme0n$drive " ; done`
sudo mdadm --create /dev/md0 --level=0 --raid-devices=$nvm_arr_size $nvm_arr
echo "Creating xfs fs for local-ssd"
sudo mkfs.xfs /dev/md0 ; sudo mkdir /mnt/sasfs ; sudo mount /dev/md0 /mnt/sasfs
else
echo "Creating xfs fs for pd-ssd"
sudo mkfs.xfs /dev/sdb ; sudo mkdir /mnt/sasfs ; sudo mount /dev/sdb /mnt/sasfs
fi
if [[ -d /mnt/sasfs ]]
then
    sudo chmod a+w /mnt/sasfs
    # kick off the SAS I/O throughput test against the new filesystem
    sudo /tmp/sas_tools/rhel_iotest.sh -t /mnt/sasfs &
fi
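Once the instance boots, it is worth confirming the array and mount before pointing SAS at it. Keep in mind that local SSD is ephemeral, so anything under /mnt/sasfs is lost when the instance stops or is deleted. A quick check (device and mount names assume the script above):
# confirm the RAID 0 array assembled with all local-ssd members
cat /proc/mdstat
sudo mdadm --detail /dev/md0
# confirm the xfs filesystem is mounted with the expected striped capacity
df -hT /mnt/sasfs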
SAS/ACCESS engines (connectors) for data sources may be licensed individually; to connect to BQ, either SAS/ACCESS Interface to ODBC or SAS/ACCESS Interface to BigQuery is required. For most customers, the latest version of SAS/ACCESS Interface to BigQuery is preferred over SAS/ACCESS Interface to ODBC. The latter requires additional steps to install and configure the ODBC driver for BigQuery, which is available for download from the GCP ODBC and JDBC drivers for BigQuery page. However, if other databases such as Cloud SQL will be used with SAS in addition to BigQuery, SAS/ACCESS Interface to ODBC might be a better fit.
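As a minimal sketch of the native connector, a LIBNAME statement along these lines establishes the connection. The project, dataset, and key-file values are placeholders, and the option names (PROJECT=, SCHEMA=, CRED_PATH=) should be verified against the SAS/ACCESS documentation for your release:
/* hypothetical values; assumes SAS/ACCESS Interface to BigQuery is licensed */
libname gbq bigquery
   project="<YOUR PROJECT ID>"        /* GCP project that owns the dataset */
   schema="my_dataset"                /* BigQuery dataset to surface       */
   cred_path="/path/to/sa-key.json";  /* service account credentials file  */
/* quick smoke test: list the tables SAS can see through the engine */
proc datasets lib=gbq;
quit;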
Write versus read requirements, table width (e.g., number of columns), and column width (string length in particular) may necessitate different CAS configuration optimizations. The more data analyzed in memory, the more robust the cluster and the larger the CAS cache need to be. In general, the following are recommended best practices when using SAS Studio or SAS Visual Analytics with BigQuery, but specific customer requirements should always be taken into consideration as well.
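As an illustration on the Viya side, a CAS connection through the BigQuery data connector might look like the sketch below. The caslib option names (srctype=, credfile=, project=, schema=) follow SAS data connector conventions but should be checked against the SAS Viya documentation for your release:
/* hypothetical values; assumes the SAS Data Connector to Google BigQuery */
caslib gbq datasource=(srctype="bigquery",
                       credfile="/path/to/sa-key.json",
                       project="<YOUR PROJECT ID>",
                       schema="my_dataset");
/* load a table into CAS memory so repeated analyses avoid round trips to BQ */
proc casutil;
   load casdata="my_table" incaslib="gbq"
        casout="my_table" outcaslib="gbq";
quit;
Loading a table into CAS once and reusing the in-memory copy is usually a better pattern for interactive SAS Visual Analytics work than querying BigQuery on every interaction.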
I hope you found this information useful, and many thanks to all the SAS and GCP teams who contributed to it. SAS and GCP technologies and services are fast changing, so always verify the information in this article against the most current documentation available from SAS and Google Cloud.
Your comments and questions are valued and encouraged. Contact the author at:
Ande Stelk
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.