How can you make faster business decisions using big data (think gigabytes or more) with SAS High-Performance Analytics (HPA)? And how does this technology work with SAS Enterprise Miner?
This tip, the first in a series that answers these questions, defines the terms associated with HPA and gives a high-level overview of the SAS products that use distributed computing appliances -- in other words, many machines sharing data and working as a team.
The products available under the SAS High-Performance Analytics suite are designed to reduce the response time of predictive models for big data by exploiting the parallelization techniques possible in a distributed computing environment. For a complete picture, below is the list of high-performance products available. Those of you with SAS Enterprise Miner will need a SAS High-Performance Data Mining license to add this capability.
The suite of products above provides High-Performance (HP) counterparts to commonly used statistical, data mining, and optimization procedures available in SAS products like SAS/STAT, SAS/OR, and so on. For example, the HPREG procedure in SAS High-Performance Statistics is the counterpart of the REG procedure in SAS/STAT.
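To make the pairing concrete, here is a minimal sketch of the same regression fit with both procedures. The data set and variable names (work.sales, revenue, price, adspend) are hypothetical placeholders, not from any SAS sample library:

```sas
/* Traditional procedure from SAS/STAT -- runs on a single machine */
proc reg data=work.sales;
   model revenue = price adspend;
run;

/* High-performance counterpart from SAS High-Performance Statistics.
   Without a distributed environment configured, it runs in SMP mode
   on the local machine using multiple threads. */
proc hpreg data=work.sales;
   model revenue = price adspend;
run;
```

Note how similar the basic syntax is; the differences show up in the supported options and in where the computation actually executes.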
Note that the procedures and their HP counterparts are not exact replicas in syntax or functionality. And even when they do match, the results can vary due to differences in the computing algorithms when data is processed serially on a single machine versus in parallel in a distributed environment. In SAS Enterprise Miner, the HP procedures are exposed as nodes (HP Explore, HP Impute, HP Regression, and so on) under the HPDM (High-Performance Data Mining) tab.
The distributed computing environment or MPP (Massively Parallel Processing) is a cluster (group) of homogeneous nodes (machines) with one head node and multiple worker nodes that divide the data, and work in parallel to achieve fast completion times. This setting is mostly used when data is huge and the model completion times on a single server are unacceptable for business needs.
The single-node counterpart to MPP is SMP, or Symmetric Multiprocessing. In SMP, a single server with multiple CPUs (processors or cores) can perform work in parallel using thread scheduling. The MPP cluster, with many such servers, each with multiple processors, works in tandem as a unified computing environment -- thus increasing the amount of parallelization possible. Another advantage of the MPP cluster is its ability to scale incrementally based on customers' present and future data and compute requirements.
NOTE: In Figure 1 (MPP), the data is distributed across the worker nodes only, but you can configure the head node to store data as well if necessary.
Using the MPP architecture, the High-Performance procedures build faster predictive models by dividing the data and analytic computations across multiple nodes and performing the work in parallel. The more data you have, the more nodes you can provision to keep the response times within acceptable limits. When an analytical task runs using HPA on MPP, multiple processor threads are used on each node in the cluster. The number of threads used on each node typically corresponds to the number of processors available on that node. For example, in a cluster of 8 nodes, each with 4 processors, a total of 32 threads is used to solve the task.
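In code, the execution environment for an HP procedure is typically controlled with the PERFORMANCE statement. Below is a hedged sketch; the host name and install path are hypothetical placeholders for your appliance configuration:

```sas
proc hpreg data=work.sales;
   model revenue = price adspend;
   /* Request distributed (MPP) execution on 8 worker nodes of the
      appliance; DETAILS prints a timing report for the run.
      HOST= and INSTALL= identify the appliance -- values shown here
      are placeholders, often set instead via the GRIDHOST and
      GRIDINSTALLLOC environment options. */
   performance nodes=8 details
               host="myappliance.example.com"
               install="/opt/TKGrid";
run;
```

Omitting the PERFORMANCE statement (or the appliance connection information) generally causes the procedure to fall back to SMP mode on the local machine.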
One more reminder: the HPDM nodes in SAS Enterprise Miner are free to use in SMP mode, but they require a SAS High-Performance Data Mining license when used in MPP mode.
Now that we've defined HPA and its benefits, let's understand how it differs from other solutions like SAS Grid and SAS In-Memory Analytics.
The SAS Grid environment has multiple nodes too, but the setup is different, with a grid control server, multiple grid nodes, a central file server, a metadata server, and so on. SAS Grid manages the workload by distributing SAS programs, or steps (PROC or DATA steps) within a SAS program, to available nodes in the grid. Thus it provides workload balancing and high availability (among other features) in a multi-user environment. Note that the data in this case is typically small to medium sized, and the grid control server sends the SAS program and its associated data to a single grid node (the one with minimal workload at the time) for processing.
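From a SAS session, submitting work to the grid looks like the sketch below, which uses the SAS/CONNECT statements together with the GRDSVC_ENABLE function. The logical grid server name "SASApp" is an assumption; check your metadata configuration for the actual name:

```sas
/* Route subsequent SIGNON requests through the grid workload manager */
%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));

signon task1;                /* the grid picks a lightly loaded node */
rsubmit task1 wait=no;       /* run the step asynchronously on that node */
   proc means data=sashelp.class;
   run;
endrsubmit;

waitfor task1;               /* block until the grid job completes */
signoff task1;               /* release the grid session */
```

The key contrast with HPA is visible here: the whole PROC step runs on one grid node, rather than being split across many nodes.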
NOTE: The word “grid” sometimes refers to a cluster of nodes in HPA terminology; this document uses the term “distributed computing environment” in its place to avoid this confusion.
SAS Enterprise Miner can be configured to enable both HPA and SAS Grid, but you need to decide which method to use for the problem at hand. For projects with a lot of models, or multiple projects from many users with small to medium-sized data, SAS Grid should be used. For projects with big data, you should consider using HPA on MPP.
In the scenario where HPA is enabled using the MPP architecture, SAS Enterprise Miner acts as a client, with the majority of the work done on the MPP hardware. If SAS Grid is also enabled (in addition to HPA), all jobs are sent to the grid control server, which then schedules them on a grid node. In this case, the grid node acts as the client, with the majority of the work still done on the MPP hardware. Remember that this scenario only applies to HPDM nodes that are designed to take advantage of MPP. Finally, if you have both HPA and SAS Grid, know that they are typically architected on separate hardware.
SAS In-Memory Analytics products such as SAS Visual Analytics, SAS Visual Statistics, and SAS In-Memory Statistics use multiple nodes, and their setup is similar to the MPP architecture -- but the big difference is where the data resides. In HPA, the data resides in memory, or in memory and on disk, depending on the size of the data. For SAS In-Memory Analytics products, as the name suggests, the data is entirely in memory (memory distributed across multiple nodes). Hence this option has the fastest response times among the three, but it also places the heaviest demands on memory.
All SAS In-Memory Analytics products come with an underlying server called the LASR Analytic Server. When HPA is installed alongside LASR Analytic Server (on MPP), there is an additional performance gain if the data is already managed by the LASR Analytic Server and is loaded into memory. The initial transfer of data to the HPA process is a memory-to-memory operation which is much faster than the otherwise disk-to-memory operation.
To take advantage of the LASR Analytic Server in this setup, the HPDM_LASR macro variable should be set to Y, as shown below in the SAS Enterprise Miner Project Start Code.
%let HPDM_LASR = Y;
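If the appliance connection information has not been configured elsewhere, it typically accompanies this setting in the Project Start Code. A hedged sketch follows; the host name and install path are hypothetical placeholders for your environment:

```sas
/* Head node of the MPP appliance (placeholder value) */
option set=GRIDHOST="myappliance.example.com";

/* HPA install location on the appliance (placeholder value) */
option set=GRIDINSTALLLOC="/opt/TKGrid";
```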
The options for HPA and SAS Grid are located at different places in the SAS Enterprise Miner user interface. Below are some screenshots for reference.
The macro variables for HPA are found in Project Macro Variables property.
And the available settings for SAS Grid are located in the Options >> Preferences menu.
Each of these solutions (HPA, SAS Grid, and SAS In-Memory Analytics) has its own set of benefits and limitations, but it is important to understand that they have different architectures and fulfill different needs:
Of course you'll choose one or more of these solutions based on your requirements and business needs.
To learn more about the HPDM nodes and their operation in SAS Enterprise Miner, stay tuned for the next tip in this series!