SAS High-Performance Analytics tip #2: HPDM nodes in SAS Enterprise Miner


An earlier tip introduced the architecture of SAS High-Performance Analytics (HPA) and the terminology surrounding it. This tip dives into the inner workings of the SAS Enterprise Miner nodes under the High Performance Data Mining (HPDM) tab, also referred to as HPDM nodes. It covers how to configure an HPA environment, how data is loaded into it, and how SAS Enterprise Miner behaves when you mix HPDM and non-HPDM nodes in a flow.

How to configure HPA

The first step is to configure and validate the connection between SAS Enterprise Miner and the distributed computing environment where HPA is installed. To configure the connection, set the GRIDHOST and GRIDINSTALLLOC options in the Project Start Code property of the SAS Enterprise Miner project. Note that you need a SAS High-Performance Data Mining license to use SAS Enterprise Miner in Massively Parallel Processing (MPP) mode.

option set = GRIDHOST="rdu001.unx.sas.com";
option set = GRIDINSTALLLOC="/opt/v940m3/INSTALL/TKGrid_REP";

GRIDHOST specifies the name node of the distributed computing environment, and GRIDINSTALLLOC specifies the install location of the HPA software.
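As a quick sanity check, you can echo the two values back to the log. This is a minimal sketch, assuming the OPTION SET= statements above have already run in the same SAS session; OPTION SET= defines environment variables, and %SYSGET reads them back.

```sas
/* Echo the environment variables defined by OPTION SET= to the log. */
/* If a variable was never set, %SYSGET issues a warning instead.    */
%put NOTE: GRIDHOST=%sysget(GRIDHOST);
%put NOTE: GRIDINSTALLLOC=%sysget(GRIDINSTALLLOC);
```

This only confirms that the values are set in the session. It does not confirm that the host is reachable, which is what the validation step that follows is for.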

To validate that SAS Enterprise Miner is accessing HPA on rdu001, use the HPATEST procedure as shown below. In SAS Enterprise Miner, drag a SAS Code node onto the diagram and copy the following code snippet into the Code Editor window.

proc hpatest data=sashelp.iris;
  performance nodes=all;
run;

If SAS Enterprise Miner can access the HPA environment, you will see the following information in the Results >> Output window. Note that the IRIS data set is from the SASHELP library and is therefore available in SAS Enterprise Miner.

            Performance Information
 
Host Node                  rdu001.unx.sas.com
Execution Mode             Distributed
Number of Compute Nodes    8

In addition, a note in the log provides similar information.

NOTE: The HPATEST procedure is executing in the distributed computing environment with 8 worker nodes.
NOTE: There were 150 observations read from the data set SASHELP.IRIS.
NOTE: The PROCEDURE HPATEST printed page 2.
NOTE: PROCEDURE HPATEST used (Total process time):
      real time           9.62 seconds
      cpu time            6.19 seconds

If SAS Enterprise Miner cannot access the HPA environment, the log contains a note saying that the procedure is executing in single-machine mode, as shown below. When this happens, correct your installation or configuration settings before proceeding.

NOTE: No distributed computing environment detected. The NODES=ALL option in the PERFORMANCE statement is ignored. Use the OPTION SET=GRIDHOST=' ' statement or the GRIDHOST= option in the PERFORMANCE statement to specify the host for distributed computing.
NOTE: The HPATEST procedure is executing in single-machine mode.
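The note above mentions a second way to name the host: directly on the PERFORMANCE statement. The sketch below assumes that HPATEST accepts the same HOST= and INSTALL= options that the PERFORMANCE statement of other high-performance procedures documents. The host name and install path are simply the example values used earlier in this tip.

```sas
/* Sketch: name the grid host and install location on the PERFORMANCE */
/* statement instead of relying on OPTION SET=.                       */
/* The host and path are this tip's example values; use your own.     */
proc hpatest data=sashelp.iris;
  performance nodes=all
              host="rdu001.unx.sas.com"
              install="/opt/v940m3/INSTALL/TKGrid_REP";
run;
```

This can help you tell a problem with the Project Start Code apart from a problem with the grid itself.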

How data moves in HPA

First, let's define a few terms:

(1) “Data appliance” refers to the database software (Oracle, Teradata, and so on) or HDFS (Hadoop Distributed File System) running on a cluster of nodes where data is stored in a distributed fashion (one file partitioned and stored on multiple nodes).

(2) “Compute appliance,” or distributed computing environment, refers to a cluster of nodes where analytical tasks are performed on distributed data.

In a typical customer environment, the data and compute appliances are kept separate because the data appliance must offer high availability and fast response times, and you do not want to overburden it with analytical computations.

When performing an analytical task, the first step is to move the data to the compute appliance where HPA is installed and configured. The input data can typically reside at one of three locations – on the local client, the data appliance, or the compute appliance. When the data and compute appliances are different, the data moves on the network, in parallel (if distributed), to the compute appliance. On the other hand, if the data and compute appliances are on the same cluster of nodes and all the nodes in the cluster are used for computing, then there is no movement of data and hence a faster load operation results.

Note that the data and compute appliances do not need to have the same number of nodes, and the compute appliance does not need to use all the nodes when it is co-located (on the same nodes) with the data appliance. In such situations, data moves over the network to reach the compute appliance.

Finally, HPDM nodes run in the HPA environment (MPP mode) when the input data comes from the data appliance. Conversely, they run in single-machine mode (SMP) when the data resides locally on the SAS Enterprise Miner server.
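Outside of the HPDM nodes, you can see the same two modes at the procedure level. The sketch below is an illustration rather than what SAS Enterprise Miner runs internally. It assumes that the HPLOGISTIC procedure is licensed at your site, and it uses the SASHELP.HEART sample data. NODES=0 requests single-machine mode even when a grid host is configured, and DETAILS adds a timing table to the output.

```sas
/* Sketch: force single-machine (SMP) execution with NODES=0.       */
/* Change NODES=0 to NODES=ALL to request distributed (MPP)         */
/* execution on a configured grid and compare the Performance       */
/* Information tables of the two runs.                              */
proc hplogistic data=sashelp.heart;
  class sex;
  model status(event='Dead') = sex ageatstart;
  performance nodes=0 details;
run;
```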

Sample in SAS Enterprise Miner

When HPA is enabled and running in MPP mode, SAS Enterprise Miner internally creates a data sample for the Input Data node. The Sample and Sample Options properties of the Input Data node can be used to change the size of this sample, which by default is 10K observations.
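To get a feel for what a client-side sample of this size looks like, the sketch below draws a simple random sample of 10,000 observations from a synthetic table. This tip does not say how SAS Enterprise Miner draws its internal sample, so PROC SURVEYSELECT, the table names WORK.FULL and WORK.SAMPLE, and the seed are all assumptions made for illustration.

```sas
/* Build a synthetic "full" table of 100,000 rows (hypothetical data). */
data work.full;
  call streaminit(2024);
  do id = 1 to 100000;
    x = rand("normal");
    output;
  end;
run;

/* Draw a simple random sample of 10,000 observations, matching the */
/* default sample size mentioned above.                             */
proc surveyselect data=work.full out=work.sample
                  method=srs sampsize=10000 seed=2024;
run;
```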

One reason SAS Enterprise Miner creates a sample on the client when running in a distributed environment is to support certain operations in process flow diagrams that cannot be executed on the full data. For example, the sample is used when you click the Imported Data and Exported Data properties of the node. In addition, when you mix HPDM and non-HPDM nodes in a flow, the HPDM nodes use the full data and run in the distributed environment, while the non-HPDM nodes use the sample and run on the client**.

Also, when you mix HPDM nodes with non-HPDM nodes, SAS Enterprise Miner does not permit ambiguous connections within a flow. In Figure 1 below, the “Invalid Link” error is displayed when you try to connect the Filter node to the HP Regression node. The failure occurs because the Filter node, being a non-HPDM node, would use the sample, while the HP Regression node would use the full data.

**The term client refers to the SAS session used by SAS Enterprise Miner. This session acts as a client to the distributed environment where the actual processing of the HPDM node(s) is done.

A subset of nodes like the Metadata, Model Comparison and Score nodes can be used with non-HPDM nodes and HPDM nodes. But when mixing other non-HPDM nodes with HPDM nodes, keep in mind the sample that is persisted and used by SAS Enterprise Miner.

Lastly, after a modeling HPDM node runs, fit statistics are computed on both the sample and the full data. This enables model comparison across the mix of HPDM and non-HPDM nodes in the flow.

Model Comparison node: Selection Data property

The Model Comparison node compares multiple incoming models to pick a champion. The Selection Data property in the Model Comparison node is used to specify the fit statistics used to compare and select the champion model. The values for this property are: Default, Sample and Grid.

When set to Default, fit statistics based on full data are used if all incoming models are from HPDM nodes. On the other hand, if the models are a mix of HPDM and non-HPDM nodes or from non-HPDM nodes only, then fit statistics are based on the sample data. If the Selection Data property is set to Sample, then fit statistics are based on sample data irrespective of the type of the incoming nodes. Lastly, when set to Grid, fit statistics are based on full data from the HPDM nodes; the non-HPDM nodes are ignored in this case.

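The following table summarizes which data the fit statistics are based on for each setting:

Selection Data   All models from HPDM nodes   Mix of HPDM and non-HPDM nodes        All models from non-HPDM nodes
Default          Full data                    Sample                                Sample
Sample           Sample                       Sample                                Sample
Grid             Full data                    Full data (non-HPDM models ignored)   Non-HPDM models ignored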

With the HPDM nodes, SAS Enterprise Miner can be extended to use the distributed computing environment when big data calls for it. Although this tip centered on using HPDM nodes in a distributed environment, note that they provide a performance boost (compared to their non-HPDM counterparts) even in a single-server environment by taking advantage of the threading capabilities of the available CPUs/cores.

The next tip in this series will walk through a modeling example using HPDM nodes.