You’ve just designed a complex data management process in DataFlux Data Management Studio (a client application in all SAS Data Management offerings) for your organization. It has over a hundred nodes (data access or transformation logic), performs intricate data quality corrections on millions of rows in minutes and provides data to several key operational systems. Your project manager just told you that the process needs to handle twice the number of rows but finish in half the time using the same hardware. You consider the consequences of jumping on the first bus to Cleveland, never to be heard from again.
But don’t buy that ticket just yet. A close inspection of the design of the data management process may lead you to places in the data workflow where logic can be optimized, transformations reworked, and redundant analysis removed. The trick is understanding where the potentially underperforming nodes are located in your large data management process. Standard Data Management Platform logs provide a wealth of information but do not record how long a process is spending in each discrete node of the job. There is, however, a node profile log feature that once enabled delivers this type of information for both Data Management Studio and Data Management Server.
The node profile log option generates detailed statistics logs for nodes in process and data jobs. As you run jobs or data services in the Data Management Platform environment, the log captures how much time is spent in operations for each node. It further breaks this down into segments for cumulative time spent processing rows, time spent preparing the node and time spent pre-executing logic for the node. These logs are extremely useful for finding performance bottlenecks in data management processes.
There are two node profile log options:
Once your system is generating these logs, they become invaluable tools to help you pinpoint potential performance issues. Here are some of the things you can look for in the log data to locate trouble spots in your processes.
To turn on node profile logging functionality, edit the appropriate Data Management Server or Studio configuration file (service.cfg, batch.cfg, or app.cfg) by adding the following:
You’ll also need to add the code shown below to the log file format XML file to make sure the data is logged in the correct output file. Refer to Data Management Platform documentation to verify the correct XML file (it will correspond to the configuration file described in the previous paragraph – service.log.xml, batch.log.xml, or dfwfproc.log.xml). The details of the appender can be altered to meet your needs. Both logging options will write to the log under the DF.RTProfiler heading at trace level.
To turn this logging on for Data Management Studio for example, add one or both of the PROFILEBYNODE and PROFILEBYTYPE options to your app.cfg file and add the following XML to your dfwfproc.log.xml file:
<appender class="RollingFileAppender" name="ProfFile">
<param name="Append" value="false"/>
<param name="ImmediateFlush" value="true"/>
<param name="ConversionPattern" value="%m"/>
Upon restarting Data Management Studio and after running a process, this will create a “profiler_job_[datetime stamp].log” file in the “C:\Users\[user]\AppData\Roaming\DataFlux\DMStudio\studio1” directory, which is the default logging directory for Data Management Studio.
Note that on Windows computers, use of node profile logging will disable correct functioning of process jobs that use parallel processing. Under these circumstance, you may see that where you expected two or more active threads to be running simultaneously, only one appears to be utilized. Be sure to disable this logging option for production environments.
For the Node Type log option, the output will be comma delimited and the columns are as follows:
For the Node Instance log option the output will be comma delimited and the columns are as follows:
Here is an example of the Node Type profile log with column headers added for clarity:
OPERATION, NODE TYPE, TIME, ENTRIES
Here is an example of the Node Instance profile log with column headers added for clarity:
OPERATION, FILENAME, NODE INSTANCE, TIME, ENTRIES
The format of a node profile log lends itself to further analysis in your favorite spreadsheet or reporting application. Here we see the details of a node type profile log on the left with a pivot table and chart on the right linked to the data table on the left. The pivot table combines timings on modes of operation for each node to give a full picture of the time spent in each node type. The chart is linked to the pivot table and gives a graphical representation of the log data, allowing you to quickly see what types of nodes account for the bulk of the processing time in the associated job. In this example, it looks like someone needs to tune a database query judging by the time spent in parameterized query (ODBC_PARAM_QUERY) nodes.
Getting the most out of your data management processes is a must. Enabling node profile logging makes it much easier to work through the steps of locating and correcting underperforming areas of your data management jobs. With some deft analysis and corrective rework of your data management process, you’ll notice a substantial increase in job performance. With improved processes, you’re well on your way to earning a much-deserved vacation to sunny Miami courtesy of the same manager whose project you just rescued.