Top 10 Data Management Process-Tuning Techniques

5 Likes

In a previous article, The Node Less Traveled: Improving Job Performance with Node Profile Logs, I described the use of node profile logging to help pinpoint underperforming areas of SAS Data Management Platform jobs. In that article, however, I only hinted at corrections that could be made to eliminate inefficient code and incorrect settings that might be contributing to the problem.

Since I don’t want to leave you stranded in your attempts to optimize Data Management Platform jobs, I’ve compiled a “top 10 list” of process-tuning techniques. With these tips, you’ll have your data management jobs performing at top speed in no time.

Parse First: Data parsing can be an expensive operation depending on the nature of the data you are attempting to transform. If, as part of your data quality processing, you plan to standardize data and then generate match codes for it, consider parsing the data in an initial step and then use the Standardization (Parse) and Match Codes (Parsed) nodes in subsequent steps. Choosing the more typically used Standardization and Match Codes nodes together instead (not their “Parsed” versions) will result in parsing the data twice as that operation will occur behind the scenes for both of these nodes with most of the standardization or match definitions you might decide to use.
Explore Enrichment Options: Data enrichment in the Data Management Platform refers to address verification, geocoding and phone number area code validation. These processes can be resource intensive. Depending on the combination of enrichment processes selected, you may see a noticeable decrease in job performance. To mitigate the additional expense of data enrichment processing, be sure to take advantage of a few options that are designed to improve performance. In some address verification nodes, for example, you can load the enrichment reference data or indexes into memory; this results in faster lookup times. You may also choose to use “distributed” versions available for some enrichment nodes that can offload processing to a secondary server, freeing up resources on the primary server for the rest of the job flow processing.
Try Parallelization: The process job layer of the Data Management Platform is perfect for executing discrete units of work in parallel processes using the Parallel Iterator node. Instead of cleansing 1 million rows of data in a single thread, break the data up into 10 equal chucks and parallelize the work across multiple threads. The gains found using this technique can be substantial if your server has untapped resources that can be exploited.
Hone Expression Code: The Expression node in the Data Management Platform acts like a Swiss Army knife—it can help you get the job done when other tools aren’t available. Because the expression language functions as a basic programming language, you have the ability to author simple or sophisticated custom logic in the Expression node. This flexibility won’t prevent you from writing inefficient code that can inadvertently slow down overall processing times. Where possible avoid the following pitfalls:
- Don’t use the dbconnect object in such a way that database connections and query preparation occur on every data row. Do that work in the Pre-processing step of the Expression node instead.
- Along the same lines, avoid compiling your regular expressions on every data row unless data row values make up part of the regular expression pattern logic. In addition, poorly constructed regular expressions may spend too much time working through patterns so find time to test and tune them before putting them into production.
- Use the pushrow function judiciously. The pushrow function will create additional rows in your data flow and it is sometimes used to summarize values after evaluating groups of data rows. Used incorrectly the pushrow function can generate any number of unneeded data rows that will only add to the total number that will ultimately need to be processed.
Allocate Memory: Apportion as much RAM as the host system can support to Data Sorting, Data Joining, Branch and Clustering nodes. Performing these operations fully in memory can dramatically improve node performance. When system memory is not available or has not been allocated to the correct level, Data Management Platform jobs will start to write temporary files to disk, which is much slower than memory-based processing. Also consider using multiple threads for processing. Each of these nodes except for the Branch node has a sort threads option. A good rule of thumb is to set the number of threads to twice the number of processor cores.
Minimize Branching, Joining, Unions and Sorts: You may not be able to completely eliminate the need to use join, sort and union logic. Likewise, it may be impossible to totally remove the need to use workflow branches in your process or data jobs – but you should strive to minimize them. The overuse of these nodes is notorious for making otherwise speedy processes unacceptably slow. If you find that project requirements dictate the use of these nodes, keep in mind what you have just learned about memory allocation and consider these additional tips:
- Because data is streamed through all nodes simultaneously in a Data Management Platform job rather than completely running through each node before being passed to the next, disable the “land data locally” option in Branch nodes unless you specifically need to make sure all data is fully processed before moving to the next node.
- For Data Union nodes, unless you have a reason to keep data in distinct sets from the opposite sides of the union, set the Alternate advanced property to “true,” which will improve performance.
- There is a memory load option for Data Joining nodes that lets you load one of the tables involved in the join into memory for improved performance. This works best if the table you choose to load into memory is relatively small compared to the other table in the join.
Size Data Fields Appropriately: More data moving through a job slows down performance. Set up your jobs to accommodate only data fields needed for processing. Configure fields with the minimum size needed for the data (profile the data first to help with this task). Text File Input nodes, for example, will default to 255 characters for STRING fields. Adjust the lengths to fit the data rather than relying on these defaults. When you are through using selected fields in the logic of your data flow, use the Field Layout node or the Additional Outputs option found in some nodes to prevent no longer needed fields, and the data they contain, from being passed to subsequent steps in the job.
Tune the Database: Particularly at larger data volumes, slow database interactions can really hinder attempts to speed up overall job performance. If your data management job makes use of a database for any part of its data flow, be sure to design your database structure in way that is appropriate to your project (on hardware that fits requirements) and then tune it as needed for optimization. Common activities done through database management tools to increase performance might include adding table indexes, optimizing queries, or regenerating database statistics after data inserts or updates. Modifying database-related settings in various Data Management Platform nodes like the Data Target (Insert) node can also help. Commit Interval and Bulk Insert options are available to let you control the commitment level of data moving between the database and the Data Management Platform. Larger commit intervals and bulk insert options will often improve performance at the expense of more manageable database rollbacks if something fails during the insert or update process.
Evaluate OS and Network Performance: Don’t forget to measure various aspects of your host hardware, software, and network environment to make sure it’s up to the task. Some of the typical issues that could negatively impact data management job performance are highly fragmented disk storage, active virus protection software, firewall settings, network bottlenecks, overworked CPUs and inadequate database sizing or responsiveness. If you believe your Data Management Platform job to be suffering from negative external factors, work with the IT staff in your organization to rule out these kinds of issues.
Review Match Conditions: One of the larger drains on performance for jobs that perform matching using the Clustering node is specification of a very large number of matching conditions. More conditions generally mean slower jobs so remove redundant rules or rules that are unlikely to identify additional matches due to sparsely populated data. The clustering node will also temporarily write any fields passed through the node (but not used in clustering conditions) to disk during the clustering process. As a consequence, where possible you should limit the number of additional non-match condition fields you expose to the Clustering node.

There you have it—a list of some of the most effective ways to speed up data management job performance. Many of these techniques don’t even require reworking of you job logic; you only have to change a few node options. Some of these tips do however require a rethinking of the job data flow. Understanding that “forewarned is forearmed,” review these suggestions before the design phase of your next project and you will be running at top speed from the very beginning.

Do you have any additional tips? Have you used any pointers mentioned here and seen improvements? Share your thoughts below.

Top 10 Data Management Process-Tuning Techniques

Free course: Data Literacy Essentials

Get Started