Shakti_Sourav
Quartz | Level 8

Hi Team,

 

The Cluster node is taking a long time to merge the output. I have attached a screenshot for your reference. Please suggest how to improve the performance of the Cluster node.

 

Shakti_Sourav_0-1662526809290.png

 

Let me know if any further details are required.

 

Thank You

Shakti

 

4 REPLIES
VincentRejany
SAS Employee
Depending on the size of the data you are processing, you can experiment with different advanced parameters, typically BASE/SORTTHREADS, CHUNKSIZE, and BASE/SORTBYTES.
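For reference, a minimal sketch of how these could look if set as key = value entries (for example in the environment configuration or as advanced node properties; where exactly they are set and the numbers below are assumptions and purely illustrative, and need to be sized against your data volume and available memory):

    BASE/SORTTHREADS = 4
    BASE/SORTBYTES = 536870912

Here 536870912 bytes corresponds to roughly 512 MB for sort operations; the right values depend on the record count and width and on how much memory the host can spare.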
Shakti_Sourav
Quartz | Level 8

I have checked the parameters you suggested. Only the chunk size is specified; all the others are defined as null.

Please suggest what values I should set for BASE/SORTTHREADS, BASE/SORTBYTES, and chunk size for 90 million records.

 

I have attached a screenshot for your reference.

 

Shakti_Sourav_0-1664001306270.png

 

 

VincentRejany
SAS Employee

I found these recommendations:

 

DataFlux Software Environment

Cleansing Transformations

Some transformations done for cleansing purposes can slow down processing times if used incorrectly. For example, as discussed in more detail below, certain processes like standardization may parse data (depending on the data type) prior to generating a standardized version of the data. You can use a “Parsing” node to parse the data first (or maybe the data is already parsed in the input) and then pass tokens directly to the “Standardization (Parsed)” node, side-stepping the parse.

Similarly, there is no need to standardize data prior to generating a match code for it. Match codes have standardizations built into them unique to each data type.  However, if you are performing the standardization because you want to transform the data and persist the change, there may be no way to avoid the standardization.

The number of standardizations, gender guesses, and identification processes you do directly affects performance. Limit their use as much as you can. You may also choose to use simple "Expression" node functions to perform actions like uppercasing a value or trimming a trailing space, rather than using a "Change Case" or "Standardization" node, if the transformations are simple and don't need to be "locale aware."

Parse Resource Level and Depth

Parsing data elements into smaller tokens (turning a full name field into a given name and surname field) can be resource intensive. Longer input data elements—addresses for example—take longer to process.

To improve parsing performance, you can adjust the resource level and scoring depth for each job node that uses parsing. This includes the “Parsing” node, the “Identification Analysis” node, the “Standardization” node, and the “Match Code” node.

Lowering the resource level instructs DataFlux to use less memory, thereby freeing it up for other processes. The scoring depth can also be set lower so that the parsing engine does not consider as many parsing "solutions," which reduces the overall time needed to complete parsing operations.

For relatively clean data, both of these steps can be taken to improve performance. For "dirtier" or more complex data, making these adjustments can lead to incorrectly parsed data elements.

Parse Strategy

As we have learned, data parsing is resource intensive. However, there are steps that can be taken to reduce the burden of parsing on system resources.

If you plan to parse a data element and then generate a match code for it, you are better off parsing the data element first and then using the "Match Codes (Parsed)" node in DataFlux. Otherwise, the data element will be parsed twice: the first pass parses the element into tokens, and the second pass parses it again when generating the match code. If you use the "Match Codes (Parsed)" node, you can pass already parsed data into match code generation and side-step the second parse action.

This advice holds true for gender analysis and standardization processing as well. They both have "Parsed" input options that should be considered if your data is already parsed or if you plan to parse it first in your DataFlux work-flow.

Match Criteria Complexity

Matching criteria refer to the combination of fields chosen for generating match codes bound for clustering and the clustering conditions themselves. Clustering is a word used to describe the action of finding similar data elements in rows of data and grouping those rows together using a unique key to signify that they have a relationship.

If you decide that you need to generate match codes for many fields (match codes are what give you the "fuzzy" match in DataFlux), your DataFlux jobs will run more slowly than if you choose not to use match codes. However, match codes are integral to DataFlux matching, so you will want to take advantage of them; just plan their use wisely.

Clustering conditions are those rules you define that indicate what constitutes a matching record. For example, a common clustering condition for customer data would take into consideration a full name and uniquely identifying parts of an address. The more “AND” and “OR” conditions you use, the harder DataFlux will work to find the matches. This slows performance down and uses more resources.
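As a purely illustrative example (the field names are hypothetical), a rule set such as

    Condition 1: Full Name match code AND Address Line 1 match code AND Postal Code
    Condition 2: Full Name match code AND Phone Number

already requires every row to be evaluated against both conditions, and each additional OR-ed condition adds more comparison work on top of that.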

Survivorship Rule Complexity

Once matching data rows have been identified and clustered, a common activity is creating a single best record (survivor) out of those matching rows. In DataFlux, you can author survivorship rules as complex as you wish, but as they grow in number and complexity, performance will decrease.

DataFlux Process Memory Allocation

If lots of physical memory is available on the computer hosting DataFlux, allocating a good portion of the memory to DataFlux can really increase performance. For example, if you allocate 512MB RAM to DataFlux cluster processing, DataFlux can do a lot more of the work (and in some cases, all of the work) in memory which is very efficient. When DataFlux runs out of memory for the clustering, sorting, and joining of data (three memory intensive processes), it will then begin to write temporary files to disk. This process is much slower than in-memory processing.

The caveat to allocating lots of memory to these kinds of processes is that you can over-allocate the memory by accident. If you give two nodes in the same work-flow 600 MB of memory each and you only have 1 GB available, the DataFlux job will not complete because it cannot allocate the specified amount of memory.

Delaying Memory Allocation

One way to deal with accidental over-allocation of memory is to use the "delay memory allocation" option in "Clustering" nodes. If you have several clustering nodes in a single work-flow, you can instruct DataFlux to allocate the memory only when data reaches each clustering node, rather than pre-allocating the memory for all of the nodes prior to job execution. In this way, the same memory can be reused by multiple clustering processes. The downside to using this option is that if data reaches the second clustering node while the first is still processing, your job will still fail with a memory error, except that the error occurs mid-run instead of prior to execution, as it would if you were not using this option.

Multi-Threaded Sorting/Joining/Clustering

One of the best ways to improve the performance of DataFlux processes that use clustering, sorting, and joining is to enable multi-threaded processing. Data processing can be distributed across cores and threads so that work is done simultaneously, breaking the main process into smaller, more efficient chunks of work.

In some cases we have seen 10x to 20x performance improvements using these options with joining, sorting, and clustering nodes.

It is suggested to set the number of threads to two times the number of cores. You can then specify how much memory each thread is allowed to use. Depending on the resources of the computer hosting DataFlux, it may be possible to allocate quite a bit of memory to each thread.
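As a purely illustrative calculation: on a 4-core host that guideline gives 8 threads, and if you can afford to dedicate 4 GB to that sort or clustering step, that works out to roughly 512 MB per thread. Your own numbers should be derived from the cores and physical memory actually available on the server, alongside whatever the other nodes on the same page are allocated.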

Process Log and Debug Log Disablement

While they can be very useful when designing DataFlux work-flows, the various logs generated by DataFlux (and by database drivers) can slow performance. When you move into the performance-testing phase or into production, enable only the logging level needed for the project.

Certain log items like database driver traces can really reduce performance. Additionally, if you are using DataFlux “dac” logging (data access logging), this can slow performance as well. It should be turned off when not needed for debugging.

Memory Loading Small Tables

Inside DataFlux work-flows you may often find it necessary to join two branches of your work-flow together. When one side of the branch is relatively small and is being joined with a much larger set, you can choose to load the smaller data set fully into memory. This should improve performance for these types of activities. This option is available in the “Data Joining” node of DataFlux.

SQL Query Inputs and Parameterized SQL Queries

Where possible, you should take advantage of most databases’ ability to sort and join very efficiently. While you can certainly sort and join data in DataFlux, using the database to do this has real benefits.

For example, if the data input to a DataFlux work-flow entails reading a subset of data from the database, you could either use a "Data Source" node that reads all rows into your work-flow and then filter them with a "Data Validation" node, or use a "SQL Query" node and write a SQL statement that selects and subsets the data. The latter approach is much more efficient and performs much faster.
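As a rough sketch (the table and column names here are hypothetical), the SQL held by the "SQL Query" node would look something like:

    SELECT cust_id, full_name, address_line1, city, postal_code
    FROM customer_source
    WHERE load_date >= DATE '2022-01-01'
      AND country_code = 'IN'

so that only the rows and columns the work-flow actually needs ever leave the database, instead of every row being read and then discarded by a downstream filter.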

Using these nodes successfully means that a certain amount of performance tuning is required in the database itself.  Proper indexing, partitioning (as needed), compression, and table analysis/statistics are all crucial to solution performance.  With Oracle specifically, hints and related features may be embedded in queries issued from DataFlux nodes.

Text File Outputs

Where possible, significant performance gains can be had under certain circumstances by first writing results that are ultimately bound for a database table to a text file. Use of this technique depends on the context, but, for example, if you process 500K rows of data and then plan to update a database table with those 500K rows, you can first write those rows to a text file in the first page of your batch work-flow. In the second page you can read from that text file and write the updates to the table.

There are several reasons why writing to text files may improve performance but one significant reason is that memory is released and reallocated for each page of a work-flow. If you are performing lots of complex operations on one page, you can spread the processing out to several pages and use more memory per process giving DataFlux a better chance of being able to do more work in memory without resorting to temporary file creation on disk.

Bulk Loading Database Table Inserts

In some cases where many rows of data need to be loaded or updated in a database, you can use the bulk loading feature built into the "Data Target (Insert)" node to drastically improve performance. It does not work with all database types, but it is supported for Oracle, to name one.

You can set the maximum number of rows to bulk load to "50000" and the commit interval to "1." This bulk loads batches of 50,000 rows into the database very quickly, because the frequent round trips to the database server that would occur with smaller packets of data are avoided.
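To put that in perspective against the 500K-row example above, that works out to roughly ten bulk-load batches rather than hundreds of thousands of individual insert round trips.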

Pass-Thru Field Reduction

This is common sense but sometimes forgotten: the more data that passes through your work-flow, the slower it will perform. If you have 20 fields in an input table, it is easy to select all of those fields and pass them through your entire work-flow. But do you really need all of those fields? If you do not plan to output a field or do not need it for an operation in the work-flow, be sure to clear it from your input step. A work-flow that carries only 10 fields will perform more quickly than one with 20 fields.

Expression Code Complexity

The “Expression” node in DataFlux is a great utility tool but can also lead to performance bottlenecks. Inefficiently written expression code can slow down overall job performance. There certainly may be times when you need to use it but you should rely on the other nodes to do the heavy processing where possible since they were designed for performance with specific purposes in mind.

Some common “Expression” node actions that can perform slowly:

  • Regular expressions
  • Grouping
  • The print() function used for each row
  • Inadvertent use of the pushrow() function
  • Reading and writing to/from text file objects incorrectly

Work-Flow Branching

Work-flow branching is a necessary part of building complex batch jobs and real-time services. If these branches can remain independent, there is minimal performance implication (see below). However, it is often the case that data in branches needs to be joined back together, and this can decrease performance if done too much.

The problem can become more acute when working with real-time services or embedded jobs. Because services can only pass back data from one branch to a calling job or application, a join or union is often required to bring back together all desired output in one branch. Try to design your jobs to avoid this performance hit if possible.

It is possible to allocate extra memory for the "Branch" node itself. Similar to other nodes, this node will use physical memory to separate data into individual work-flow segments in DataFlux. When the allocated physical memory is used up, DataFlux will begin to write temporary files to disk. By allocating more memory to branching operations, you should be able to improve performance. But like most suggestions here, this requires a balancing act. You would have to add up memory allocations for all sort, join, cluster, and branch nodes in a single page of your work-flow to understand how much you have to work with for each individual process.

Alternating Union Rows

There is an advanced option in the "Data Union" node called "Alternate" that can be set to "true" to improve performance. If it is set to "false," all rows from one side are cached in memory (and then to disk if memory runs out) and the rows from the other side are appended to the bottom of the complete set. With the option set to "true," rows from both sides are alternated in the output of the node and the caching can be avoided.

Database Environment

Database Bulk Loading

Testing has shown that certain bulk load utilities, such as those found in Oracle and other databases, can be very useful when loading large amounts of data into database tables. Oracle has a utility called SQL*Loader that can load data very quickly. For the initial load of the Customer MDM hub, you might consider processing data through DataFlux to cleanse, cluster, and apply survivorship, but then write that output to text files as previously described. Those text files can then be used as input to SQL*Loader, and the data can be loaded into the database at top speed.

Hub Optimization

Depending on the entities that are tracked in the Customer MDM hub, you may end up with many tables and views to support those entities. It will be critical to set indexes correctly to get the best performance for lookups into the hub both for data updates and for “suppression” type processing.

Setting up entity metadata will also play a part here. Look closely at how data types are defined for each attribute and check that the right type and size are used for each. Avoid defaulting to string(255) if possible; be more specific about the expected size of data coming into the hub. Use data profiling techniques to help set these numbers.

Check to see that attribute relationships (how certain attributes will be transformed into others, for matching purposes for example) are set up correctly as well so no extra attributes are being carried around in the work-flows or the hub database itself.

Database Drivers

Look closely at the way the database drivers are set up for your database type. Certain drivers for Oracle and other databases have settings for optimized performance. For example, you can set a larger “array size” and turn off “enable scrollable cursors” in the Oracle Wire Protocol driver to get some performance benefits. Native drivers will have similar performance settings as well and it is worth reviewing the documentation to see what the optimal settings are.
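As a sketch only, and assuming a DataDirect-style Oracle Wire Protocol ODBC data source (the exact attribute names vary by driver and version, so verify them against your driver documentation), the relevant entries might look like:

    ArraySize=1024
    EnableScrollableCursors=0

A larger array size lets the driver fetch more rows per network round trip, and disabling scrollable cursors removes overhead that forward-only batch work-flows rarely need.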

Don’t forget to disable driver logging and tracing at runtime as leaving them on can dramatically decrease performance.

RacheLGomez123
Fluorite | Level 6
You can add a Clustering node to a data job to create a cluster ID that is appended to each input row. The node accepts any number of field inputs arranged in any number of logically defined conditions. These cluster IDs indicate matches: many rows can share a cluster ID, which indicates that those rows match according to the clustering criteria specified. Match code fields are often used for clustering because they provide a degree of fuzziness in the clustering process. The cluster engine uses 64-bit data types, so it can handle input data sets that may result in over 1 billion clusters.

Note that clustering is the first step in entity resolution. It is used to put groups of related records into clusters by assigning each a cluster ID. Records in a set with the same cluster ID are considered to be in the same cluster. Each cluster is treated as a group that is processed by other entity resolution nodes after the data is clustered. One node that works with clustered data is the Surviving Record Identification node. This node looks at each cluster, chooses a surviving record (which represents an entity), and flags it. After that, the clusters can be written out to an entity resolution file. You can then examine the entity resolution file and manually select the surviving records.

A cluster can be created based on a single entity, where an entity is a single field or a collection of fields concatenated together. It can also be created based on multiple entities, where matches within a condition are made across all entities of that condition. For example, a cluster can be created in which Field 1 is one condition, and Fields 2 and 3 are two separate entities of a second condition. For more information, see Set Clustering Properties.

Once you have added the Clustering node, you can double-click it to open its properties dialog. The properties dialog includes the following elements:

Name - Specifies a name for the node.

Notes - Enables you to open the Notes dialog. You use the dialog to enter optional details or any other relevant information for the input.

Output Cluster ID Field - Specifies the name of the field in which new clusters will be inserted.

Output clusters - Specifies whether the output contains all rows, only rows that are in a cluster by themselves, or only rows that are in a cluster with at least one other row. The Clustering node displays the number of rows it skipped as part of the node's final status, shown at the end of the job run. Note that this count is relevant only when the node is configured to skip single-row or multi-row clusters.

Regards,
Rachel Gomez

