Solved: Dataflux Data Management Server 2.6 - Long Running Joins - Performance...

LuisLeiteBR · Posted 03-28-2017 12:55 PM

Hi Guys,

We´re using Dataflux Data Management 2.6 to make a join sequence (Left Join) for 8 Fixed Width files (see the JOB Image attached):

File 1 (VINCULOS) 72 Million Lines (232 bytes each) - The Base File
File 2 (REMUNE) 600 Million Lines (48 bytes each)
File 3 (DECIMO) 100 Million Lines (42 bytes each)
File 4 (HORAS) 190 Million Lines (34 bytes each)
File 5 (CONTRI) 80 Million Lines (60 bytes each)
File 6 (AFASTA) 8 Million Lines (88 bytes each)
File 7 (DESLIGA) 72 Million Lines (116 bytes each)
File 8 (TRABAL) 72 Million Lines (164 bytes each)

Files from 2 to 5 are submitted to a sort and a Surviving Record Identification, that will insure a maximum of 72 million lines for each one, before the Join.

As you can see in the images attached:
- The Expresion red nodes are log counters for each 1 million lines read,
- The orange nodes are for Sorting,
- The Expression green nodes creates a few attributes,
- The Surviving grey nodes uses Filed Rules based on "not Null" for some attributes, with no previous output,
- The Blue Data Joining nodes are one based attribute Left Joins, with no memory load option,
- The Output Fixed File Purple Node has 72 million lines, with 785 bytes each

Using that design, the job runs in 5.5 days in a Dataflux Data Management Server with 128 GB RAM, 20 cores and 13 TB free of disk.

We´re using, as primary definition, the server default values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP and BASE/SORTTHREADS.

So, I´d like to ask:

- Is that successive join design the right approach to this situation? Separating the Joins and creating intermediate files would be a better performance solution?

- Based on the server configuration, what values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP and BASE/SORTTHREADS should I may consider?

- Any suggestions to improve performance for this Job?

Thank you in advance for any contribution to improving the performance of this JOB.

Luis Leite

RonAgresta · Posted 03-30-2017 07:00 AM

HI - here are a couple of things to consider:

The memory defaults for sort and join nodes are really low and you have a lot of memory available. Make some estimates on the amount of RAM you need for your OS, subtract that from 128, and then divide the remainder across any memory options in your join, cluster, sort, and branch nodes. THe survivor node does not need a lot of memory unless you expect that you will have very large match clusters. Watch the units on the memory options. If I remember correctly, some are in bytes while at least one is in megabytes.
There is a log option you can set that will give you node profile metrics for every step in your job. Enabling this will help you determine what nodes in your job are taking the most time. You can then focus memory there.
You're performing a lot of disk I/O by reading from very large files and writing out a very large file. Preferably those files are on the same machine as DM Server/Studio but if they can't be, verify you have optimal network configurations and disk I/O settings.
Because all of the daa sets in your joins are relatively large, I'm not sure you'll gain much by attempting to load one side into memory. This normally works well when one side of the join is much smaller than the other.
Have you considered an alternate approach where you bulk load the data into a database and perform joins there in indexed columns? It may be worth a try even though you would need read data out of the database to perform the survivorship.

Ron

View solution in original post

RonAgresta · Posted 03-30-2017 07:00 AM

HI - here are a couple of things to consider:

The memory defaults for sort and join nodes are really low and you have a lot of memory available. Make some estimates on the amount of RAM you need for your OS, subtract that from 128, and then divide the remainder across any memory options in your join, cluster, sort, and branch nodes. THe survivor node does not need a lot of memory unless you expect that you will have very large match clusters. Watch the units on the memory options. If I remember correctly, some are in bytes while at least one is in megabytes.
There is a log option you can set that will give you node profile metrics for every step in your job. Enabling this will help you determine what nodes in your job are taking the most time. You can then focus memory there.
You're performing a lot of disk I/O by reading from very large files and writing out a very large file. Preferably those files are on the same machine as DM Server/Studio but if they can't be, verify you have optimal network configurations and disk I/O settings.
Because all of the daa sets in your joins are relatively large, I'm not sure you'll gain much by attempting to load one side into memory. This normally works well when one side of the join is much smaller than the other.
Have you considered an alternate approach where you bulk load the data into a database and perform joins there in indexed columns? It may be worth a try even though you would need read data out of the database to perform the survivorship.

Ron

LuisLeiteBR · Posted 03-31-2017 03:04 PM

Hi, Ron.

Thank´s a lot for your guidance!

We´ve been using BYTES as units in memory options. I´ll check them.

The Joins and Sorts are the most time-consuming nodes, so I´ll try some adjustments based on your advice.

Luis Leite

Dataflux Data Management Server 2.6 - Long Running Joins - Performance Adjustment

Re: Dataflux Data Management Server 2.6 - Long Running Joins - Performance Adjustment

Re: Dataflux Data Management Server 2.6 - Long Running Joins - Performance Adjustment

Re: Dataflux Data Management Server 2.6 - Long Running Joins - Performance Adjustment

Catch up on SAS Innovate 2026