Hi guys,

We're using DataFlux Data Management 2.6 to run a sequence of Left Joins over 8 fixed-width files (see the job image attached):

File 1 (VINCULOS): 72 million lines (232 bytes each) - the base file
File 2 (REMUNE): 600 million lines (48 bytes each)
File 3 (DECIMO): 100 million lines (42 bytes each)
File 4 (HORAS): 190 million lines (34 bytes each)
File 5 (CONTRI): 80 million lines (60 bytes each)
File 6 (AFASTA): 8 million lines (88 bytes each)
File 7 (DESLIGA): 72 million lines (116 bytes each)
File 8 (TRABAL): 72 million lines (164 bytes each)

Files 2 through 5 go through a sort and a Surviving Record Identification step, which ensures a maximum of 72 million lines for each of them before the join. As you can see in the attached images:

- The red Expression nodes are log counters that report every 1 million lines read.
- The orange nodes perform the sorting.
- The green Expression nodes create a few attributes.
- The grey Surviving Record nodes use Field Rules based on "not null" for some attributes, with no previous output.
- The blue Data Joining nodes are Left Joins based on a single attribute, with the memory-load option disabled.
- The purple Output Fixed File node writes 72 million lines of 785 bytes each.

With this design, the job runs in 5.5 days on a DataFlux Data Management Server with 128 GB of RAM, 20 cores, and 13 TB of free disk. We are currently using the server default values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP, and BASE/SORTTHREADS (see the sketch at the end of this post).

So I'd like to ask:

- Is this successive-join design the right approach for this situation, or would separating the joins and creating intermediate files perform better?
- Given the server configuration, what values should I consider for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP, and BASE/SORTTHREADS?
- Any other suggestions to improve the performance of this job?

Thank you in advance for any contribution to improving the performance of this job.

Luis Leite
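P.S. For scale: REMUNE alone is roughly 600 million lines × 48 bytes ≈ 28.8 GB, so sort memory and temp space are significant here. Below is a minimal sketch of how I understand these options would appear in the server's app.cfg; the values and the temp path are illustrative placeholders on my part, not our current settings or recommendations:

    # Placeholder values for illustration only
    BASE/SORTBYTES = 17179869184        # memory per sort, in bytes (16 GB here)
    BASE/SORTTEMP = /dataflux/sorttemp  # scratch directory for sort spill files
    BASE/SORTTHREADS = 4                # number of parallel sort threads
    JOINBYTES = 17179869184             # memory available to a join operation
    CHUNKSIZE = 500000                  # rows handled per processing chunk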