Hi Guys,
We´re using Dataflux Data Management 2.6 to make a join sequence (Left Join) for 8 Fixed Width files (see the JOB Image attached):
File 1 (VINCULOS) 72 Million Lines (232 bytes each) - The Base File
File 2 (REMUNE) 600 Million Lines (48 bytes each)
File 3 (DECIMO) 100 Million Lines (42 bytes each)
File 4 (HORAS) 190 Million Lines (34 bytes each)
File 5 (CONTRI) 80 Million Lines (60 bytes each)
File 6 (AFASTA) 8 Million Lines (88 bytes each)
File 7 (DESLIGA) 72 Million Lines (116 bytes each)
File 8 (TRABAL) 72 Million Lines (164 bytes each)
Files from 2 to 5 are submitted to a sort and a Surviving Record Identification, that will insure a maximum of 72 million lines for each one, before the Join.
As you can see in the images attached:
- The Expresion red nodes are log counters for each 1 million lines read,
- The orange nodes are for Sorting,
- The Expression green nodes creates a few attributes,
- The Surviving grey nodes uses Filed Rules based on "not Null" for some attributes, with no previous output,
- The Blue Data Joining nodes are one based attribute Left Joins, with no memory load option,
- The Output Fixed File Purple Node has 72 million lines, with 785 bytes each
Using that design, the job runs in 5.5 days in a Dataflux Data Management Server with 128 GB RAM, 20 cores and 13 TB free of disk.
We´re using, as primary definition, the server default values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP and BASE/SORTTHREADS.
So, I´d like to ask:
- Is that successive join design the right approach to this situation? Separating the Joins and creating intermediate files would be a better performance solution?
- Based on the server configuration, what values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP and BASE/SORTTHREADS should I may consider?
- Any suggestions to improve performance for this Job?
Thank you in advance for any contribution to improving the performance of this JOB.
Luis Leite
HI - here are a couple of things to consider:
Ron
HI - here are a couple of things to consider:
Ron
Hi, Ron.
Thank´s a lot for your guidance!
We´ve been using BYTES as units in memory options. I´ll check them.
The Joins and Sorts are the most time-consuming nodes, so I´ll try some adjustments based on your advice.
Luis Leite
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.