Hi Guys,
We´re using Dataflux Data Management 2.6 to make a join sequence (Left Join) for 8 Fixed Width files (see the JOB Image attached):
File 1 (VINCULOS) 72 Million Lines (232 bytes each) - The Base File
File 2 (REMUNE) 600 Million Lines (48 bytes each)
File 3 (DECIMO) 100 Million Lines (42 bytes each)
File 4 (HORAS) 190 Million Lines (34 bytes each)
File 5 (CONTRI) 80 Million Lines (60 bytes each)
File 6 (AFASTA) 8 Million Lines (88 bytes each)
File 7 (DESLIGA) 72 Million Lines (116 bytes each)
File 8 (TRABAL) 72 Million Lines (164 bytes each)
Files from 2 to 5 are submitted to a sort and a Surviving Record Identification, that will insure a maximum of 72 million lines for each one, before the Join.
As you can see in the images attached:
- The Expresion red nodes are log counters for each 1 million lines read,
- The orange nodes are for Sorting,
- The Expression green nodes creates a few attributes,
- The Surviving grey nodes uses Filed Rules based on "not Null" for some attributes, with no previous output,
- The Blue Data Joining nodes are one based attribute Left Joins, with no memory load option,
- The Output Fixed File Purple Node has 72 million lines, with 785 bytes each
Using that design, the job runs in 5.5 days in a Dataflux Data Management Server with 128 GB RAM, 20 cores and 13 TB free of disk.
We´re using, as primary definition, the server default values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP and BASE/SORTTHREADS.
So, I´d like to ask:
- Is that successive join design the right approach to this situation? Separating the Joins and creating intermediate files would be a better performance solution?
- Based on the server configuration, what values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP and BASE/SORTTHREADS should I may consider?
- Any suggestions to improve performance for this Job?
Thank you in advance for any contribution to improving the performance of this JOB.
Luis Leite
HI - here are a couple of things to consider:
Ron
HI - here are a couple of things to consider:
Ron
Hi, Ron.
Thank´s a lot for your guidance!
We´ve been using BYTES as units in memory options. I´ll check them.
The Joins and Sorts are the most time-consuming nodes, so I´ll try some adjustments based on your advice.
Luis Leite
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.