LuisLeiteBR
Obsidian | Level 7

Hi Guys,

 

We're using DataFlux Data Management 2.6 to build a join sequence (left join) across 8 fixed-width files (see the job image attached):

 

File 1 (VINCULOS): 72 million lines (232 bytes each) - the base file
File 2 (REMUNE): 600 million lines (48 bytes each)
File 3 (DECIMO): 100 million lines (42 bytes each)
File 4 (HORAS): 190 million lines (34 bytes each)
File 5 (CONTRI): 80 million lines (60 bytes each)
File 6 (AFASTA): 8 million lines (88 bytes each)
File 7 (DESLIGA): 72 million lines (116 bytes each)
File 8 (TRABAL): 72 million lines (164 bytes each)
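
For scale, a quick back-of-the-envelope estimate of the raw input volume from the figures above (a rough sketch; totals are approximate):

```python
# Approximate raw input volume from the line counts and record widths above.
files = {
    "VINCULOS": (72_000_000, 232),
    "REMUNE":   (600_000_000, 48),
    "DECIMO":   (100_000_000, 42),
    "HORAS":    (190_000_000, 34),
    "CONTRI":   (80_000_000, 60),
    "AFASTA":   (8_000_000, 88),
    "DESLIGA":  (72_000_000, 116),
    "TRABAL":   (72_000_000, 164),
}
total_bytes = sum(lines * width for lines, width in files.values())
print(f"total input: {total_bytes / 1e9:.1f} GB")  # roughly 82 GB
```

So the eight files together hold roughly 82 GB of raw fixed-width data.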

 

Files 2 through 5 go through a sort and a Surviving Record Identification step, which ensures a maximum of 72 million lines for each of them before the join.

 

As you can see in the images attached:
- The red Expression nodes are log counters for every 1 million lines read,
- The orange nodes perform the sorting,
- The green Expression nodes create a few attributes,
- The grey Surviving Record Identification nodes use field rules based on "not null" for some attributes, with no previous output,
- The blue Data Joining nodes are left joins on a single attribute, with no memory-load option,
- The purple fixed-width output file node writes 72 million lines of 785 bytes each.

 

With that design, the job runs in 5.5 days on a DataFlux Data Management Server with 128 GB of RAM, 20 cores, and 13 TB of free disk space.

 

We're currently using the server default values for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP, and BASE/SORTTHREADS.
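
For reference, a minimal sketch of how those options might be overridden instead of left at the defaults (the file, the KEY = value syntax, and every value below are assumptions for illustration only, not settings we have tested):

```
# Illustrative only - assumed DataFlux configuration (app.cfg/macros.cfg style);
# the values are placeholders, not recommendations.
BASE/SORTBYTES   = 536870912     # sort memory, assumed to be in bytes
BASE/SORTTEMP    = /dfwork/temp  # hypothetical temp directory for sort work files
BASE/SORTTHREADS = 4             # assumed number of sort threads
JOINBYTES        = 536870912     # join memory, assumed to be in bytes
CHUNKSIZE        = 100000        # assumed chunk size; unit not verified
```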

 

So, I'd like to ask:

 

- Is this successive-join design the right approach for this situation? Would separating the joins and creating intermediate files perform better?


- Based on the server configuration, what values should I consider for JOINBYTES, CHUNKSIZE, BASE/SORTBYTES, BASE/SORTTEMP, and BASE/SORTTHREADS?


- Any suggestions to improve the performance of this job?

 

Thank you in advance for any contribution to improving the performance of this job.

 

Luis Leite


Attachments: JoinsSequence.PNG, SurvivingSample.PNG
RonAgresta
SAS Employee

Hi - here are a couple of things to consider:

  • The memory defaults for sort and join nodes are really low, and you have a lot of memory available. Make some estimates of the amount of RAM you need for your OS, subtract that from 128, and then divide the remainder across any memory options in your join, cluster, sort, and branch nodes (a rough worked example follows this list). The survivor node does not need a lot of memory unless you expect very large match clusters. Watch the units on the memory options; if I remember correctly, some are in bytes while at least one is in megabytes.
  • There is a log option you can set that will give you node profile metrics for every step in your job. Enabling it will help you determine which nodes in your job are taking the most time, and you can then focus memory there.
  • You're performing a lot of disk I/O by reading from very large files and writing out a very large file. Ideally those files are on the same machine as DM Server/Studio, but if they can't be, verify that your network configuration and disk I/O settings are optimal.
  • Because all of the data sets in your joins are relatively large, I'm not sure you'll gain much by attempting to load one side into memory. That normally works well when one side of the join is much smaller than the other.
  • Have you considered an alternate approach where you bulk load the data into a database and perform the joins there on indexed columns? It may be worth a try, even though you would need to read the data back out of the database to perform the survivorship.
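
A minimal sketch of the budgeting arithmetic from the first bullet, assuming illustrative numbers (the OS reserve and the node count are placeholders, and you should confirm each option's actual unit in the documentation):

```python
# Rough memory-budget arithmetic per the first bullet above.
# Assumptions (illustrative only): 16 GB reserved for the OS, and the
# remaining RAM split evenly across 12 memory-sensitive nodes.
TOTAL_RAM_GB = 128
OS_RESERVE_GB = 16    # assumed OS / other-process reserve
MEMORY_NODES = 12     # assumed count of join, cluster, sort, and branch nodes

per_node_gb = (TOTAL_RAM_GB - OS_RESERVE_GB) / MEMORY_NODES
per_node_bytes = int(per_node_gb * 1024**3)  # for options that take bytes
per_node_mb = int(per_node_gb * 1024)        # for options that take megabytes

print(f"per node: ~{per_node_gb:.1f} GB = {per_node_bytes} bytes = {per_node_mb} MB")
```

With those assumptions you would land around 9 GB per node; adjust the reserve and the node count to match your actual job.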

Ron

LuisLeiteBR
Obsidian | Level 7

Hi, Ron.

Thanks a lot for your guidance!

We've been using bytes as the units in the memory options. I'll check them.

The joins and sorts are the most time-consuming nodes, so I'll try some adjustments based on your advice.

 

Luis Leite
