How to sort data faster in PDI

As anyone who follows my previous blog posts will know, I've recently been involved in a project with billions of rows, where PDI is processing over a trillion records step to step.

(Massively simplified.) Part of that includes one major sort, and for various reasons it's done in PDI. Hey ho.

So, how do I make the sort faster?

Simple: Reduce the heap, use more temp files on disk.

Errr, what?!  Typo?  Increase, surely?

Nope. Here’s my setup:
Scenario 1: Sort a bunch of data.  Set the heap to 48GB and monitor tmp disk space.  PDI uses 77GB of temp space, and the sort takes 8 hours.
Scenario 2: Look at the above and think, ooh, don't use tmp space, give PDI 77+48GB (125GB) of heap.  Surely it'll be faster?  Sort in memory, no-brainer.  EVERYONE is talking in-memory these days (and has been for the last 5 years).  Err, no: 19 hours.  OUCH.
The reason is the enormous cost of garbage collection in the second scenario (and that's with the concurrent garbage collector too!).  On a 32-CPU box I see stretches of hours where only 1 CPU is being used.  Then suddenly it goes crazy, and then stops again.
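If you want a number for the garbage collection cost rather than eyeballing CPU graphs, the JVM exposes cumulative GC time through its standard management beans. Below is a minimal sketch, not something PDI ships: the class name `GcOverhead`, its methods and the throwaway workload in `main` are all made up for illustration. In a real run you would sample before and after the sort instead.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead {
    /** Approximate milliseconds all collectors in this JVM have spent collecting so far. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when a collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** GC time as a fraction of elapsed wall-clock time (0.0 if no time has elapsed). */
    public static double gcFraction(long gcMillis, long wallMillis) {
        if (wallMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / wallMillis;
    }

    public static void main(String[] args) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();

        // Hypothetical stand-in workload: churn through short-lived objects.
        long checksum = 0;
        for (int i = 0; i < 2_000_000; i++) {
            checksum += new int[16].length;
        }

        long wallMillis = (System.nanoTime() - start) / 1_000_000;
        long gcMillis = totalGcMillis() - gcBefore;
        System.out.printf("wall=%dms gc=%dms (%.1f%% in GC), checksum=%d%n",
                wallMillis, gcMillis, 100.0 * gcFraction(gcMillis, wallMillis), checksum);
    }
}
```

The reported times are approximate, and with a concurrent collector some of that time overlaps with application threads, so treat the percentage as a rough indicator only.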
Perhaps the different code path PDI uses when sorting with tmp files results in more efficient object usage?
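To make the tmp-file idea concrete, here is a minimal external merge sort sketch: sort small batches in memory, spill each batch to a temp file, then merge the sorted files. This is NOT PDI's actual code path. The class `SpillSort`, the `maxRowsInMemory` parameter and the one-row-per-line file format are assumptions made purely for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SpillSort {
    /** One sorted temp file plus the row currently at its head. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    /** Sorts rows (which must not contain newlines), buffering at most maxRowsInMemory while spilling. */
    public static List<String> sort(Iterator<String> rows, int maxRowsInMemory) throws IOException {
        if (maxRowsInMemory < 1) {
            throw new IllegalArgumentException("maxRowsInMemory must be >= 1");
        }
        List<Path> spills = new ArrayList<>();
        List<String> buffer = new ArrayList<>(maxRowsInMemory);
        while (rows.hasNext()) {
            buffer.add(rows.next());
            if (buffer.size() >= maxRowsInMemory) {
                spills.add(spill(buffer));
                buffer.clear(); // the whole batch becomes garbage at once
            }
        }
        if (!buffer.isEmpty()) {
            spills.add(spill(buffer));
        }

        // k-way merge: only one row per temp file is live on the heap at a time.
        PriorityQueue<Run> heads = new PriorityQueue<>();
        for (Path p : spills) {
            Run run = new Run(p);
            if (run.head != null) heads.add(run); else run.reader.close();
        }
        // Collected into a list to keep the sketch short; a real pipeline would stream rows onward.
        List<String> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            Run run = heads.poll();
            out.add(run.head);
            if (run.advance()) heads.add(run); else run.reader.close();
        }
        for (Path p : spills) {
            Files.deleteIfExists(p);
        }
        return out;
    }

    /** Sorts one batch in memory and writes it to a new temp file, one row per line. */
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("spill-sort-", ".tmp");
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String row : buffer) {
                w.write(row);
                w.newLine();
            }
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = Arrays.asList("pear", "apple", "fig", "banana", "cherry");
        System.out.println(sort(rows.iterator(), 2));
    }
}
```

The point of the shape is that each batch is small and short-lived, so the collector never has to trace a giant live object graph. Whether that is really why scenario 1 wins is still a guess.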
Now, our disk is SSD, so in scenario 1 the impact of using tmp files is not as bad as it would normally be.  I had pondered setting up a ~77GB ramdisk, but I'm guessing any improvement would be very minor.  (I hardly ever see utilisation go up on the SSD itself.)
Java 8 has some VM enhancements specifically around sorting.  I wonder what it would take for PDI to start using those features?  That's assuming support for Java 8 is added at all!
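One Java 8 addition around sorting that definitely exists is `Arrays.parallelSort`, a library method that splits an array across the common fork/join pool. Whether PDI could use it for row sorting is an open question. The sketch below only compares it with plain `Arrays.sort` on an array of long keys, and the class `ParallelSortDemo` and its helpers are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

public class ParallelSortDemo {
    /** Builds n pseudo-random keys from a fixed seed so runs are repeatable. */
    public static long[] randomKeys(int n, long seed) {
        Random random = new Random(seed);
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = random.nextLong();
        }
        return keys;
    }

    /** Returns true if the array is in non-decreasing order. */
    public static boolean isSorted(long[] a) {
        for (int i = 1; i < a.length; i++) {
            if (a[i - 1] > a[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] sequential = randomKeys(5_000_000, 42L);
        long[] parallel = sequential.clone();

        long t0 = System.nanoTime();
        Arrays.sort(sequential);         // single-threaded sort
        long t1 = System.nanoTime();
        Arrays.parallelSort(parallel);   // Java 8+: uses the common fork/join pool
        long t2 = System.nanoTime();

        System.out.printf("sort=%dms parallelSort=%dms sameResult=%b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000,
                Arrays.equals(sequential, parallel));
    }
}
```

Timings from a one-shot run like this are only indicative, since JIT warm-up and the number of cores both skew them.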
Happy Friday!
