How to sort data faster in PDI

As anyone who follows my previous blogs will know, I've been involved in a project recently involving billions of rows, where PDI is processing over a trillion records from step to step.

Part of that (massively simplified) includes one major sort. And for various reasons it's done in PDI, hey ho.

So, how do I make the sort faster?

Simple: Reduce the heap, use more temp files on disk.

Errr, what?! Typo? Increase, surely?

Nope. Here’s my setup:
Scenario 1: Sort a bunch of data. Set the heap to 48GB and monitor tmp disk space. PDI uses 77GB of temp space, and the sort takes 8 hours.
Scenario 2: Look at the above and think, ooh: don't use tmp space at all, give PDI 77+48GB of heap. Surely it'll be faster? Sort in memory, no brainer – EVERYONE is talking in-memory these days (and has been for the last 5 years). Err, no: 19 hours. OUCH.
The reason is the enormous cost of garbage collection in the second scenario (and that's with the concurrent garbage collector, too!). On a 32-CPU box I see stretches of hours where only 1 CPU is being used. Then suddenly it goes crazy, and then stops again.
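If you want to watch this happen yourself, GC logging shows it clearly. Here's a sketch of how I'd capture it – this assumes the standard spoon.sh/kitchen.sh launchers, which pick up JVM options from the PENTAHO_DI_JAVA_OPTIONS environment variable (check your launcher script), and the job filename is made up:

# Cap the heap at 48GB and log what the concurrent collector is doing
export PENTAHO_DI_JAVA_OPTIONS="-Xmx48g -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log"
./kitchen.sh -file=big_sort.kjb

Long pauses and near-constant collection in gc.log should line up with those one-CPU stretches.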
Perhaps the different code path PDI uses when sorting with tmp files results in more efficient object usage?
Now, our disk is SSD, so in scenario 1 the impact of using tmp files is not as bad as it would normally be. I had pondered setting up a ~77GB ramdisk, but I'm guessing any improvement would be very minor (I hardly ever see utilisation go up on the SSD itself).
Java 8 has some VM enhancements specifically around sorting – I wonder what it would take for PDI to start using those features? That's assuming support for Java 8 is added at all!
Happy Friday!

Building Step Plugins with PDI / Pentaho Kettle

At the last Pentaho London Usergroup we had a presentation about building plugins with PDI. You can watch it here (all PLUG presentations are published online):

https://skillsmatter.com/skillscasts/6105-building-plugins-for-pdi-kettle

Despite technical woes with the projector and the environment, the one thing that came across is that if you're a BI dev rather than a Java dev then this is pretty tricky. One issue is that there is no clear source of documentation – some of the older material is still relevant, but it's unclear whether it describes the best or "right" way. However, this is improving: as of 5.3 the PDI SDK is officially supported by Pentaho.

So yesterday I built a step plugin. As I've said before, I'm not a Java dev, but this is what I did, and maybe it will help someone. And if there is a better way, maybe someone can point that out too!

Before I start, why did I build a step plugin?

Well @rolandbouman (Pentaho Dev and js/MySQL wizard) said:

And that is all correct – it's just not actually why I'm doing it. My reason is performance, something I've been banging my head against for the last year! I have a particular calculation that uses data from 13 columns in the stream and adds a further 2 columns. I've tried various combinations of native steps, but it just doesn't perform when you use the "Calculator" and similar steps (we're talking billions of rows here, so performance is critical). So we wrote it as a UDJC and whoosh! It flew!

In a separate discussion a colleague had mentioned that a UDJC is still not as fast as a native plugin: even if you exclude the compile time at the start, there is apparently additional work to convert data between proper Java data types in PDI and data in Janino (I may not have described that exactly right). So he stated that a native plugin would be faster. I decided to try this out, as I didn't believe there would be much in it. In the end, in a simple test case, I saw approximately a 5-12% performance improvement – so it's not a panacea, but it is worth having.

Firstly, the UDJC code: you can find the transformation on github here.

The key point in this bit of code is how we get the input fields (once), which we do like so:

p6Field = get(Fields.In, "PROBABILITY_6");

And then we get the actual value, row by row, like so:

p0Field.getNumber(r)

Pretty simple.  And this ran at an average of 1,172,252 r/s over 3 runs.
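For context, here's a minimal sketch of how those pieces hang together in the UDJC's processRow(). The field names are from my example, SCORE is the output field declared on the step's Fields tab, and only 2 of the 13 probability columns are shown – the rest is the standard UDJC template:

FieldHelper p0Field, p1Field;
FieldHelper scoreField;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] r = getRow();
    if (r == null) {
        setOutputDone();
        return false;
    }
    if (first) {
        first = false;
        // Interrogate the stream structure once, not per row
        p0Field = get(Fields.In, "PROBABILITY_0");
        p1Field = get(Fields.In, "PROBABILITY_1");
        scoreField = get(Fields.Out, "SCORE");
    }
    // Make room for the extra output field, then do the per-row work
    r = createOutputRow(r, data.outputRowMeta.size());
    double score = p0Field.getNumber(r).doubleValue() * p1Field.getNumber(r).doubleValue();
    scoreField.setValue(r, score);
    putRow(data.outputRowMeta, r);
    return true;
}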

So how do we build a plugin? Well, like so:

  1. Download Eclipse or similar (there are thousands of guides to this online).
  2. Check out the dummy plugin example from SVN.
  3. Run the ant script and check it builds as-is – it should do. Fix any issues here before doing anything else.
  4. Then you have 2 options: you can start hacking that very plugin (which I did), or you can start by renaming the package first, etc.
  5. To deploy, simply copy the DummyPluginForDeploy folder into PDI's plugins/steps folder. (You'll notice it's already there actually, so you can just update the jar.) Remember to restart PDI – see the sketch after this list.
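For step 5, the build-and-deploy loop looks something like this. A sketch only: the PDI install path is made up, and I'm assuming the default ant target produces the DummyPluginForDeploy folder, as it did for me:

# Build, then drop the deploy folder into PDI's plugins/steps directory
ant
cp -r DummyPluginForDeploy /opt/pentaho/data-integration/plugins/steps/
# Restart Spoon/PDI so the plugin registry picks up the new jar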

Now, coding a plugin is a bit different to a UDJC, as you don't necessarily have the same helpers available. So I used the following approach to find the relevant field in the incoming stream (remember, you only need to do this once):

pindexs[0] = getInputRowMeta().indexOfValue( "PROBABILITY_0" );

This gives you the index of your field in the row object.

Then you can access that field like so:

score = (Double) r[pindexs[0]] * (Double) r[pindexs[1]];

Etc. Note we have a lot of hard-coding going on here: I KNOW the incoming field name, and I know it's a Double, but you could easily make those configurable as parameters/config values in your step GUI. Finally, we have to add our score to the result, and that is done like so:

putRow(data.outputRowMeta, RowDataUtil.addValueData(r, scoreindex, score));

And this is how the demo example in the SDK does it, so I’m assuming that bit is definitely right!
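To put those three fragments in context, here's roughly what the whole processRow() looks like in the plugin. This is a sketch based on the dummy plugin skeleton: the Meta/Data class names are placeholders for whatever yours are called, pindexs and scoreindex are fields on the step class, only 2 of the 13 columns are shown, and the getFields call uses the older 5-argument signature (newer PDI versions add repository/metastore arguments):

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    meta = (MyScoreMeta) smi; // your plugin's Meta class (name is made up)
    data = (MyScoreData) sdi; // your plugin's Data class (ditto)

    Object[] r = getRow(); // null means the input stream is finished
    if (r == null) {
        setOutputDone();
        return false;
    }

    if (first) {
        first = false;
        // Build the output row meta once: the input fields plus our new field
        data.outputRowMeta = getInputRowMeta().clone();
        meta.getFields(data.outputRowMeta, getStepname(), null, null, this);

        // Resolve the field indexes once, not per row
        pindexs[0] = getInputRowMeta().indexOfValue("PROBABILITY_0");
        pindexs[1] = getInputRowMeta().indexOfValue("PROBABILITY_1");
        scoreindex = getInputRowMeta().size(); // the new field lands after the existing ones
    }

    // Per-row work: hard-coded Double access as discussed above
    double score = (Double) r[pindexs[0]] * (Double) r[pindexs[1]];

    // RowDataUtil is org.pentaho.di.core.row.RowDataUtil
    putRow(data.outputRowMeta, RowDataUtil.addValueData(r, scoreindex, score));
    return true;
}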

And that’s it.  The same things that help you performance tune a Javascript step apply to a UDJC and a plugin:

  1. Remember to only do things once where possible (i.e. interrogating the stream structure).
  2. Re-use variables; don't declare new ones every row (see the sketch after this list).
  3. Don't log every row. Definitely don't do that.
  4. Only add fields to the output that you really need. While debugging you may have additional fields, but I recommend you comment these out once you have a fully working version.
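On point 2, here's a contrived but self-contained illustration of the difference. There's nothing PDI-specific about it – SimpleDateFormat just happens to be a classic per-row allocation trap:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ReusePerRow {
    // Good: create the expensive helper once and reuse it for every row.
    // Each step copy processes its rows on a single thread, so this is
    // safe even though SimpleDateFormat is not thread-safe.
    private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

    Date parseReused(String s) throws ParseException {
        return fmt.parse(s);
    }

    // Bad: a new SimpleDateFormat per row burns CPU and feeds the GC.
    Date parseNaive(String s) throws ParseException {
        return new SimpleDateFormat("yyyy-MM-dd").parse(s);
    }
}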

I suppose a final point should be made: you can very easily write code in a UDJC/plugin which performs terribly – your first attempt probably will. So you need a really solid understanding of PDI internals, and good skills with Stack Overflow. Please don't take this blog post to mean "ooh, I can re-write that step in Java and immediately solve all my problems". And don't forget: maintaining a block of code is much harder than maintaining a couple of steps.