At the last Pentaho London Usergroup we had a presentation about building plugins with PDI. You can watch it here (All PLUG presentations are published online)
Despite technical woes with the projector and the environment the one thing that came across is that if you’re a BI dev rather than a Java dev then this is pretty tricky. One issue is that there is no clear source of documentation – Some of the older stuff is still relevant but it’s unclear as to whether or not this is the best or “right” way. However this is improving as of 5.3 because the PDI SDK is now officially supported by Pentaho.
So yesterday I built a step plugin. As I’ve said before, I’m not a Java dev, but this is what I did, and maybe this will help someone. And If there is a better way, maybe someone can point that out too!
Before I start, why did I build a step plugin?
Well @rolandbouman (Pentaho Dev and js/MySQL wizard) said:
And that is all correct – It’s just actually not why I’m doing it. My reason is performance. Something I’ve been banging my head against for the last year! I have a particular calculation that uses data from 13 columns in the stream and adds a further 2 columns. I’ve tried various combinations to do this in native steps but it just doesn’t perform when you use the “calculator” and other steps. (We’re talking billions of rows here so performance is critical). So we wrote it as a UDJC and whoosh! it flew!
In a separate discussion a colleague had mentioned that a UDJC was still not as fast as a native plugin. Even if you exclude the time to compile at the start there is apparently additional work to convert data between proper java data types in PDI and data in Janino. (I do not know if I have described this right) So he stated that a native plugin would be faster. I decided to try this out as I didn’t believe there would be much in it. In the end in a simple test case I saw approximately 5-12% performance improvement, so it’s not a panacea, but it is worth having.
Firstly, the UDJC code, you can find the transformation in github here
The key points in this bit of code are how we get the input fields (once) and that is like so:
p6Field = get(Fields.In, "PROBABILITY_6");
And then we get the actual value row by row like so:
Pretty simple. And this ran at an average of 1,172,252 r/s over 3 runs.
So how do we build a plugin? Well like so:
- Download eclipse or similar (There are thousands of guides to this online)
- Checkout the dummy plugin example from SVN
- Run the ant script and check it builds as-is, it should do. Fix any issues here before doing anything else
- Then you have 2 options. You can start hacking that very plugin (which I did) or you can start renaming the package first etc.
- To deploy simply copy the DummyPluginForDeploy folder into PDI in the plugins/steps folder. (You’ll notice it’s already there actually, so you can just update the jar). Remember to restart PDI.
Now; Coding a plugin is a bit different to UDJC as you don’t necessarily have the same helpers available. So; I used the following approach to find the relevant field in the incoming stream: (remember you only need to do this once)
pindexs = getInputRowMeta().indexOfValue( "PROBABILITY_0" );
This gives you the index in the row object of your field.
Then you can access that field like so:
score = (Double)r[pindexs]*(Double)r[pindexs]
Etc.. Note we have a lot of hard coding going on here. I KNOW the incoming field name, and I know it’s a Double. But you could easily configure those as parameters/config values in your step gui. Finally we have to add our score to the result and that is done as so:
putRow(data.outputRowMeta, RowDataUtil.addValueData(r, scoreindex, score));
And this is how the demo example in the SDK does it, so I’m assuming that bit is definitely right!
- Remember to only do stuff once where possible (i.e. interrogating the stream structure).
- Re-use variables, don’t declare new ones every row
- Don’t log every row. Definitely don’t do that
- Only add fields to the output you really need. While debugging you may have additional fields but I recommend you comment these out once you have a fully working version.
I suppose a final point should be made that you can very easily write code in a UDJC/Plugin which performs terribly. Your first code will do. So you need a really solid understanding of PDI internals, and good skills with stackoverflow. Please don’t take this blog post to mean, ooh i can re-write that step in java and immediately solve all my problems. And don’t forget, maintaining a block of code is much harder than a couple of steps.