Some years ago I worked as a dedicated performance tester on several OLTP systems, and recently I've spent quite a lot of time performance testing and tuning PDI (Pentaho Data Integration).
The one thing to remember about performance-related work is to never underestimate how long it takes. It doesn't matter what the technology is, either. This is complex stuff and takes a seriously long time. (Hence one of the reasons a performance tester is paid more than a senior tester, and is treated as a separate skill area.) Also: set an end goal. In a complex system you could tune forever, so decide on a line in the sand and stop once you've reached it.
As with any system, performance testing PDI has its own set of challenges. There are a lot of ways to go about this, but here are some notes on a quick bit of step benchmarking.

But why? Well, a colleague said to me, "why is your get variables step inline rather than a join – a join would be faster". So let's see if that's true. Actually, let's just tell you now: it's not. The inline approach is faster. How did I test this?
- Benchmark the "Generate rows" step, to prove what speed we can source and write data. Note: it is important to write the rows somewhere (e.g. a dummy step), otherwise you're not testing the step in its entirety.
- Benchmark the "Get variables" step inline. Run three times and take the average, and run for a number of rows that takes a good 30s+ so that process init time becomes negligible. To add to the work PDI has to do, make sure the variable is converted to an Integer.
- Benchmark the "Join rows" approach.
- Benchmark getting 5 variables, both ways.
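The measurement procedure above (several timed runs, averaged, reported as rows per second) can be sketched outside of PDI too. This is just an illustrative harness, not how Spoon itself measures throughput; `toy_transform` is a made-up stand-in for a transformation run:

```python
import time
from statistics import mean

def benchmark(fn, rows, runs=3):
    """Time `fn(rows)` over several runs and return the average rows/second."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(rows)
        elapsed = time.perf_counter() - start
        rates.append(rows / elapsed)
    return mean(rates)

def toy_transform(rows):
    """Hypothetical workload standing in for generate rows -> dummy step."""
    total = 0
    for i in range(rows):
        total += i  # trivial per-row work, like writing to a dummy
    return total

avg_rate = benchmark(toy_transform, 1_000_000)
print(f"{avg_rate:,.0f} rows/sec")
```

The same idea applies in PDI: long enough runs that startup cost is noise, repeated, averaged.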
And the results:
- Generate rows baseline: 2,374,000 records per second
- Get variables inline (1 variable): 1,528,000 r/s
- Join rows (1 variable): 1,316,000 r/s
- Get variables inline (5 variables): 1,491,000 r/s
- Join rows (5 variables): 1,300,000 r/s
So you can see the "join" approach (whether 1 variable or 5) only runs at roughly 86% of the speed of simply using the step inline. I'm not sure if this has always been the case, but it's true right now with PDI 5.4. My hunch is that this has nothing to do with the join per se, but is more closely related to the fact that there are more hops.
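The 86% figure follows directly from the numbers above, as a quick check shows (the one-variable case comes out at about 86%, the five-variable case at about 87%):

```python
# Throughput figures from the benchmark runs (rows/second).
inline_1var, join_1var = 1_528_000, 1_316_000
inline_5var, join_5var = 1_491_000, 1_300_000

print(f"1 variable:  join runs at {join_1var / inline_1var:.0%} of inline speed")
print(f"5 variables: join runs at {join_5var / inline_5var:.0%} of inline speed")
```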
Now clearly this is just a benchmark, and a best case. It's unlikely your incoming data arrives at 2M r/s, so changing to inline probably won't help you. But if you're CPU starved then it can't hurt either. And as always with performance testing there are a lot of variables, so what works today may not work optimally tomorrow in a different scenario!
One day we'll get support for using variables in filter rows, and variable substitution in the calculator step will work, and then there'll be far fewer get variables steps anyway!
Here’s a shot of the simple transformation, in the lovely new skin of 5.4. Also note the use of canvas notes to document the performance scenarios.